New benchmarks show semantic code graphs helping coding agents find change locations faster and complete updates more ...
Migrated PaperBench code-only grading to run entirely on a local machine (1× node, 8× AMD MI300X), using a local SGLang-served model as the judge over an OpenAI-compatible API instead of TRAPI / ...
You can now configure and run Evals directly in the OpenAI Dashboard. Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an ...