Building Better LLM Benchmarks: Inside OpenAI's Evals Framework

Hook

The most impactful work you can do when building with LLMs isn’t prompt engineering or fine-tuning—it’s creating rigorous evaluations. Without them, you’re flying blind with every model update.

Context

As large language models evolved from research curiosities to production infrastructure, developers faced a critical challenge: how do you systematically measure whether a model update improves or degrades your specific use case? Traditional software testing falls short when outputs are probabilistic and quality is subjective. You can’t just assert that a completion equals an expected string when there are thousands of valid ways to answer a question.

OpenAI Evals emerged from this need for structured LLM evaluation. Rather than forcing teams to build bespoke testing harnesses for each use case, Evals provides a framework that separates evaluation data (stored as JSONL and managed via Git-LFS), evaluation logic (defined in YAML templates or Python classes), and execution (CLI-based runners). The framework also serves a dual purpose: while you can run evals privately for your own development, OpenAI staff actively review community-contributed evals, creating a feedback loop between real-world benchmarks and the OpenAI team.

Technical Insight

[System architecture — auto-generated diagram. Components: a CLI Runner drives the Registry System, which defines YAML Templates and JSONL Datasets (stored via Git-LFS) for the Evaluation Engine. The engine provides the evaluation patterns (Basic Match Eval, Model-Graded Eval, Completion Function Protocol), sends completions through the OpenAI API, and hands results to a Results Logger backed by Local Storage or a Snowflake DB.]

The architecture of Evals revolves around a registry system that makes creating basic evaluations surprisingly lightweight. At its simplest, you can create an eval without writing any Python—just a YAML configuration file and a JSONL dataset. The framework supports basic match-based evaluations that check if model output matches expected answers, as well as model-graded evaluations where you use an LLM to judge another LLM’s output for subjective qualities like helpfulness or reasoning quality.
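To make that concrete, here is a sketch of the two pieces involved, following the conventions the README describes (the eval name, file paths, and sample content are illustrative, not from the repository). First, a JSONL dataset where each line holds a chat-formatted input and an ideal answer:

```jsonl
{"input": [{"role": "system", "content": "Answer concisely."}, {"role": "user", "content": "What is 2+2?"}], "ideal": "4"}
{"input": [{"role": "system", "content": "Answer concisely."}, {"role": "user", "content": "Capital of France?"}], "ideal": "Paris"}
```

Second, a registry YAML entry that points the built-in match class at that dataset:

```yaml
# Hypothetical registry file, e.g. evals/registry/evals/arithmetic.yaml
arithmetic:
  id: arithmetic.dev.match-v1
  description: Checks short factual answers against exact expected strings
  metrics: [accuracy]
arithmetic.dev.match-v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: arithmetic/samples.jsonl
```

With both files in place, the framework's CLI runner would execute the eval with something like `oaieval gpt-3.5-turbo arithmetic`; note that no Python was written for this eval.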

For model-graded evals, you define evaluation criteria in YAML without custom code. The README mentions that the framework includes existing eval templates (detailed in eval-templates.md) and references the CoQA dataset implementation as an example showing different template approaches.
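A model-graded registry entry follows the same shape but swaps in the model-graded classify class and names a grading spec. The sketch below is based on the patterns used by existing model-graded evals in the repository; the exact argument names and available specs are documented in eval-templates.md, so treat this as illustrative:

```yaml
# Hypothetical model-graded eval: a grader model judges joke quality
joke-quality:
  id: joke-quality.dev.v0
  description: Uses an LLM grader to classify whether a completion is funny
  metrics: [accuracy]
joke-quality.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: joke_quality/samples.jsonl
    eval_type: cot_classify
    modelgraded_spec: humor
```

The `eval_type: cot_classify` setting asks the grader to reason step by step before emitting its classification, which tends to matter for subjective criteria.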

For complex scenarios like multi-turn conversations, prompt chains, or tool-using agents, Evals provides the Completion Function Protocol. The documentation describes this as supporting “advanced use cases” though specific implementation details would require consulting the completion-fns.md documentation.
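The core idea of the protocol is that anything callable which returns an object exposing completion strings can stand in for a direct model call. The sketch below is an illustrative stand-in, not the framework's actual classes (consult completion-fns.md for the real interfaces); the class names here are invented:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CompletionResult(Protocol):
    """Anything the eval runner can pull completion strings from."""

    def get_completions(self) -> list[str]: ...


class UppercaseEchoResult:
    """Toy result wrapper holding a single completion string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class UppercaseEchoCompletionFn:
    """Toy completion function: uppercases the prompt instead of calling a model.

    A real completion function could hide an API call, a multi-step prompt
    chain, or a tool-using agent behind this same callable interface, and the
    eval runner would treat them all identically.
    """

    def __call__(self, prompt: str, **kwargs: Any) -> UppercaseEchoResult:
        return UppercaseEchoResult(prompt.upper())
```

Because the runner only depends on the callable-plus-result contract, swapping a plain model for an agent pipeline requires no change to the eval definition itself.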

Data management deserves special attention. The registry uses Git-LFS to handle large evaluation datasets efficiently. When you clone the repo, you’re initially getting pointer files rather than full datasets. You explicitly fetch the data you need:

git lfs fetch --include=evals/registry/data/${your eval}
git lfs pull

This design prevents the repository from ballooning to gigabytes while still providing access to comprehensive benchmark datasets. The separation between data (JSONL files), configuration (YAML), and logic (Python classes) means you can version control all three independently and share evals without exposing sensitive data.

Results can be logged locally or optionally sent to a Snowflake database for centralized tracking across teams by setting the appropriate environment variables (SNOWFLAKE_ACCOUNT, SNOWFLAKE_DATABASE, SNOWFLAKE_USERNAME, SNOWFLAKE_PASSWORD). The framework appears to handle execution and metric calculation based on your eval template configuration.
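The variable names come from the README; the values below are placeholders to sketch what enabling Snowflake logging might look like:

```shell
# Placeholder credentials — substitute your own Snowflake details
export SNOWFLAKE_ACCOUNT="my-account"
export SNOWFLAKE_DATABASE="EVALS_DB"
export SNOWFLAKE_USERNAME="eval_runner"
export SNOWFLAKE_PASSWORD="********"
```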

Gotcha

The framework’s current contribution policy is its most significant limitation for community engagement: OpenAI is not accepting evals with custom code, only model-graded evaluations defined purely in YAML. This makes sense for security and maintainability on their end, but it severely limits what the community can contribute. If your evaluation requires custom logic—parsing structured outputs, interacting with external systems, or implementing domain-specific scoring—you can build it for private use (the README confirms you can create private evals representing your workflow patterns), but you can’t contribute it back to the registry.

The framework also has a confirmed bug where processes hang after completing evals. According to the README FAQ: “When I run an eval, it sometimes hangs at the very end (after the final report)” and “This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.” It’s a minor annoyance, but it breaks automated pipelines and can be confusing for new users who aren’t sure if they should wait or kill the process.

The framework requires Python 3.9 minimum and assumes you’ll be setting up an OpenAI API key via the OPENAI_API_KEY environment variable. The README explicitly reminds users to “be aware of the costs associated with using the API when running evals,” which can accumulate quickly with large evaluation suites.

Verdict

Use OpenAI Evals if you’re building applications with OpenAI models and need systematic, version-controlled evaluation that goes beyond ad-hoc testing. The framework excels when you want to contribute benchmarks that OpenAI staff might review, or when you need model-graded evaluations that assess subjective quality dimensions. It’s particularly valuable for teams that want to separate evaluation data from code, enabling non-technical stakeholders to contribute test cases via JSONL without touching Python (as the README emphasizes, “you don’t need to write any evaluation code at all” for template-based evals).

Skip it if you require extensive custom evaluation logic for community contributions (currently not accepted), or if the OpenAI API cost model doesn’t fit your evaluation budget. For quick one-off experiments, lighter-weight alternatives might serve you better, though the README does mention integration with Weights & Biases as an alternative execution environment. The framework is opinionated about architecture—YAML configs, Git-LFS data, template-based evaluation—and those opinions are assets when they align with your needs but friction when they don’t. The known hanging bug and the Python 3.9 minimum are minor considerations, but worth noting for production CI/CD pipelines.
