OpenAI Evals: Building a Declarative Framework for LLM Benchmarking
Hook
While most evaluation frameworks require you to write Python classes for every benchmark, OpenAI Evals flipped the paradigm: 90% of their evaluations are defined in YAML files that non-programmers can write, with LLMs themselves serving as judges.
Context
When OpenAI released GPT-3.5 and GPT-4, they faced a scaling problem that every AI company now wrestles with: how do you measure if your latest model is actually better? Traditional ML metrics like perplexity don't capture whether a model gives helpful answers. Human evaluation is the gold standard but doesn't scale—you can't have humans rate thousands of responses every time you tweak a prompt or release a model update. The solution was counterintuitive: use AI to evaluate AI.
OpenAI Evals emerged from this need as both an internal tool and a community contribution platform. Rather than building yet another Python testing framework where every benchmark requires custom code, they created a declarative system where evaluations live as YAML configurations paired with JSONL datasets. This architectural choice lowered the barrier for contribution: subject matter experts could create domain-specific benchmarks without writing evaluation logic. The framework has since accumulated 18,000+ GitHub stars and serves dual purposes: helping developers assess their own LLM applications and crowdsourcing diverse benchmarks that reveal model blind spots.
Technical Insight
The genius of Evals lies in its registry pattern architecture. At the core is a three-layer system: eval templates (reusable evaluation logic), eval specs (YAML configurations), and sample datasets (JSONL files stored via Git-LFS). This separation means you can create sophisticated benchmarks by composing existing components.
Here's a minimal eval specification that tests a model's ability to translate English to French:
french-translation:
  id: french-translation.dev.v0
  description: Tests basic English to French translation
  metrics: [accuracy]
french-translation.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: french-translation/samples.jsonl
The corresponding samples.jsonl file contains test cases:
{"input": [{"role": "system", "content": "Translate to French"}, {"role": "user", "content": "Hello"}], "ideal": "Bonjour"}
{"input": [{"role": "system", "content": "Translate to French"}, {"role": "user", "content": "Goodbye"}], "ideal": ["Au revoir", "Adieu"]}
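A file in this shape can be produced with the standard library. The sketch below is an illustration, not part of Evals: make_sample and write_samples are hypothetical helper names, and the only assumption about the format is what the two lines above show (a chat-style input list plus an ideal that is a string or a list of strings).

```python
import json
from pathlib import Path


def make_sample(system, user, ideal):
    """Build one sample dict: chat-style input plus one or more ideal answers."""
    if not isinstance(ideal, (str, list)):
        raise TypeError("ideal must be a string or a list of strings")
    return {
        "input": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "ideal": ideal,
    }


def write_samples(path, samples):
    """Write one JSON object per line (JSONL), keeping accented text readable."""
    with Path(path).open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


samples = [
    make_sample("Translate to French", "Hello", "Bonjour"),
    make_sample("Translate to French", "Goodbye", ["Au revoir", "Adieu"]),
]
```

Calling write_samples("samples.jsonl", samples) then produces a two-line file equivalent to the one shown above.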
The Match class checks whether the model's completion starts with the ideal answer, and it accepts a list of ideal answers when several are valid (notice "Goodbye" accepts two valid translations). You register this eval and run it against any OpenAI model:
oaieval gpt-4 french-translation
This produces structured output with accuracy metrics, token usage, and per-sample results, logged to a local JSONL file (/tmp/evallogs by default) or to Snowflake for team analysis.
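To make the matching rule concrete, here is a small standalone sketch of matching a completion against one or several reference answers and computing accuracy. It does not call Evals; matches and accuracy are hypothetical names, and the rule used here (a completion counts as correct when it starts with any ideal answer) is an assumption about the Match template that is worth checking against the template source before relying on it.

```python
def matches(completion, ideal):
    """True if the completion starts with any acceptable answer."""
    options = [ideal] if isinstance(ideal, str) else list(ideal)
    text = completion.strip()
    return any(text.startswith(option) for option in options)


def accuracy(completions, ideals):
    """Fraction of completions that match their reference answer(s)."""
    if not completions:
        return 0.0
    hits = sum(matches(c, i) for c, i in zip(completions, ideals))
    return hits / len(completions)


completions = ["Bonjour", "Adieu, mon ami", "Salut"]
ideals = ["Bonjour", ["Au revoir", "Adieu"], "Bonjour"]
score = accuracy(completions, ideals)  # two of the three completions match
```

Under this rule a verbose but correct answer still passes, while a correct answer preceded by extra words does not, which is why prompts for this kind of eval usually ask for the answer alone.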
But exact matching only works for closed-ended questions. The real innovation is model-graded evaluation. For subjective tasks—Is this summary coherent? Is this response helpful?—Evals uses a stronger model (often GPT-4) as a judge. Here's a model-graded eval for creative writing:
creative-writing-quality:
  id: creative-writing-quality.dev.v0
  description: Evaluates creativity and engagement in short stories
  metrics: [accuracy]
creative-writing-quality.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: creative-writing/samples.jsonl
    eval_type: cot_classify
    modelgraded_spec: creativity
The creativity spec defines how the judge model should evaluate responses:
creativity:
  prompt: |
    You are evaluating a short story for creativity and engagement.
    [BEGIN DATA]
    ***
    [Prompt]: {input}
    ***
    [Response]: {completion}
    ***
    [END DATA]
    Does this story demonstrate creativity? Answer Y or N.
  choice_strings:
    - "Y"
    - "N"
This pattern scales remarkably well. OpenAI used it internally to evaluate GPT-4's improvements over GPT-3.5 across thousands of scenarios. The eval registry now includes benchmarks for logic puzzles, medical question answering, code generation, and even detecting model hallucinations.
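The flow a model-graded eval follows can be sketched without any API calls: fill the template, hand it to a judge, and map the judge's reply onto one of the allowed choices. Everything below is a simplified illustration, not the library's implementation: build_judge_prompt, parse_choice, and the stub fake_judge are hypothetical, and taking the last non-empty line as the verdict is an assumption about how a reason-then-answer reply might be parsed.

```python
JUDGE_TEMPLATE = """You are evaluating a short story for creativity and engagement.
[BEGIN DATA]
***
[Prompt]: {input}
***
[Response]: {completion}
***
[END DATA]
Does this story demonstrate creativity? Answer Y or N."""

CHOICE_STRINGS = ["Y", "N"]


def build_judge_prompt(task_input, completion):
    """Substitute the task input and the model's completion into the template."""
    return JUDGE_TEMPLATE.format(input=task_input, completion=completion)


def parse_choice(judge_reply, choices=CHOICE_STRINGS):
    """Return the verdict found on the last non-empty line, or None if invalid."""
    lines = [line.strip() for line in judge_reply.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1].strip(".").strip()
    return last if last in choices else None


def fake_judge(prompt):
    """Stand-in for a call to the judge model (hypothetical, always approves)."""
    return "The story subverts the expected ending.\nY"


prompt = build_judge_prompt(
    "Write a story about a lighthouse.", "The lamp refused to shine..."
)
verdict = parse_choice(fake_judge(prompt))
```

Returning None for replies that match no choice keeps malformed judge output visible instead of silently counting it as a pass or a fail.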
For advanced use cases beyond simple prompt-response pairs, Evals implements a Completion Function Protocol (CFP). This abstraction lets you evaluate multi-turn agents, retrieval-augmented generation systems, or tool-using workflows. You implement a CompletionFn that wraps your system:
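from evals.api import CompletionFn, CompletionResult
from evals.completion_fns.openai import OpenAIChatCompletionFn


class RAGCompletionFn(CompletionFn):
    def __init__(self, retriever, model="gpt-4", **kwargs):
        # retriever: any object with a search(query) method returning text
        self.retriever = retriever
        # Reuse the built-in chat completion function for the model call
        self.model_fn = OpenAIChatCompletionFn(model=model)

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        # Retrieve context for the latest message (assumes a chat-format prompt)
        context = self.retriever.search(prompt[-1]["content"])
        # Augment prompt with context
        augmented_prompt = prompt + [
            {"role": "system", "content": f"Context: {context}"}
        ]
        # Call underlying model; the result exposes get_completions()
        return self.model_fn(augmented_prompt, **kwargs)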
This lets you benchmark your entire RAG pipeline—not just the language model—using the same eval infrastructure.
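The same wrapping idea can be exercised offline with stand-ins for both the retriever and the model. In this sketch every class is hypothetical (SimpleResult, OverlapRetriever, EchoModel, RagPipeline), and the only thing borrowed from the protocol is the shape assumed here: a callable that takes a prompt and returns an object with a get_completions() method.

```python
class SimpleResult:
    """Minimal result object exposing completions as a list of strings."""

    def __init__(self, text):
        self.text = text

    def get_completions(self):
        return [self.text]


class OverlapRetriever:
    """Toy retriever: returns the document sharing the most words with the query."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, query):
        words = set(query.lower().split())
        return max(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            default="",
        )


class EchoModel:
    """Stand-in for a model call: reports the last message it was given."""

    def __call__(self, prompt, **kwargs):
        return SimpleResult(f"Answered using [{prompt[-1]['content']}]")


class RagPipeline:
    """Callable that retrieves context, augments the prompt, and calls the model."""

    def __init__(self, retriever, model):
        self.retriever = retriever
        self.model = model

    def __call__(self, prompt, **kwargs):
        query = prompt[-1]["content"]  # most recent message drives retrieval
        context = self.retriever.search(query)
        augmented = prompt + [{"role": "system", "content": f"Context: {context}"}]
        return self.model(augmented, **kwargs)


pipeline = RagPipeline(
    OverlapRetriever(["Oslo is in Norway.", "Paris is the capital of France."]),
    EchoModel(),
)
result = pipeline([{"role": "user", "content": "What is the capital of France?"}])
```

Because the pipeline only depends on a search method and a callable model, either stand-in can be swapped for a real component without touching the wrapper.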
The data storage strategy deserves attention too. Evaluation datasets often contain thousands of examples with lengthy text, which would bloat the git repository. Evals uses Git-LFS (Large File Storage) to version these JSONL files separately. The repository itself stores only small pointer files; the actual data is downloaded by Git LFS, either automatically at checkout when LFS is installed or explicitly with git lfs pull. This keeps the repo snappy while maintaining versioning guarantees.
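One practical consequence: if the data has not been pulled, a samples file on disk is a small text pointer rather than real data, and parsing it fails in confusing ways. A quick check, sketched below, is to look for the pointer header. The name is_lfs_pointer is a hypothetical helper; the header string is the one Git LFS writes at the top of its pointer files.

```python
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file starts with the Git LFS pointer header."""
    with Path(path).open("rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER
```

If the check returns True for a dataset file, running git lfs pull in the repository replaces the pointer with the actual content.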
Gotcha
The framework's tight coupling to OpenAI's API is both a feature and a limitation. Every eval run costs money in API calls, and you need internet connectivity. If you're evaluating frequently during development or running large benchmark suites, costs can escalate quickly. Out of the box there is no support for local models or other providers like Anthropic or Cohere; reaching them means writing your own adapter code, for example a custom completion function.
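A back-of-the-envelope estimate helps before launching a large run. The function below is a sketch with made-up inputs: estimate_run_cost is a hypothetical helper, and the token count and per-token price are placeholders, not actual pricing. A model-graded eval needs at least two calls per sample (one for the completion, one for the judge), which calls_per_sample accounts for.

```python
def estimate_run_cost(n_samples, tokens_per_call, usd_per_1k_tokens, calls_per_sample=1):
    """Rough cost in USD: total tokens across all calls times the per-1k price."""
    total_tokens = n_samples * calls_per_sample * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Placeholder numbers for illustration only: 1,000 samples, 500 tokens per call,
# a notional price of 0.01 USD per 1k tokens, and two calls per sample.
estimate = estimate_run_cost(1000, 500, 0.01, calls_per_sample=2)
```

Multiplying the result by the number of runs per day during prompt iteration gives a quick sense of whether a smaller development subset of the samples is worth maintaining.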
OpenAI also stopped accepting custom evaluation code contributions—only template-based and model-graded evals are allowed now. This quality control measure makes sense (they don't want to review arbitrary Python code), but it limits the framework's extensibility for public contributions. If your evaluation logic doesn't fit their templates, you'll need to fork the repo or maintain private evals. There's also a documented issue where eval runs occasionally hang at completion, requiring you to manually kill the process. The data is saved, but it's an annoying papercut in the developer experience.
Verdict
Use Evals if you're building on OpenAI's models and need systematic performance tracking across versions, want to leverage model-graded evaluation without building that infrastructure yourself, or value a declarative approach where product managers and domain experts can create benchmarks without writing code. It's particularly powerful for teams iterating on prompts and system messages who need quantitative evidence of improvements.
Skip it if you need model-agnostic evaluation across multiple providers, require evaluations that run locally without API costs, need to implement novel evaluation paradigms not covered by the templates, or are bothered by coupling your evaluation infrastructure to a single vendor's API. For those cases, consider Promptfoo for lightweight CLI-first evaluation or LangSmith for broader LLM application observability with provider flexibility.