Building an LLM Evaluation Framework That Won't Burn Your API Budget

Hook

Running a full MMLU evaluation against GPT-4 can cost hundreds of dollars and take hours. Interrupt it halfway through, and you’ve just burned cash for nothing—unless you’ve architected your evaluation framework with resumability from day one.

Context

LLM evaluation is expensive, both in time and money. If you’re iterating on prompts, comparing models, or running benchmark suites like MMLU, you’ll quickly discover that API-based evaluation has a fundamental problem: it’s stateless. Every time you restart an evaluation run—whether because of a rate limit, a bug in your evaluation logic, or simply wanting to add more test cases—you’re back to zero, re-prompting the same questions and re-paying for the same tokens.

Most teams either write brittle one-off scripts or jump straight to heavyweight frameworks like EleutherAI’s lm-evaluation-harness, which is excellent for local models but overkill if you’re primarily working with OpenAI and Anthropic APIs. The jplhughes/evals_template repository fills this gap: it’s an opinionated Python template that treats API calls as expensive operations requiring caching, provides Hydra-based configuration for composable experiments, and includes rate limit optimizations that maximize throughput without getting throttled. It’s the evaluation framework you’d build yourself after burning through a few hundred dollars in wasted API calls.

Technical Insight

[System architecture (auto-generated diagram): a Hydra config YAML (model/prompt/dataset) feeds the evaluation runner, which performs a cache lookup by hashing prompt + params. A hit serves the cached response; a miss goes through the rate limit manager (model rotation) to the LLM APIs (OpenAI/Anthropic). Supporting components: cache storage (SQLite/JSONL), the experiment dir (configs/prompts/history), results & metrics, and W&B logging for fine-tuning.]

The architecture centers on three key components: a Hydra configuration system for experiment management, a caching layer that persists API responses to disk, and a rate limit manager that rotates through model variants to maximize throughput.

The Hydra integration is particularly elegant. Rather than hardcoding model names and prompts in Python, you define composable YAML configs. Want to compare GPT-4 against Claude? Just override the model config. Testing prompt variations? Swap the prompt config. Here’s how the configuration structure looks:

# config/config.yaml
defaults:
  - model: gpt-3.5-turbo
  - prompt: zero_shot
  - dataset: mmlu

# Run with: python run_eval.py model=gpt-4 prompt=few_shot
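Each entry in the defaults list resolves to its own small YAML file. A hypothetical config/model/gpt-4.yaml might look like the following; the field names are illustrative, not the template's actual schema:

```yaml
# config/model/gpt-4.yaml (illustrative; field names are assumptions)
name: gpt-4
temperature: 0.0
max_tokens: 512
```

Overriding model=gpt-4 on the command line simply swaps which of these leaf files Hydra composes into the final config.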

The caching mechanism is where things get interesting. Every API call is hashed based on the prompt content and model parameters, then stored in a local SQLite database (or simple JSON files). When you restart an evaluation, the framework checks the cache first. This means you can interrupt a 10,000-question MMLU run after 3,000 completions, fix a bug in your scoring logic, and resume without re-paying for those first 3,000 API calls. The implementation wraps the API client in a thin caching class:

import hashlib
import json
from pathlib import Path

class CachedLLMClient:
    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_file = cache_dir / "responses.jsonl"
        self.cache = {}
        self._load_cache()

    def _load_cache(self):
        # Replay the append-only JSONL log into an in-memory dict.
        if self.cache_file.exists():
            with self.cache_file.open() as f:
                for line in f:
                    entry = json.loads(line)
                    self.cache[entry["key"]] = entry["response"]

    def _write_cache(self, key: str, response: str):
        # Append-only writes survive interrupted runs.
        with self.cache_file.open("a") as f:
            f.write(json.dumps({"key": key, "response": response}) + "\n")

    def _cache_key(self, prompt: str, model: str, temperature: float) -> str:
        content = f"{prompt}|{model}|{temperature}"
        return hashlib.sha256(content.encode()).hexdigest()

    def complete(self, prompt: str, model: str, **kwargs) -> str:
        cache_key = self._cache_key(prompt, model, kwargs.get("temperature", 0.0))
        if cache_key in self.cache:
            return self.cache[cache_key]
        # _api_call is the underlying provider request (OpenAI/Anthropic).
        response = self._api_call(prompt, model, **kwargs)
        self.cache[cache_key] = response
        self._write_cache(cache_key, response)
        return response

The rate limit optimization is clever. OpenAI enforces rate limits per model, but gpt-3.5-turbo, gpt-3.5-turbo-0125, and gpt-3.5-turbo-1106 are treated as separate models with independent rate limits. The framework lets you specify a list of model variants, then round-robins requests across them to effectively multiply your throughput. Combined with configurable thread pools, you can saturate your rate limits without triggering 429 errors.
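A minimal sketch of the rotation idea, assuming a call_api(prompt, model) callable standing in for the real client (this is not the template's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Variants with independent rate limits, per the behavior described above.
MODEL_VARIANTS = ["gpt-3.5-turbo", "gpt-3.5-turbo-0125", "gpt-3.5-turbo-1106"]

def run_eval(prompts, call_api, max_workers=8):
    """Round-robin prompts across model variants using a thread pool.

    call_api(prompt, model) is a placeholder for the real client call.
    """
    variants = cycle(MODEL_VARIANTS)
    jobs = [(prompt, next(variants)) for prompt in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results align with prompts.
        return list(pool.map(lambda job: call_api(*job), jobs))
```

Because each variant counts against its own quota, three variants roughly triple the sustainable request rate before 429s appear.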

The framework also includes built-in cost tracking. Every API response logs token counts and calculates costs based on current pricing, giving you real-time visibility into how much each experiment costs. This is invaluable when you’re comparing whether a GPT-4 run is worth 10x the cost of GPT-3.5-turbo for your specific use case.
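The calculation itself is simple arithmetic over the token counts the API returns. A sketch, with placeholder prices rather than live pricing:

```python
# Per-million-token prices in USD; illustrative placeholders, not live pricing.
PRICE_PER_MTOK = {
    "gpt-4": {"input": 30.00, "output": 60.00},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one call, from the token counts in the API response."""
    p = PRICE_PER_MTOK[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000
```

Summing this per call as responses arrive is what makes the "is GPT-4 worth 10x" comparison concrete for your own workload.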

For teams doing fine-tuning, there’s an integrated pipeline that uploads training data to OpenAI’s fine-tuning API and logs metrics to Weights & Biases. Helper CLI tools handle the annoying parts like managing remote training files and monitoring job status. The separation between evaluation and fine-tuning workflows is clean—each lives in its own module with shared utilities for data formatting and API interaction.

One underappreciated feature is the human-readable prompt history. Beyond the cache, the framework logs every prompt and response pair to timestamped text files. When your evaluation shows unexpected results, you can grep through these logs to find the exact prompts that caused problems, something that’s surprisingly difficult in frameworks that only expose aggregate metrics.
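A minimal sketch of that kind of logging; the file naming scheme here is my own invention, not the template's:

```python
import datetime
from pathlib import Path

def log_interaction(history_dir: Path, prompt: str, response: str) -> Path:
    """Append a grep-friendly prompt/response pair to a dated log file."""
    history_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d")
    path = history_dir / f"prompt_history_{stamp}.txt"
    with path.open("a") as f:
        f.write(f"--- {datetime.datetime.now().isoformat()} ---\n")
        f.write(f"PROMPT:\n{prompt}\n")
        f.write(f"RESPONSE:\n{response}\n\n")
    return path
```

Plain text files trade storage efficiency for the ability to run grep or less directly against them when debugging.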

Gotcha

The most significant limitation is provider support. This template is built exclusively for OpenAI and Anthropic APIs. If you need Azure OpenAI, Cohere, Google’s PaLM, or local models via Ollama or vLLM, you’ll need to write custom client wrappers. The abstraction layer exists (BaseLLMClient), but only two implementations are provided.
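A rough sketch of what such a wrapper might look like, here for a local Ollama server. The method names on BaseLLMClient are assumptions, so check the template's actual interface before subclassing:

```python
import json
import urllib.request

class BaseLLMClient:
    """Stand-in for the template's abstract base; the real interface may differ."""
    def complete(self, prompt: str, **kwargs) -> str:
        raise NotImplementedError

class OllamaClient(BaseLLMClient):
    """Hypothetical wrapper for a local Ollama server."""
    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def complete(self, prompt: str, **kwargs) -> str:
        # Ollama's /api/generate endpoint; stream=False returns one JSON object.
        payload = json.dumps(
            {"model": self.model, "prompt": prompt, "stream": False}
        ).encode()
        req = urllib.request.Request(
            f"{self.base_url}/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]
```

Because the caching layer keys on prompt and parameters rather than provider, a wrapper like this would inherit resumability for free.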

Dataset support is similarly limited. MMLU is the primary built-in evaluation dataset. While you can add custom datasets by implementing the data loader interface, there’s minimal documentation on how to do this. If you need HellaSwag, TruthfulQA, or domain-specific benchmarks, expect to spend time reading the existing MMLU loader and reverse-engineering the expected format. The framework doesn’t include common evaluation metrics beyond accuracy—if you need BLEU, ROUGE, or semantic similarity scores, you’re implementing those yourself.

The Hydra dependency, while powerful, has a learning curve. If your team isn’t familiar with Hydra’s override syntax and config composition, there will be initial friction. The template also lacks extensive documentation on extending the framework—the README covers basic usage, but advanced customization requires reading the source code.

Verdict

Use if: You’re running API-based evaluations primarily against OpenAI or Anthropic, you need to iterate quickly on prompts and models without burning API credits on redundant calls, you value experiment reproducibility and want configuration-driven workflows, or you’re comfortable with Python and prefer opinionated templates over building from scratch. This is particularly strong for teams doing cost-sensitive research or production prompt testing where interrupted runs are common.

Skip if: You need support for local models or additional API providers beyond OpenAI and Anthropic, you require extensive built-in evaluation datasets and metrics (lm-evaluation-harness is better here), you want a no-code solution or prefer GUI-based tools, or you need a mature framework with comprehensive documentation and a large community. Also skip if you’re doing one-off evaluations where caching and configuration management are overkill—in that case, simple scripts or OpenAI’s simple-evals are more appropriate.
