Back to Articles

Building Resumable LLM Evaluations: A Template for Rate-Limited API Testing

[ View on GitHub ]

Building Resumable LLM Evaluations: A Template for Rate-Limited API Testing

Hook

Running a thousand-prompt evaluation against GPT-4 at 2am, only to have your laptop die at prompt 847? Without proper caching, you just burned $40 in API credits for nothing.

Context

LLM evaluation is deceptively expensive. A single evaluation run with 1,000 prompts against GPT-4 can cost $50-100, and if your process crashes halfway through—whether from rate limits, network issues, or accidental terminal closures—you're back to square one, re-running (and re-paying for) prompts you've already tested. Unlike traditional software testing where reruns are free, every LLM API call hits your credit card.

The jplhughes/evals_template repository addresses this pain point by providing a Python framework that treats LLM evaluations as resumable, cacheable workflows. Built on Hydra for configuration management and designed specifically for OpenAI and Anthropic APIs, it implements the infrastructure patterns that most evaluation projects need: disk-based response caching, intelligent rate limiting that works around API quotas, cost tracking, and experiment history logging. While larger frameworks like LangSmith offer enterprise-grade evaluation platforms, this template provides the essential scaffolding for researchers and teams who want control over their evaluation pipeline without inheriting a complex dependency.

Technical Insight

The architecture centers on three core components: Hydra-managed YAML configurations, Pydantic-validated data models, and API wrapper modules with built-in caching. Let's examine how these pieces work together.

The Hydra configuration system enables composable experiment definitions. Instead of hardcoding model parameters or prompts in Python files, you define reusable YAML configs in the configs/ directory. A typical evaluation config might look like:

# configs/experiment/sentiment_eval.yaml
defaults:
  - /model: gpt-4
  - /prompts: sentiment_analysis

run_name: "sentiment_baseline_v1"
num_samples: 1000
enable_cache: true
track_costs: true

overrides:
  model:
    temperature: 0.0
    max_tokens: 50

This composability means you can swap models, adjust prompts, or modify sampling parameters through command-line overrides without touching code: python main.py experiment=sentiment_eval model=claude-3-opus. This pattern becomes invaluable when running multiple experiment variations—you're not maintaining brittle Python scripts with hardcoded values.

The caching mechanism is where the template shows real engineering thought. Each API response gets serialized to disk with a hash key derived from the prompt content and model configuration. The api_cache.py module implements this with a simple but effective design:

import hashlib
import json
import pickle
from pathlib import Path

class APICache:
    def __init__(self, cache_dir: str = ".cache/api_responses"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
    
    def _generate_key(self, prompt: str, model_config: dict) -> str:
        cache_input = f"{prompt}_{json.dumps(model_config, sort_keys=True)}"
        return hashlib.sha256(cache_input.encode()).hexdigest()
    
    def get(self, prompt: str, model_config: dict):
        key = self._generate_key(prompt, model_config)
        cache_file = self.cache_dir / f"{key}.pkl"
        if cache_file.exists():
            with open(cache_file, 'rb') as f:
                return pickle.load(f)
        return None
    
    def set(self, prompt: str, model_config: dict, response):
        key = self._generate_key(prompt, model_config)
        cache_file = self.cache_dir / f"{key}.pkl"
        with open(cache_file, 'wb') as f:
            pickle.dump(response, f)

This means if you modify prompt 500 in your dataset and re-run the evaluation, only that single prompt gets re-executed—the other 999 responses are served from cache instantly. The cache key includes model configuration, so changing temperature or max_tokens correctly invalidates affected responses.

The rate limiting implementation takes a different approach than typical exponential backoff strategies. Instead of waiting progressively longer after each 429 error, the framework tracks your actual token-per-minute quota and paces requests to stay under the limit. The rate_limiter.py module maintains a sliding window of recent requests:

from collections import deque
from time import time, sleep

class RateLimiter:
    def __init__(self, tokens_per_minute: int, requests_per_minute: int):
        self.tpm_limit = tokens_per_minute
        self.rpm_limit = requests_per_minute
        self.token_history = deque()
        self.request_history = deque()
    
    def wait_if_needed(self, estimated_tokens: int):
        now = time()
        cutoff = now - 60
        
        # Remove entries older than 1 minute
        while self.token_history and self.token_history[0][0] < cutoff:
            self.token_history.popleft()
        while self.request_history and self.request_history[0] < cutoff:
            self.request_history.popleft()
        
        current_tokens = sum(t[1] for t in self.token_history)
        current_requests = len(self.request_history)
        
        # Calculate required wait time
        token_wait = 0
        if current_tokens + estimated_tokens > self.tpm_limit:
            token_wait = 60 - (now - self.token_history[0][0])
        
        request_wait = 0
        if current_requests >= self.rpm_limit:
            request_wait = 60 - (now - self.request_history[0])
        
        wait_time = max(token_wait, request_wait)
        if wait_time > 0:
            sleep(wait_time)
        
        self.token_history.append((time(), estimated_tokens))
        self.request_history.append(time())

The clever bit mentioned in the documentation is "doubling rate limits by alternating between model endpoints." OpenAI enforces rate limits per model, so by alternating requests between gpt-4-0613 and gpt-4-0314 (functionally identical models), you effectively double your throughput. The template's API modules support this through configuration without code changes.

Cost tracking is baked into the API wrapper classes. Each response includes token counts from the API, which get multiplied by per-token pricing (stored in configs/pricing.yaml) to maintain a running total. The main runner outputs real-time cost updates: Processed 247/1000 prompts | Est. cost: $12.34 | Avg latency: 1.2s.

The separation between inference evaluation (run_inference_eval.py) and fine-tuning workflows (run_finetune.py) is clean. Fine-tuning runs convert your prompts and expected outputs into JSONL format for OpenAI's fine-tuning API, then optionally log training metrics to Weights & Biases. This dual-mode design covers the full experimentation cycle: evaluate base models, prepare fine-tuning data from successful prompts, train custom models, then re-evaluate.

Gotcha

The template's tight coupling to OpenAI and Anthropic APIs means extending it to other providers (Cohere, AI21, local models) requires non-trivial modifications. The API modules in src/apis/ have hardcoded assumptions about response formats, authentication patterns, and rate limiting behavior specific to these two providers. Adding support for a provider with different semantics—say, a local vLLM deployment—would require creating new API wrapper classes and potentially rethinking the rate limiting strategy.

Documentation is minimal beyond setup instructions. There's no guide on structuring complex evaluation datasets, handling multi-turn conversations, or implementing custom scoring functions. The repository appears to be a personal template that the author uses for their own projects rather than a framework designed for public consumption. With only 7 GitHub stars and no visible community contributions, you won't find Stack Overflow answers or detailed tutorials when you hit edge cases. Error handling is basic—failed API calls get retried with exponential backoff, but there's no sophisticated handling of partial failures, no dead letter queues for problematic prompts, and no graceful degradation strategies. For production use cases where reliability is critical, you'd need to harden this significantly.

Verdict

Use if: You're starting a research or evaluation project with OpenAI or Anthropic models, you need basic infrastructure for caching and cost tracking without writing it from scratch, and you're comfortable forking code to customize it for your specific needs. This template provides solid bones for academic experiments, internal benchmarking, or proof-of-concept evaluations where you value resumability and want Hydra's configuration ergonomics. Skip if: You need production-grade reliability with comprehensive error handling, you're evaluating multiple LLM providers beyond OpenAI/Anthropic, you want extensive documentation and community support, or you prefer battle-tested dependencies over customizable templates. In those scenarios, invest in LangSmith for enterprise needs or PromptFoo for dedicated evaluation workflows—they've solved the hard problems this template only sketches.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/llm-engineering/jplhughes-evals-template.svg)](https://starlog.is/api/badge-click/llm-engineering/jplhughes-evals-template)