ModelRegression: Building a Daily LLM Benchmark That Tests What Developers Actually Use

Hook

When OpenAI deprecated its Codex models in March 2023, developers lost access mid-project with very little notice. ModelRegression exists because vendors won't tell you when their models get worse, so someone had to build the receipts.

Context

If you've used ChatGPT or Claude for coding over the past year, you've probably had that unsettling moment: the model that nailed a complex refactoring last month now fumbles basic syntax. You mention it to colleagues, they nod knowingly, but when you check the vendor's changelog? Nothing. 'No changes were made to the model,' the support ticket reads. You're left wondering if you're experiencing a statistical anomaly, confirmation bias, or actual regression.

This is the silent degradation problem that plagues production AI systems. Unlike traditional software where version numbers and changelogs document every shift, LLM providers can (and do) swap models behind stable API endpoints without disclosure. Sometimes it's A/B testing. Sometimes it's cost optimization. Sometimes it's safety filtering gone overboard. ModelRegression emerged from developer Dave's frustration with this opacity: a Python-based benchmark orchestrator that runs daily stress tests against frontier models and generates historical performance dashboards. The twist? It doesn't test APIs—it tests the CLI tools developers actually install and use.

Technical Insight

ModelRegression's architecture makes a deliberate, controversial choice: it invokes vendor CLI tools (claude, codex, agent) as subprocess calls rather than hitting REST APIs directly. This means the benchmark captures the full developer experience—authentication flows, rate limiting, CLI bugs, even the latency of piping prompts through shell environments.

The orchestrator runs 30 handcrafted scenarios across 10 categories, including long reasoning chains, bug detection, security awareness, and code refactoring. Each test feeds a prompt to the CLI, captures stdout, then scores the response. Here's the simplified scoring flow:

import subprocess
import json
from datetime import datetime, timezone

def run_test(model_cli, test_prompt, expected_behavior):
    """Execute one test via the vendor CLI and return a scored result"""
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    timestamp = datetime.now(timezone.utc).isoformat()
    try:
        result = subprocess.run(
            # Simplified: each real CLI takes prompts through its own flags or subcommands
            [model_cli, '--prompt', test_prompt],
            capture_output=True,
            text=True,
            timeout=60
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "score": 0, "timestamp": timestamp}

    # Any non-zero exit code is recorded as an outage, not as a bad answer
    if result.returncode != 0:
        return {"status": "outage", "score": 0, "timestamp": timestamp}

    # LLM-as-judge: Use Claude Sonnet to score response
    score = evaluate_with_llm(
        response=result.stdout,
        expected=expected_behavior
    )

    return {"status": "success", "score": score, "timestamp": timestamp}

def evaluate_with_llm(response, expected):
    """Score response using Claude Sonnet as judge"""
    judge_prompt = f"""
    Evaluate this code generation response on a 0-100 scale:

    Expected: {expected}
    Actual: {response}

    Return only a number.
    """
    # Hit Claude API for scoring (recursive irony here).
    # call_claude_api is a helper that is not shown in this excerpt.
    judge_score = call_claude_api(judge_prompt)
    # Raises ValueError if the judge returns anything other than a bare number
    return float(judge_score)
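
One fragile spot in the flow above is the final float() call: judges do not always obey "Return only a number", and a reply like "Score: 85" would raise. The sketch below shows one possible way to tolerate that. The parse_judge_score name and the clamping policy are assumptions for illustration, not part of ModelRegression.

```python
import re

def parse_judge_score(raw):
    """Pull the first number out of a judge reply and clamp it to 0-100.

    Hypothetical helper: returns None when the reply contains no number,
    so the caller can decide whether to retry or record a failed judgment.
    Caveat: it takes the FIRST number, so a reply such as
    "Out of 100, I'd give 80" would be read as 100.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    score = float(match.group())
    # Clamp so a stray out-of-range value cannot skew the averages.
    return min(100.0, max(0.0, score))
```

Returning None instead of 0 keeps a failed judgment from being mistaken for a genuinely bad response when scores are averaged later.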

The cron job runs daily at 3am, accumulating results into SQLite with schema (model, test_id, score, timestamp). The regression detector compares rolling 7-day windows against historical baselines:

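def detect_regression(model_name, category):
    """Compare the last 7 days against the 30 days before them"""
    recent_avg = get_avg_score(model_name, category, days=7)
    baseline_avg = get_avg_score(model_name, category, days=30, offset=7)

    # With no data in a window, or a zero baseline, the percentage change
    # is undefined, so report that instead of dividing by zero.
    if recent_avg is None or not baseline_avg:
        return {"severity": "unknown", "drop": None}

    # A negative delta means the recent window scored below the baseline
    delta = ((recent_avg - baseline_avg) / baseline_avg) * 100

    if delta < -20:
        return {"severity": "critical", "drop": delta}
    elif delta < -10:
        return {"severity": "warning", "drop": delta}
    elif delta < -5:
        return {"severity": "minor", "drop": delta}
    else:
        return {"severity": "normal", "drop": delta}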
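
The get_avg_score helper is not shown in the article. As a rough sketch of what the windowed query might look like against the (model, test_id, score, timestamp) schema, the version below makes several assumptions: the table is named results, a second hypothetical table tests(test_id, category) maps tests to categories (the stated schema has no category column), timestamps are stored as UTC ISO-8601 strings, and the connection and current time are passed in explicitly to make the function testable.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def get_avg_score(conn, model_name, category, days, offset=0, now=None):
    """Average score for a model/category over a window of `days` days
    that ends `offset` days before `now`. Returns None for an empty window.

    Assumed tables (not confirmed by the source):
      results(model, test_id, score, timestamp)  -- timestamp as UTC ISO-8601 text
      tests(test_id, category)                   -- hypothetical category lookup
    """
    now = now or datetime.now(timezone.utc)
    end = now - timedelta(days=offset)
    start = end - timedelta(days=days)
    row = conn.execute(
        """
        SELECT AVG(r.score)
        FROM results AS r
        JOIN tests AS t ON t.test_id = r.test_id
        WHERE r.model = ? AND t.category = ?
          AND r.timestamp >= ? AND r.timestamp < ?
        """,
        # ISO-8601 strings in one timezone sort chronologically,
        # so plain string comparison is enough here.
        (model_name, category, start.isoformat(), end.isoformat()),
    ).fetchone()
    return row[0]  # AVG() yields NULL (None) when no rows match
```

With days=7 this covers the most recent week, and with days=30, offset=7 it covers the 30 days before that, so the two windows never overlap.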

Once scoring completes, the pipeline exports to static JSON files that feed a Next.js 15 frontend. The frontend is pure static generation—no runtime API calls, just pre-rendered charts from results.json. This architectural choice means the entire dashboard can be served from a CDN with zero backend infrastructure.
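
To make the export step concrete, here is a minimal sketch of dumping the SQLite rows into a results.json file that a static frontend could read. The export_results function, the results table name, and the flat list-of-objects layout are assumptions; the article does not describe the actual file format.

```python
import json
import sqlite3
from pathlib import Path

def export_results(conn, out_path):
    """Write every stored score to a JSON file and return the row count.

    Assumes a table results(model, test_id, score, timestamp); the real
    project's table name and JSON layout may differ.
    """
    columns = ("model", "test_id", "score", "timestamp")
    rows = conn.execute(
        "SELECT model, test_id, score, timestamp FROM results "
        "ORDER BY timestamp, model, test_id"
    ).fetchall()
    # One JSON object per row, oldest first, so charts can plot in order.
    payload = [dict(zip(columns, row)) for row in rows]
    Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return len(payload)
```

Because the file is regenerated in full on every run, the frontend build never needs database access, which is what allows the dashboard to be served as static files.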

The health check system runs pre-flight tests before each benchmark cycle:

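def preflight_health_check(model_cli):
    """Verify CLI is responsive before running full suite"""
    try:
        test = subprocess.run(
            [model_cli, '--version'],
            capture_output=True,
            timeout=10
        )
        return test.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # Hung CLI, or binary missing / not executable (FileNotFoundError is an OSError).
        # A bare "except:" would also swallow KeyboardInterrupt and real bugs.
        log_outage(model_cli)
        return False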

This catches complete CLI failures (authentication expired, binary missing, vendor outage) but won't detect subtle behavior changes that still return exit code 0.

The most philosophically interesting decision is LLM-as-judge evaluation. For subjective tests like 'does this code follow security best practices,' there's no deterministic evaluator. So ModelRegression uses Claude Sonnet to score responses on 0-100 scales. This creates recursive irony: if the judge model regresses, all future scores shift, contaminating historical comparisons. A frozen judge model would solve this but loses the benefit of improved evaluation as models get smarter. It's a methodological tradeoff with no clean solution.

Gotcha

The CLI-based architecture is simultaneously ModelRegression's superpower and its Achilles heel. Testing through CLI tools captures real developer workflows: you're benchmarking the same claude binary you'd install via Homebrew, complete with authentication quirks and version-specific bugs. But this creates brittle dependencies on vendor decisions outside your control. When OpenAI retired its Codex models in 2023, tooling built on them stopped working on short notice; a retired or renamed CLI would break every test targeting it in the same way. If Anthropic changes their CLI auth flow, your cron job starts failing with cryptic errors until you manually update credentials.

The bigger statistical elephant in the room: regression detection uses simple percentage thresholds (5%/10%/20%) without any statistical significance testing. LLMs are stochastic, so sampling variance alone could cause a 7% score drop that triggers a 'minor' flag despite being pure noise. A proper implementation would run multiple samples per test, calculate confidence intervals, and only flag drops that remain significant at the 95% confidence level. As it stands, you'll get false positives from sampling variance and may miss real regressions masked by a few lucky high-scoring runs.
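
As a sketch of what such a check could look like, the function below compares two lists of scores and flags a drop only when the approximate 95% confidence interval for the difference in means lies entirely below zero. This is not ModelRegression's code: the function name is hypothetical, and the normal approximation (z = 1.96) is a simplification that is generous for small samples, where a t-distribution or a bootstrap would be more appropriate.

```python
import math
import statistics

def regression_is_significant(recent, baseline, z_crit=1.96):
    """Return (delta_pct, significant) for two lists of scores.

    `significant` is True only when the upper end of the ~95% interval
    for (mean(recent) - mean(baseline)) is below zero, i.e. the drop is
    unlikely to be sampling noise under a normal approximation.
    """
    if len(recent) < 2 or len(baseline) < 2:
        raise ValueError("need at least two scores in each window")
    mean_recent = statistics.fmean(recent)
    mean_baseline = statistics.fmean(baseline)
    if mean_baseline == 0:
        raise ValueError("baseline mean is zero; percentage change undefined")
    # Standard error of the difference between two independent means.
    std_err = math.sqrt(
        statistics.variance(recent) / len(recent)
        + statistics.variance(baseline) / len(baseline)
    )
    diff = mean_recent - mean_baseline
    delta_pct = diff / mean_baseline * 100
    upper_bound = diff + z_crit * std_err
    return delta_pct, upper_bound < 0
```

Fed a noisy week such as [60, 90, 75, 95, 65, 85, 77] against a steady baseline of [82, 84, 86, 80, 88, 83, 85], this reports a drop of roughly 7% but does not flag it, whereas a consistent week around 71 against the same baseline is flagged.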

The 30-test suite is also fundamentally limited by handcrafted scope. These scenarios reflect what the author considers important coding tasks, but can't possibly represent the distribution of real-world use cases. If your workflow centers on data science code rather than web APIs, or academic writing rather than bug fixes, these benchmarks may not correlate with your experience of model quality. Scaling to hundreds of tests would help but requires either procedural generation (harder to validate) or massive manual curation effort.

Verdict

Use ModelRegression if you're a solo developer or small team that relies heavily on frontier AI models through official CLIs and you've been burned by unexplained performance shifts. It's perfect for building evidence when you suspect silent regressions: instead of arguing about vibes, you can point to quantitative historical data. The architecture is simple enough to fork and customize with your own test scenarios, making it more valuable as a template than a definitive ranking system.

Skip this if you need statistically rigorous benchmarking for academic research (use HELM or HumanEval instead), real-time alerting for production systems (the 24-hour cadence is too slow), or you primarily use model APIs rather than CLI tools (the subprocess architecture won't match your integration). Also skip if you need multimodal or agent benchmarking, since this is laser-focused on code generation through text interfaces. ModelRegression won't replace comprehensive evaluation suites, but it might save you from gaslighting yourself when a model genuinely gets worse.
