Inside NYU's LLM CTF Leaderboard: Git as a Decentralized Benchmark Database

Hook

Most benchmarking platforms build complex databases and APIs to manage submissions. NYU’s LLM CTF leaderboard uses Git commits as the database, pull requests as the submission system, and a 200-line Python script to generate the entire leaderboard.

Context

The explosion of LLM capabilities has researchers racing to apply these models to cybersecurity tasks, particularly Capture The Flag challenges that test everything from binary exploitation to web security. But comparing LLM agent performance across different approaches is nearly impossible without standardized benchmarks. Each research group runs their own evaluations on different challenge sets, uses different prompt strategies, and reports results in incompatible formats. When NYU released their CTF Dataset with 200 cybersecurity challenges specifically designed for LLM evaluation, they faced a choice: build a traditional benchmarking platform with databases, authentication, and automated validation, or embrace a radically simpler approach.

The NYU-LLM-CTF/leaderboard_submissions repository represents that simpler path. Rather than building infrastructure, they treat the Git repository itself as the submission database. Every submission lives in version control, every change is auditable through Git history, and the entire leaderboard generation happens in a single Python script that outputs static JSON. It’s benchmark infrastructure reduced to its essence: a folder structure, a JSON schema, and a PR workflow that every developer already understands.

Technical Insight

[Figure: system architecture (auto-generated). Submission pipeline: a researcher forks the repository and creates a transcripts/team_name/ folder containing a required summary.json (metadata: team name, model, solve rate, timestamp) and optional conversation logs; generate_leaderboard.py walks and validates each folder, aggregates results into leaderboard.json, which is consumed by the GitHub Pages public website.]

The architecture is simple but cleverly designed. Submissions live in a transcripts/ directory where each team creates a folder named after their agent approach. Inside, a summary.json file contains standardized metadata while additional files hold the full conversation transcripts between the LLM and the CTF environment.

The summary.json schema enforces the minimal data needed for leaderboard ranking:

{
  "team_name": "research_group_identifier",
  "model": "gpt-4",
  "temperature": 0.7,
  "challenges_solved": 47,
  "total_challenges": 200,
  "solve_rate": 0.235,
  "timestamp": "2024-01-15T10:30:00Z",
  "transcript_files": ["conversation_001.txt", "conversation_002.txt"]
}

This format is intentionally flexible about transcript storage. Some teams submit plaintext conversations, others use structured JSON logs, and some include additional metadata like tool calls or reasoning traces. The only requirement is that conversations are timestamped and indicate whether challenges were successfully solved. This flexibility respects that different research groups have different logging infrastructures while maintaining comparability at the summary level.
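A minimal schema check along these lines is easy to sketch. The field names come from the example above; since the repository's actual review is manual, this exact validator is hypothetical:

```python
import json
from pathlib import Path

# Fields every summary.json is expected to carry, per the schema above.
REQUIRED_FIELDS = {
    "team_name": str,
    "model": str,
    "challenges_solved": int,
    "total_challenges": int,
    "solve_rate": float,
    "timestamp": str,
}

def validate_summary(path):
    """Return a list of problems found in one summary.json (empty if valid)."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable JSON: {exc}"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors
```

Running a check like this over transcripts/ before opening a PR would catch most formatting mistakes before a human reviewer ever sees them.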

The leaderboard generation happens in generate_leaderboard.py, which walks the transcripts/ directory, validates each summary.json, and aggregates results:

import json
from pathlib import Path

def generate_leaderboard():
    submissions = []
    transcripts_dir = Path('transcripts')

    # Sorted iteration keeps the walk deterministic across filesystems.
    for team_dir in sorted(transcripts_dir.iterdir()):
        if not team_dir.is_dir():
            continue

        summary_path = team_dir / 'summary.json'
        if not summary_path.exists():
            print(f"Warning: {team_dir.name} missing summary.json")
            continue

        try:
            with open(summary_path) as f:
                submission = json.load(f)
        except json.JSONDecodeError as exc:
            print(f"Warning: {team_dir.name} has invalid JSON: {exc}")
            continue

        submission['team_id'] = team_dir.name
        submissions.append(submission)

    # Rank by solve rate, highest first; break ties by earliest timestamp.
    # (A single reverse=True over both keys would wrongly favor later
    # submissions on ties, so the rate is negated instead.)
    submissions.sort(key=lambda x: (-x['solve_rate'], x['timestamp']))

    with open('leaderboard.json', 'w') as f:
        json.dump(submissions, f, indent=2)

if __name__ == '__main__':
    generate_leaderboard()

This script runs on every merge to main, regenerating the entire leaderboard from scratch. The resulting leaderboard.json gets consumed by the GitHub Pages site at NYU-LLM-CTF.github.io, which renders the rankings and provides links back to the transcript folders for full transparency.

The submission workflow leverages GitHub’s native PR system as a validation layer. Researchers fork the repository, add their submission folder, and open a pull request. Maintainers review the PR to verify the submission follows the schema, the solve rate calculations are accurate, and the transcripts demonstrate legitimate challenge completions. This manual review is intentional—it catches errors that automated validation would miss, like incorrectly counted solves or timestamps that don’t match transcript dates.
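Some of that review could be front-loaded with simple pre-flight checks. For instance, a hypothetical helper (not part of the repository) can verify that a submission's reported solve_rate is arithmetically consistent with its reported counts:

```python
def solve_rate_consistent(summary, tolerance=1e-3):
    """Check that the reported solve_rate matches solved/total within tolerance."""
    expected = summary["challenges_solved"] / summary["total_challenges"]
    return abs(expected - summary["solve_rate"]) <= tolerance

# Example from the schema above: 47/200 = 0.235, so a submission
# reporting solve_rate 0.235 is internally consistent.
```

Checks like this would not replace human judgment about transcript legitimacy, but they would let reviewers focus on the parts a script cannot verify.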

What makes this architecture clever is how it uses Git’s strengths while avoiding its weaknesses. Git provides versioning (every submission has a commit history), transparency (anyone can audit the raw data), and distributed backups (every fork is a complete copy). The shallow clone strategy (git clone --depth 1) mitigates repository size concerns for casual users who just want to see the leaderboard data. And because leaderboard generation is deterministic and stateless, anyone can regenerate the official leaderboard.json by running the script themselves—there’s no hidden database or privileged access.
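That regeneration claim is easy to act on. A sketch of such an audit, assuming the script and output file names shown above:

```python
import json
import subprocess
import sys
from pathlib import Path

def audit_leaderboard(repo_dir="."):
    """Regenerate the leaderboard and report whether it matches the committed copy."""
    repo = Path(repo_dir)
    # Snapshot the committed file before the script overwrites it.
    committed = json.loads((repo / "leaderboard.json").read_text())
    subprocess.run([sys.executable, "generate_leaderboard.py"],
                   cwd=repo, check=True)
    regenerated = json.loads((repo / "leaderboard.json").read_text())
    return committed == regenerated
```

If this returns False, either the committed leaderboard.json is stale or a submission changed without the leaderboard being regenerated.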

The system also creates interesting incentives for reproducibility. Since transcripts are public and versioned, researchers can’t quietly update submissions after the fact. The Git history shows exactly when each submission was made and what changed over time. This transparency pressures teams to submit complete, accurate results rather than gaming the metrics, because the raw data is available for anyone to scrutinize.

Gotcha

The biggest limitation is scalability, both in repository size and in manual review bottlenecks. Every submission adds hundreds of kilobytes to several megabytes of transcript data to the repository. After dozens of submissions, cloning becomes slow even with shallow clones, and GitHub's repository size guidance (under 1GB recommended, under 5GB strongly recommended) starts to loom. There's no archival strategy for old submissions and no compression for completed challenges.

The manual PR review process also doesn’t scale well. With 5-10 submissions, maintainers can carefully review each transcript and verify solve counts. With 100 submissions, this becomes a significant time burden and creates delays for submitters waiting for their results to appear on the leaderboard. The validation is also inconsistent—there’s no formal checklist or automated pre-flight checks, so different reviewers might enforce standards differently.

The lack of automated validation for transcript integrity is another blind spot. The system trusts that submitted solve counts match the transcripts, that timestamps are accurate, and that challenges were legitimately solved rather than answers being looked up. While Git history provides some accountability, there’s no cryptographic verification or sandboxed re-execution of the agent runs. A determined bad actor could fabricate impressive-looking results, and detection would rely on manual scrutiny or community reporting. For an academic benchmark where reputation matters, this might be acceptable, but it wouldn’t work for high-stakes competitions with prizes.

Verdict

Use if you’re building an academic benchmark where transparency and simplicity trump automation, where submission volume is measured in dozens not thousands, and where you trust your community to self-police. This architecture shines when Git’s versioning and GitHub’s PR workflow map naturally to your validation needs, and when your benchmark data is valuable enough that public archival in Git is a feature, not a bug. It’s perfect for specialized research leaderboards tracking LLM agent performance on the NYU CTF Dataset specifically.

Skip if you need private submissions, real-time leaderboard updates, automated re-execution of submissions for verification, or expect high submission volume that would overwhelm manual review. Also skip if your benchmark requires complex validation logic that goes beyond schema checking—the PR review bottleneck will frustrate submitters. And definitely skip if you’re building general CTF infrastructure rather than a submission portal for one specific dataset. For those cases, invest in proper competition platforms like CodaLab or build custom infrastructure with databases and API endpoints.
