How NYU Built a Leaderboard to Track LLM Agents Hacking Their Way Through CTF Challenges

Hook

The latest competitor on NYU's cybersecurity CTF leaderboard isn't a human hacker—it's an LLM agent. And it's not alone: this repository tracks how well AI systems can autonomously exploit vulnerabilities, crack passwords, and capture flags.

Context

Evaluating LLM capabilities has become a benchmark arms race. We have SWE-bench for resolving real GitHub issues, MMLU for general knowledge, and HumanEval for function-level code generation. But cybersecurity presents a unique evaluation challenge: CTF (Capture The Flag) competitions require multi-step reasoning, tool usage, exploitation knowledge, and the ability to adapt when initial approaches fail. A traditional benchmark with static test cases doesn't capture the interactive, exploratory nature of hacking.

NYU researchers created a leaderboard to track LLM agent performance on 200 real CTF challenges spanning categories like binary exploitation, web security, cryptography, and reverse engineering. But rather than building a complex automated testing infrastructure, they opted for something surprisingly simple: a GitHub repository where researchers manually submit their results via pull request. The NYU-LLM-CTF/leaderboard_submissions repository is the backend that powers their public leaderboard at NYU-LLM-CTF.github.io, serving as both a submission system and a historical archive of how autonomous hacking capabilities evolve across different models and agent architectures.

Technical Insight

[Figure: System architecture (auto-generated). Submission structure: a researcher fork holds transcripts/team_model_date/ with summary.json plus optional logs and metadata; the researcher creates the submission folder and opens a pull request against the main repository. Leaderboard generation: once an approved submission is merged, generate_leaderboard.py aggregates all summaries into leaderboard.json, which powers the public site NYU-LLM-CTF.github.io.]

The architecture is refreshingly minimal. The entire system revolves around a standardized directory structure under transcripts/ where each submission gets its own folder. Inside that folder, researchers include a summary.json file containing binary success/failure results for each of the 200 challenges, along with full conversation logs and metadata in whatever format they prefer.

The canonical submission structure looks like this:

transcripts/
├── team_name_model_date/
│   ├── summary.json          # Required: standardized results
│   ├── metadata.json         # Optional: model details, prompts
│   ├── logs/                 # Optional: full conversation traces
│   │   ├── 2023q-pwn-puffin.txt
│   │   ├── 2023q-web-sqlmaster.txt
│   │   └── ...
│   └── README.md            # Optional: methodology notes

The summary.json file is the only strictly required component, and it follows a dead-simple schema:

{
  "team": "OpenAI Research",
  "model": "gpt-4-turbo",
  "date": "2024-01-15",
  "results": {
    "2023q-pwn-puffin": {"success": true, "flag": "flag{buffer_overflow_master}"},
    "2023q-web-sqlmaster": {"success": false},
    "2023q-crypto-rsa": {"success": true, "flag": "flag{weak_primes_ftw}"}
  }
}

This design choice—requiring only a minimal JSON summary while allowing arbitrary supplementary data—is brilliant in its pragmatism. Researchers can use any agent framework (AutoGPT, LangChain, custom implementations), any logging format, and any prompting strategy. The repository doesn't enforce tool choices or execution environments. It simply says: "Tell us what worked, and show your work however you want."
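Because summary.json is the one file every submission must get right, a quick pre-submission check can save a review round trip. The sketch below is not part of the repository: validate_summary is a hypothetical helper, and the rules it enforces are inferred only from the example shown above (string team/model/date, a results object, a boolean success per challenge, an optional string flag).

```python
def validate_summary(data):
    """Return a list of problems; an empty list means the summary looks well-formed.

    Hypothetical helper: the rules are inferred from the example summary.json
    in this article, not taken from the repository itself.
    """
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []
    for key in ("team", "model", "date"):
        value = data.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"'{key}' must be a non-empty string")

    results = data.get("results")
    if not isinstance(results, dict) or not results:
        problems.append("'results' must be a non-empty object")
        return problems

    for challenge, entry in results.items():
        if not isinstance(entry, dict) or not isinstance(entry.get("success"), bool):
            problems.append(f"{challenge}: 'success' must be true or false")
        elif "flag" in entry and not isinstance(entry["flag"], str):
            problems.append(f"{challenge}: 'flag' must be a string when present")
    return problems


example = {
    "team": "OpenAI Research",
    "model": "gpt-4-turbo",
    "date": "2024-01-15",
    "results": {
        "2023q-pwn-puffin": {"success": True, "flag": "flag{buffer_overflow_master}"},
        "2023q-web-sqlmaster": {"success": False},
    },
}
print(validate_summary(example))
```

Returning a list of problems rather than raising on the first one means a submitter sees every issue in a single pass.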

The aggregation happens via generate_leaderboard.py, a Python script that walks the transcripts/ directory, parses each summary.json, and outputs a single leaderboard.json file. The script is straightforward: about 150 lines of Python that essentially perform the following:

import json
from pathlib import Path

def aggregate_submissions():
    leaderboard = []

    # Each subdirectory of transcripts/ is one submission; sort for a stable order.
    for submission_dir in sorted(Path('transcripts').iterdir()):
        if not submission_dir.is_dir():
            continue

        # Skip folders that lack the required summary file.
        summary_path = submission_dir / 'summary.json'
        if not summary_path.exists():
            continue

        with open(summary_path, encoding='utf-8') as f:
            data = json.load(f)

        total_challenges = len(data['results'])
        solved = sum(1 for r in data['results'].values() if r.get('success'))

        leaderboard.append({
            'team': data['team'],
            'model': data['model'],
            'date': data['date'],
            'score': solved,
            'total': total_challenges,
            # Guard against an empty results object to avoid dividing by zero.
            'percentage': (solved / total_challenges) * 100 if total_challenges else 0.0
        })

    # Highest score first.
    leaderboard.sort(key=lambda x: x['score'], reverse=True)

    with open('leaderboard.json', 'w', encoding='utf-8') as f:
        json.dump(leaderboard, f, indent=2)

if __name__ == '__main__':
    aggregate_submissions()

The 200 challenges use canonical names such as 2023q-pwn-puffin or 2024q-web-xss-chaos: a year and a one-letter event marker, then the category, then the challenge name. The "q" most plausibly marks a qualifier round rather than a quarter, since the benchmark draws its challenges from CSAW CTF qualifiers and finals. This naming convention ensures submissions reference the same challenges consistently, even though the actual challenge files aren't hosted in this repository, which exists purely for results tracking.
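One practical payoff of consistent names is that they can be split mechanically, for example to see which categories an agent actually solves. The following is a minimal sketch: the regular expression is inferred from the two example names in this article, the one-letter marker after the year is captured as-is without assuming what it stands for, and both function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the example names in this article; the real
# benchmark may contain forms this does not cover.
NAME_RE = re.compile(
    r"^(?P<year>\d{4})(?P<marker>[a-z])-(?P<category>[a-z0-9]+)-(?P<name>.+)$"
)


def parse_challenge_name(canonical):
    """Split a canonical name into year, marker, category, and challenge name."""
    match = NAME_RE.match(canonical)
    if match is None:
        raise ValueError(f"unrecognized challenge name: {canonical!r}")
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts


def solved_by_category(results):
    """Count solved challenges per category from a summary.json 'results' mapping."""
    counts = Counter()
    for canonical, entry in results.items():
        if entry.get("success"):
            counts[parse_challenge_name(canonical)["category"]] += 1
    return dict(counts)


print(parse_challenge_name("2024q-web-xss-chaos"))
```

Note that the challenge name itself may contain hyphens, so only the first two hyphen-separated fields are treated as structure.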

The fork-and-PR workflow is standard GitHub collaboration: researchers fork the repo, add their submission folder, and open a pull request. The repository README explicitly mentions using git clone --depth 1 for shallow clones, acknowledging that as the archive grows with full conversation logs from multiple LLM agents attempting 200 challenges, the repository size will balloon. This is a submission archive that will only grow, never shrink.
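The "add their submission folder" step can be scripted inside a local checkout of the fork. This is a rough sketch, not the repository's tooling: create_submission is a hypothetical helper, the team_model_date folder pattern follows the layout described earlier, and the summary fields mirror the example summary.json.

```python
import json
from pathlib import Path


def create_submission(repo_root, team, model, date, results):
    """Create transcripts/<team>_<model>_<date>/ with summary.json; return the folder.

    Hypothetical helper: the folder-name pattern and fields follow the
    examples in this article, not a documented specification.
    """
    folder = Path(repo_root) / "transcripts" / f"{team}_{model}_{date}"
    # logs/ is optional in the described layout; created here for convenience.
    (folder / "logs").mkdir(parents=True, exist_ok=True)

    summary = {"team": team, "model": model, "date": date, "results": results}
    (folder / "summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8"
    )
    return folder
```

Calling create_submission(".", "example-team", "example-model", "2024-01-15", results) from the root of the fork would leave a folder ready to commit and open as a pull request.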

What makes this architecture particularly clever is what it doesn't do. There's no automated testing infrastructure, no sandboxed execution environment, no challenge hosting, and no validation logic. It's purely trust-based: the research community submits honest results, and the maintainers merge pull requests after presumably spot-checking the submissions. This works because the target audience is academic researchers who benefit from transparent, reproducible reporting, not competitors gaming a prize pool.

Gotcha

The binary success metric is both a strength and a limitation. You either captured the flag or you didn't: there's no credit for getting 90% of the way there, no measurement of efficiency, and no tracking of how many attempts or how much time the agent consumed. An agent that solves a challenge in 5 messages gets the same score as one that needs 500. This masks important differences in agent quality and makes it impossible to distinguish between "barely solved it after exhaustive brute force" and "elegantly exploited the vulnerability on the first try."
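To make the gap concrete, here is a small illustration of two agents that tie on the binary score while differing wildly in effort. It assumes a per-challenge "messages" count, which is a hypothetical field and not part of the summary.json schema described above; the challenge names and numbers are made up for the example.

```python
from statistics import median


def score_with_effort(results):
    """Return (solved_count, median_messages_per_solve) for a results mapping.

    'messages' is a hypothetical per-challenge field used only for this
    illustration; the leaderboard schema in this article does not include it.
    """
    solved_costs = [r["messages"] for r in results.values() if r.get("success")]
    if not solved_costs:
        return 0, None
    return len(solved_costs), median(solved_costs)


fast_agent = {
    "chal-a": {"success": True, "messages": 5},
    "chal-b": {"success": True, "messages": 7},
}
slow_agent = {
    "chal-a": {"success": True, "messages": 500},
    "chal-b": {"success": True, "messages": 480},
}

print(score_with_effort(fast_agent))  # (2, 6.0)
print(score_with_effort(slow_agent))  # (2, 490.0)
```

Both agents would show a score of 2 on the leaderboard; only the second number reveals the difference.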

The trust-based submission model is a potential vulnerability. There's no apparent verification system: no automated re-running of submissions, no requirement to provide reproducible execution environments, and no validation that the provided flags are correct. A malicious or careless submitter could inflate their scores, and while academic reputation serves as a deterrent, the infrastructure itself doesn't prevent gaming. If the leaderboard gains prominence and incentives shift (research grants, recruiting attention, etc.), this becomes a more serious concern. The repository also doesn't preserve the actual CTF challenge files, making it impossible for future researchers to independently verify historical submissions if the original challenge sources become unavailable.
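One lightweight mitigation, sketched here purely as an assumption about what maintainers could do, is to compare each claimed flag against a privately held table of SHA-256 hashes. Nothing like this is described for the repository; check_flags, sha256_hex, and the reference_hashes table are all hypothetical.

```python
import hashlib
import hmac


def sha256_hex(flag):
    """Hex SHA-256 digest of a flag string."""
    return hashlib.sha256(flag.encode("utf-8")).hexdigest()


def check_flags(results, reference_hashes):
    """Return challenges claimed as solved whose flag is missing, unknown, or wrong.

    reference_hashes maps challenge name -> hex SHA-256 of the true flag.
    It is a hypothetical, maintainer-held table, not something the
    repository is known to provide.
    """
    suspicious = []
    for challenge, entry in results.items():
        if not entry.get("success"):
            continue
        expected = reference_hashes.get(challenge)
        flag = entry.get("flag")
        if expected is None or not isinstance(flag, str):
            suspicious.append(challenge)
        elif not hmac.compare_digest(sha256_hex(flag), expected):
            suspicious.append(challenge)
    return suspicious
```

A check like this only confirms that the submitted string matches the real flag. It cannot show that the agent found the flag on its own, so it narrows the trust gap rather than closing it.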

Verdict

Use if: You're developing LLM agents for cybersecurity tasks and want to benchmark against established baselines on a standardized set of 200 CTF challenges, or you're researching autonomous agent capabilities in adversarial domains and need a historical dataset of how performance has evolved. This is the right venue for publishing your cybersecurity agent results to gain academic visibility.

Skip if: You need the actual CTF challenges to train or test your agent (they're not hosted here), you want nuanced metrics beyond binary success/failure, or you're looking for implementation details of high-performing agents (you'll need to chase down the submitters' actual codebases).

This is infrastructure for results reporting, not a complete benchmarking platform. Treat it as the leaderboard backend it claims to be, not a turnkey evaluation suite.
