Inside NYU’s LLM CTF Leaderboard: How Researchers Benchmark AI Agents on Hacking Challenges
Hook
While most AI benchmarks measure language understanding, NYU’s CTF leaderboard tests whether LLMs can actually break into systems, and it requires researchers to show every command their agents tried.
Context
The rise of autonomous LLM agents has sparked a critical question in cybersecurity: can AI systems find and exploit vulnerabilities without human guidance? Traditional benchmarks like HumanEval test coding ability, but writing exploits requires a different skillset—understanding binary formats, analyzing assembly code, crafting payloads, and iteratively debugging failed attacks. The NYU-LLM-CTF/leaderboard_submissions repository addresses this gap by providing a standardized submission system for researchers evaluating their LLM agents against 200 Capture The Flag challenges.
Unlike automated leaderboard platforms that run your code in sandboxed environments, this repository takes a radically transparent approach: fork the repo, run your agent locally, submit your complete conversation transcripts via pull request. This design choice prioritizes reproducibility and analysis over convenience. Every command execution, every LLM response, every failed exploit attempt becomes part of the public record. For cybersecurity researchers, this transparency is valuable—knowing why an agent succeeded or failed reveals whether it truly understands exploitation techniques or simply got lucky with prompt injection.
Technical Insight
The architecture of leaderboard_submissions reflects a pragmatic understanding of academic research workflows. At its core, the system uses a decentralized submission model where the repository serves as a data lake rather than an evaluation platform. Each research team maintains their own agent infrastructure, runs evaluations locally, and submits results under the transcripts/ directory with a folder structure like transcripts/my-team-agent/. The only mandatory file is summary.json, which follows a strict schema:
{
  "metadata": {
    "agent": "CyberGPT-v2",
    "comment": "pass@5 with RAG retrieval",
    "model": "gpt-4-0125-preview",
    "link": "https://github.com/example/cybergpt",
    "date": "2024/03/15"
  },
  "results": {
    "2023q-pwn-puffin": true,
    "2023q-web-sqlchallenge": false,
    ...
  }
}
This JSON structure captures the essential metadata (which model, which agent architecture, when it ran), while the results object maps the canonical name of each of the 200 challenges to a boolean success indicator. The canonical naming system comes from the nyuctf package, whose CTFChallenge.canonical_name method ensures consistency across submissions. This matters because CTF challenge names can be ambiguous (multiple competitions might have a “baby’s first buffer overflow” challenge), whereas canonical names like “2023q-pwn-puffin” uniquely identify each test case.
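Before opening a pull request, it's worth sanity-checking your summary.json locally. A minimal validator might look like the following sketch; the field names come from the schema example above, but the specific checks are illustrative, not the repository's actual validation logic:

```python
import json

# Fields from the summary.json example above (assumed required here)
REQUIRED_METADATA = {"agent", "model", "date"}

def validate_summary(path):
    """Sanity-check a summary.json before opening a PR (illustrative only)."""
    with open(path) as f:
        summary = json.load(f)
    missing = REQUIRED_METADATA - summary.get("metadata", {}).keys()
    if missing:
        raise ValueError(f"metadata missing fields: {sorted(missing)}")
    results = summary.get("results", {})
    bad = [name for name, solved in results.items()
           if not isinstance(solved, bool)]
    if bad:
        raise ValueError(f"non-boolean results for: {bad}")
    return len(results)  # number of challenges reported
```

A complete submission should report all 200 challenges, so checking the return value against 200 is a reasonable extra assertion.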
The transcript requirements reveal the repository’s research-first philosophy. Beyond the structured JSON, teams must include complete conversational logs showing the LLM’s reasoning process. The README specifies three mandatory elements: conversational history with initial prompts and command outputs, generation timestamps, and success indicators. The format itself is flexible—you could use JSON, YAML, plain text, or structured logs—as long as you document it in your submission’s README. This flexibility acknowledges that different agent architectures produce different log formats. A ReAct agent might log thought-action-observation cycles, while a multi-agent system might track inter-agent communication.
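Because the format is free-form, a team might, for example, log each agent step as one JSON line carrying the three mandatory elements. This is a hypothetical format I'm sketching here, not one prescribed by the repository:

```python
import json
from datetime import datetime, timezone

def log_step(logfile, role, content, success=None):
    """Append one agent step as a JSON line: who produced it, what was
    said or run, when it was generated, and (on the final step) whether
    the flag was captured -- the three elements the README requires."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "role": role,        # e.g. "system", "assistant", "command_output"
        "content": content,
    }
    if success is not None:
        entry["success"] = success  # success indicator for the challenge
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Any equivalent structure works, as long as the submission's README documents it.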
The aggregation mechanism is equally straightforward: generate_leaderboard.py walks the transcripts/ directory, validates each summary.json against the schema, and produces leaderboard.json—a single file consumed by the leaderboard webpage. Importantly, this generated artifact is excluded from pull requests. Contributors run the script locally to verify their submission parses correctly, but the maintainers regenerate the canonical leaderboard after merging. This prevents merge conflicts and ensures the leaderboard remains a pure function of the transcript data.
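The real generate_leaderboard.py isn't reproduced in this article, but the walk-validate-aggregate pattern it implements might be sketched like this (directory and file names come from the repository layout described above; the ranking and output shape are assumptions):

```python
import json
from pathlib import Path

def build_leaderboard(transcripts_dir="transcripts"):
    """Collect every summary.json under transcripts/ into one ranked list.

    Sketch of the walk-validate-aggregate pattern; the actual
    generate_leaderboard.py may validate and rank differently.
    """
    entries = []
    for summary_path in sorted(Path(transcripts_dir).glob("*/summary.json")):
        with open(summary_path) as f:
            summary = json.load(f)
        results = summary["results"]
        entries.append({
            "submission": summary_path.parent.name,
            "metadata": summary["metadata"],
            "solved": sum(1 for ok in results.values() if ok),
            "total": len(results),
        })
    # Rank by number of solved challenges, best first
    entries.sort(key=lambda e: e["solved"], reverse=True)
    return entries

if __name__ == "__main__":
    # Write the aggregated artifact that the leaderboard webpage consumes
    with open("leaderboard.json", "w") as f:
        json.dump(build_leaderboard(), f, indent=2)
```

Running this locally before opening a PR catches malformed summary.json files early, which is exactly the pre-submission check the maintainers ask for.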
The repository’s reliance on Git for submission management has architectural implications. The README explicitly recommends git clone --depth 1 because full conversation histories across multiple submissions can make the repository large. This shallow clone pattern suggests the maintainers expect substantial repository growth. The fork-and-PR workflow also creates a natural rate limit: each submission requires manual review, preventing spam while allowing maintainers to verify that transcripts genuinely represent the claimed results.
Gotcha
The manual submission process is both the repository’s strength and its potential weakness. Requiring pull requests ensures human oversight and prevents gaming the leaderboard, but it doesn’t scale as easily as automated systems. There’s no automated validation pipeline checking whether your transcripts actually contain the commands you claim to have run, or whether those commands could have plausibly solved the challenges. The README provides no guidelines for maintainers on how to verify submissions beyond schema validation.
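Nothing prevents maintainers from scripting a basic plausibility check, though. One heuristic I'm proposing here (not something the repository implements): flag any challenge claimed as solved in summary.json whose canonical name never appears anywhere in the submission's transcript files:

```python
import json
from pathlib import Path

def unverified_claims(submission_dir):
    """Return challenges marked solved whose canonical names never appear
    in any transcript file -- a weak plausibility check, not proof of
    fabrication, since transcript formats vary between submissions."""
    submission = Path(submission_dir)
    with open(submission / "summary.json") as f:
        results = json.load(f)["results"]
    # Concatenate the text of every non-summary file in the submission
    corpus = ""
    for path in submission.rglob("*"):
        if path.is_file() and path.name != "summary.json":
            corpus += path.read_text(errors="ignore")
    return [name for name, solved in results.items()
            if solved and name not in corpus]
```

A check like this would catch only the crudest inconsistencies, which underlines the article's point: meaningful verification still requires a human reading the transcripts.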
The repository also lacks infrastructure for reproducing results. You’re submitting logs of past runs, not executable agent code. If your transcripts show your agent solved a challenge using a specific exploitation technique, there’s no mechanism for other researchers to re-run your agent to verify the approach generalizes. The link field in metadata might point to your agent’s codebase, but there’s no requirement that the linked repository provides reproducible evaluation scripts. This makes the leaderboard more of a results registry than a true benchmark—useful for comparing published numbers, but less valuable for understanding why certain approaches work. The repository size issue mentioned in the README (hence the shallow clone recommendation) may require additional management strategies as submissions accumulate.
Verdict
Use this repository if you’re actively researching LLM agents for cybersecurity tasks and want to benchmark against the 200 challenges in the dataset. The submission process is straightforward—run your evaluation locally, format your results, open a PR—and participating in the leaderboard can support academic publications. The transparent transcript requirement is actually a feature if you’re doing rigorous research, since it forces you to document your methodology thoroughly. Skip it if you need real-time automated evaluation, are building production security tools rather than research prototypes, or want a benchmark with built-in reproducibility infrastructure. The manual PR workflow makes rapid iteration less convenient, and the lack of automated verification means you’re relying on maintainer review and community trust. For academic labs publishing papers on LLM cybersecurity capabilities, this leaderboard serves its purpose. For anyone else, the overhead may not be justified unless you specifically need these CTF challenges as your evaluation set.