Back to Articles

CyberGym: Benchmarking AI Security Agents Against 10 Terabytes of Real CVEs

[ View on GitHub ]

CyberGym: Benchmarking AI Security Agents Against 10 Terabytes of Real CVEs

Hook

Most AI security benchmarks test agents on synthetic bugs. CyberGym does the opposite: it’s a large-scale evaluation framework built on real CVE vulnerabilities—complete with compilation environments totaling ~10TB—that rigorously assesses whether AI agents can analyze and exploit authentic security flaws.

Context

The AI security research community faces challenges in benchmarking agent capabilities against realistic vulnerability scenarios. Many existing benchmarks use toy problems or simplified challenges that may not reflect actual security work.

CyberGym, developed by researchers at UC Berkeley’s Sunblaze lab, addresses this gap by using real CVEs from projects like Apache Avro and Google’s OSS-Fuzz corpus. Each task includes the complete build environment, vulnerable code, patched code, and a verification pipeline that checks whether submitted Proof-of-Concept exploits actually trigger the vulnerability. This is a research-grade benchmarking framework designed to rigorously assess AI agents’ capabilities on real-world vulnerability analysis tasks.

Technical Insight

The architecture revolves around three components working in concert: a task generation system, a server-side verification pipeline, and containerized execution environments. The task generator creates standardized challenge packages from the curated vulnerability dataset. Each task bundles the vulnerable repository snapshot, metadata about the CVE, difficulty classification, and submission tooling.

The submission workflow is straightforward. After generating a task, you receive a tarball with the vulnerable code, a description, and a submission script:

# Generate a task for a specific CVE
SERVER_IP=localhost
SERVER_PORT=8666
TASK_ID='arvo:10400'
OUT_DIR=./cybergym_tmp
CYBERGYM_DATA_DIR=./cybergym_data/data

python3 -m cybergym.task.gen_task \
    --task-id $TASK_ID \
    --out-dir $OUT_DIR \
    --data-dir $CYBERGYM_DATA_DIR \
    --server "http://$SERVER_IP:$SERVER_PORT" \
    --difficulty level1

# Results in:
# ./cybergym_tmp/
# ├── description.txt
# ├── README.md
# ├── repo-vul.tar.gz
# └── submit.sh

# Submit a PoC
echo -en "\x00\x01\x02\x03" > $OUT_DIR/poc
bash $OUT_DIR/submit.sh $OUT_DIR/poc

The server component runs a dual validation check that makes CyberGym’s methodology rigorous. When you submit a PoC, the server executes it against both the vulnerable and patched versions of the code in isolated Docker containers. A valid exploit must demonstrate differential behavior: it should crash or fail on the vulnerable version but execute cleanly on the patched version. This dual validation eliminates false positives from inputs that simply cause crashes unrelated to the actual vulnerability.

The containerization strategy is notable. Each CVE gets its own Docker image containing the complete compilation environment, dependencies, and toolchain needed to build and run that specific version of the software. This is why the full dataset grows to ~10TB—you’re not just storing source code, you’re preserving entire build ecosystems frozen at specific points in time. The binary-only mode offers a compromise at ~130GB by stripping out compilation environments, suitable for static analysis tasks where you don’t need to build or run code dynamically.

The verification pipeline maintains a SQLite database tracking all submissions. After running your agent, you can programmatically verify results:

python3 scripts/verify_agent_result.py \
    --server http://$SERVER_IP:$SERVER_PORT \
    --pocdb_path $POC_SAVE_DIR/poc.db \
    --agent_id 8113f33401d34ee3ae48cf823b757ac7

This returns structured data about each PoC attempt, including exit codes for both vulnerable and fixed versions, allowing you to calculate success rates and analyze failure modes. The framework exposes an HTTP API for submissions, making it language-agnostic—your agent can be written in Python, Rust, or anything else that can make POST requests.

The task difficulty classification system segments challenges into levels, helping researchers progressively test agent capabilities. The curated subset of 10 tasks explicitly includes five that agents can successfully generate PoCs for and five that are more challenging, providing a calibrated starting point for evaluation without requiring the full ~10TB download.

Gotcha

The infrastructure requirements are substantial. The full ~10TB dataset is large and requires significant storage allocation. Even the base benchmark data requires ~240GB. The binary-only mode at ~130GB is more manageable but sacrifices dynamic analysis capabilities, limiting you to static approaches. If your research involves fuzzing, symbolic execution, or techniques requiring code compilation and execution, you need the full dataset.

The Docker dependency creates operational considerations. You need a host with resources to run compilation and execution jobs in containers. The framework doesn’t include built-in scheduling or resource management for parallel evaluations, so running large-scale experiments may require building your own orchestration layer.

The vulnerability dataset has specific focus areas. The README indicates tasks include real CVEs from Apache Avro and OSS-Fuzz projects, which may have certain characteristics common to fuzz-discoverable vulnerabilities. The framework is designed for rigorous assessment of AI agents on real-world vulnerability analysis tasks within its scope, though researchers should consider whether this scope aligns with their specific research questions.

Verdict

Use CyberGym if you’re conducting academic research on AI agent capabilities in vulnerability analysis, have access to substantial storage infrastructure (at minimum ~240GB for the base dataset, potentially ~10TB for full compilation environments), need reproducible benchmarks backed by peer-reviewed methodology (arXiv 2506.02548), or want to test agents against authentic CVEs from real projects. The framework excels at providing rigorous evaluations suitable for research publications. Skip it if you’re doing lightweight security tool development, lack the Docker and storage infrastructure, need a platform for training agents rather than evaluation, or require out-of-the-box orchestration for parallel testing at scale. This is a research evaluation framework designed to rigorously assess AI agents on real-world vulnerability analysis tasks.

// QUOTABLE

Most AI security benchmarks test agents on synthetic bugs. CyberGym does the opposite: it's a large-scale evaluation framework built on real CVE vulnerabilities—complete with compilation environmen...

[ Tweet This ]
// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/developer-tools/sunblaze-ucb-cybergym.svg)](https://starlog.is/api/badge-click/developer-tools/sunblaze-ucb-cybergym)