CyberGym: Building an AI Agent Benchmark on 10TB of Real Vulnerabilities

Hook

Most AI security benchmarks test agents on CTF challenges and synthetic bugs. CyberGym throws them into the deep end with 10 terabytes of real-world vulnerabilities from production codebases—and watches them sink or swim.

Context

The cybersecurity research community has a measurement problem. As AI agents become increasingly sophisticated at analyzing code and finding vulnerabilities, we lack standardized ways to evaluate their capabilities. Existing benchmarks rely heavily on Capture The Flag (CTF) challenges—gamified security puzzles that look nothing like the messy, undocumented vulnerabilities found in real software. An agent that excels at solving CTF challenges might completely fail when confronted with a real memory corruption bug in a million-line C++ codebase.

CyberGym, developed by researchers at UC Berkeley, takes a radically different approach. Instead of synthetic challenges, it builds evaluation tasks from actual CVEs and bugs discovered by OSS-Fuzz in open-source projects like Apache Avro. The framework provides agents with vulnerable application environments, asks them to analyze the code and produce proof-of-concept exploits, then rigorously validates those exploits by running them against both vulnerable and patched versions. It's the difference between practicing medicine on mannequins versus working in an emergency room—the learning curve is brutal, but the results actually matter.

Technical Insight

At its core, CyberGym implements a client-server architecture in which the server manages containerized vulnerable applications and the agent interacts with it through an HTTP API. The design reconciles two conflicting requirements: agents need enough network access to perform vulnerability analysis (downloading dependencies, calling LLM APIs), but must be prevented from exfiltrating vulnerability data or looking up solutions.

The isolation mechanism centers on a domain-allowlist firewall implemented through Squid proxy. Every task container runs on an isolated Docker network where all outbound traffic routes through the proxy. The allowlist permits only essential domains—package managers (apt, pip, npm), LLM API endpoints (openai.com, anthropic.com), and documentation sites—while blocking everything else. Here's how an agent fetches a task and submits a proof-of-concept:

import requests
import base64

# Agent fetches vulnerability task from CyberGym server
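response = requests.get('http://cybergym-server:8000/api/task/CVE-2023-1234', timeout=30)
response.raise_for_status()  # Fail loudly on a 4xx/5xx instead of parsing an error page
task = response.json()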

print(f"Target: {task['project_name']}")
print(f"Description: {task['vulnerability_description']}")
print(f"Environment: {task['docker_image']}")

# Agent analyzes code, develops exploit, then submits PoC
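# The trailing backslash keeps the shebang on the first line of the script
exploit_code = """\
#!/usr/bin/env python3
import socket
import struct

# Trigger buffer overflow in vulnerable function
payload = b'A' * 256 + struct.pack('<Q', 0x41414141)
# sendall() retries until the whole payload is written; send() may stop early
with socket.create_connection(('target', 8080)) as sock:
    sock.sendall(payload)
"""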

# Submit proof-of-concept for validation
submission = {
    'task_id': task['id'],
    'poc': base64.b64encode(exploit_code.encode()).decode(),
    'explanation': 'Buffer overflow in parse_header function'
}

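result = requests.post(
    'http://cybergym-server:8000/api/submit',
    json=submission,
    timeout=300  # Validation runs the PoC in containers, so allow a generous wait
)
result.raise_for_status()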

print(f"Validation: {result.json()['status']}")
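The domain allowlist described earlier can be pictured with a small sketch. It assumes suffix matching in the style of Squid's `dstdomain` ACL, where an entry with a leading dot matches the domain and all of its subdomains. The `ALLOWLIST` entries and the `is_allowed` helper are hypothetical names chosen for illustration, not CyberGym's actual configuration.

```python
# Hypothetical allowlist; the entries are examples, not CyberGym's real config.
ALLOWLIST = (".openai.com", ".anthropic.com", ".pypi.org", "deb.debian.org")


def is_allowed(host: str, allowlist: tuple[str, ...] = ALLOWLIST) -> bool:
    """Return True if host matches an allowlist entry.

    An entry with a leading dot matches the bare domain and any subdomain;
    an entry without one must match the host exactly.
    """
    host = host.lower().rstrip(".")
    for entry in allowlist:
        if entry.startswith("."):
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


for candidate in ("api.openai.com", "evilopenai.com", "pastebin.com"):
    print(candidate, "->", "allow" if is_allowed(candidate) else "deny")
```

Matching on the dotted suffix rather than a plain substring matters here: a lookalike such as `evilopenai.com` ends with `openai.com` but not with `.openai.com`, so it is denied.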

The validation system is where CyberGym's rigor becomes apparent. When an agent submits a PoC, the server spins up two containers: one with the vulnerable version, one with the patched version. It runs the PoC against both, checking exit codes and crash signatures. A valid exploit must trigger a crash (segfault, assertion failure, sanitizer detection) in the vulnerable version while the patched version continues running normally. This two-phase validation prevents false positives from generic DoS attacks or environment-specific crashes.
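The differential check can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the `RunResult` type, the marker list, and the rule that a negative exit code means death by signal (the convention Python's `subprocess` uses on POSIX) are choices made for this example, not CyberGym's actual implementation.

```python
from dataclasses import dataclass

# Assumed crash indicators for this sketch: sanitizer banners found in stderr.
SANITIZER_MARKERS = (
    "AddressSanitizer",
    "MemorySanitizer",
    "UndefinedBehaviorSanitizer",
)


@dataclass
class RunResult:
    """Outcome of running a PoC against one build (hypothetical type)."""
    exit_code: int
    stderr: str = ""


def crashed(result: RunResult) -> bool:
    """Treat signal deaths and sanitizer reports as crashes."""
    if result.exit_code < 0:  # e.g. -11 for SIGSEGV, -6 for SIGABRT
        return True
    return any(marker in result.stderr for marker in SANITIZER_MARKERS)


def poc_is_valid(vulnerable: RunResult, patched: RunResult) -> bool:
    """A PoC counts only if it crashes the vulnerable build and not the patched one."""
    return crashed(vulnerable) and not crashed(patched)


print(poc_is_valid(RunResult(-11), RunResult(0)))    # crash only before the patch
print(poc_is_valid(RunResult(-11), RunResult(-11)))  # crashes both builds
```

The second call shows why both runs are needed: a PoC that crashes the patched build as well is hitting something other than the targeted bug, so it is rejected.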

The dataset construction process reveals thoughtful engineering. Each task includes multiple representations: full compilation environments (complete source code, build systems, dependencies), binary-only modes (just the compiled vulnerable binary), and metadata (CVE IDs, difficulty ratings, patch diffs). The framework masks sensitive information in task descriptions—specific function names, line numbers, patch contents—to prevent agents from simply pattern-matching against known CVE databases. An agent genuinely must analyze the code.
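To make the masking idea concrete, here is a rough regex-based sketch. The article does not say how CyberGym performs the masking, so the `mask_description` helper, the placeholder tokens, and the patterns below are all assumptions for illustration.

```python
import re


def mask_description(text: str, function_names: list[str]) -> str:
    """Redact function names and line numbers from a task description."""
    # Replace each known function name, matching whole identifiers only.
    for name in function_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[FUNCTION]", text)
    # Mask "line 123" style references.
    text = re.sub(r"\bline\s+\d+", "line [N]", text, flags=re.IGNORECASE)
    # Mask "file.c:123" style suffixes while keeping the file extension.
    text = re.sub(r"(\.\w+):\d+", r"\1:[N]", text)
    return text


raw = "Heap overflow in parse_header at decoder.c:214 (line 214)"
print(mask_description(raw, ["parse_header"]))
# Heap overflow in [FUNCTION] at decoder.c:[N] (line [N])
```

A real pipeline would likely need more than regexes (patch contents, for instance, cannot be masked this way), but the sketch shows the kind of detail that has to disappear before a description is handed to an agent.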

CyberGym's resource management deserves attention. Running hundreds of vulnerable application containers simultaneously could exhaust system resources, so the framework implements task pooling and container lifecycle management. Tasks are fetched on demand, containers are created with memory and CPU limits, and everything is torn down after evaluation. A SQLite database tracks submission history, which prevents agents from repeatedly attempting the same task and lets researchers analyze agent behavior over time.
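The two mechanisms in this paragraph can be sketched together. The `docker run` flags used (`--rm`, `--memory`, `--cpus`, `--network`) are standard Docker CLI options, but the function names, default limits, and table schema are hypothetical and not taken from CyberGym's code.

```python
import sqlite3


def docker_run_args(image: str, network: str,
                    memory: str = "2g", cpus: str = "1.0") -> list[str]:
    """Build a `docker run` command with resource caps and automatic cleanup."""
    return [
        "docker", "run", "--rm",   # --rm removes the container when it exits
        "--memory", memory,        # hard memory limit
        "--cpus", cpus,            # CPU quota
        "--network", network,      # attach to the isolated network
        image,
    ]


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the submission history; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        " agent_id TEXT NOT NULL,"
        " task_id TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " PRIMARY KEY (agent_id, task_id))"
    )
    return conn


def record_submission(conn: sqlite3.Connection, agent_id: str,
                      task_id: str, status: str) -> bool:
    """Insert one submission; return False if this agent already tried the task."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO submissions VALUES (?, ?, ?)",
                (agent_id, task_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_history()
print(record_submission(conn, "agent-a", "task-1", "valid"))    # first attempt
print(record_submission(conn, "agent-a", "task-1", "invalid"))  # repeat attempt
```

Letting the composite primary key reject the duplicate keeps the "one attempt per task" rule in the database itself rather than in application logic that could be bypassed.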

The binary-only mode is particularly clever for resource-constrained deployments. Instead of providing full compilation environments (which require gigabytes per task for source code, build dependencies, and toolchains), binary mode gives agents just the vulnerable executable and minimal runtime environment. This reduces storage from ~10TB to ~130GB while still testing the crucial skill of binary vulnerability analysis—arguably harder than source-level analysis since agents must work with disassembly and debug symbols.
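The size of that saving is easy to check with the approximate figures quoted above. The only assumption in this sketch is the use of decimal units (1 TB = 1,000 GB).

```python
FULL_DATASET_GB = 10_000  # ~10 TB, decimal units
BINARY_ONLY_GB = 130      # ~130 GB

reduction_factor = FULL_DATASET_GB / BINARY_ONLY_GB
saved_fraction = 1 - BINARY_ONLY_GB / FULL_DATASET_GB

print(f"Binary-only mode is about {reduction_factor:.0f}x smaller")
print(f"Storage saved: about {saved_fraction:.1%}")
# Binary-only mode is about 77x smaller
# Storage saved: about 98.7%
```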

Gotcha

CyberGym's infrastructure requirements will immediately rule it out for many teams. The full dataset requires 10 terabytes of storage, plus substantial computational resources to run Docker containers for each evaluation task. Even the binary-only variant demands 130GB. If you're running on a laptop or a modest cloud instance, you're not getting started without significant infrastructure investment. The researchers acknowledge this by providing a 10-task subset for initial experimentation, but a sample that small won't give you statistically meaningful benchmark results.

The framework's scope is narrower than the "cybersecurity evaluation" description might suggest. CyberGym focuses specifically on memory corruption vulnerabilities that produce detectable crashes: buffer overflows, use-after-free bugs, null pointer dereferences. It cannot evaluate agents on logic bugs, authentication bypasses, SQL injection, or any other vulnerability class that doesn't reliably crash the application. A vulnerability that leaks sensitive data without crashing? CyberGym won't validate it. An authentication bypass that returns success when it should fail? No validation mechanism exists. This focus on crash-based bugs makes sense for automated validation, but it means the benchmark covers only part of the real-world vulnerability landscape, perhaps 30-40% of vulnerability types as a rough estimate. You're not getting a complete picture of an agent's security analysis capabilities, just its memory corruption hunting skills.

Verdict

Use CyberGym if you're conducting academic research on AI agent capabilities in vulnerability discovery, need reproducible benchmarks for comparing security-focused LLMs, or want rigorous evaluation of automated exploitation tools against real-world targets. The framework excels when you have substantial infrastructure, need standardized metrics for agent comparison, and focus specifically on memory corruption vulnerabilities. Skip it if you're working with limited resources (the storage and compute requirements are punishing), need quick prototyping cycles (setup complexity is high), are evaluating vulnerability types beyond memory corruption, or are building production security tools rather than researching agent capabilities. CyberGym is a research instrument, not a practical security testing platform: approach it with the mindset of someone building a particle accelerator, not someone debugging their application.
