CVE-Bench: Testing Whether AI Agents Can Actually Hack
Hook
Within months of ChatGPT's release, researchers began asking an uncomfortable question: can these AI systems autonomously exploit security vulnerabilities? CVE-Bench provides the first systematic answer using real CVEs instead of toy problems.
Context
The security community has long relied on capture-the-flag competitions, synthetic benchmarks, and carefully constructed test environments to evaluate penetration testing tools and train security professionals. But as large language models demonstrated increasing capability at code generation and technical reasoning, a new question emerged: could AI agents autonomously discover and exploit real vulnerabilities?
Existing benchmarks fell short of answering this question. CTF challenges are designed for human creativity and often require lateral thinking that doesn't translate to realistic attack scenarios. Synthetic vulnerabilities lack the complexity and environmental factors of production systems. Meanwhile, evaluating agents against actual live systems raises obvious ethical and legal concerns.
CVE-Bench emerged from UIUC's Kang Lab to bridge this gap: a framework that uses authentic, historical CVEs from the National Vulnerability Database, packages them in reproducible Docker environments, and provides automated grading of exploitation outcomes. It's not asking whether an AI can solve puzzles. It's asking whether an AI can perform the actual work of a penetration tester.
Technical Insight
CVE-Bench's architecture centers on isolation and reproducibility. Each of the 40 critical-severity CVEs runs in its own Docker container with the vulnerable application, any required databases (PostgreSQL, MySQL), and simulated outbound servers for testing data exfiltration. The framework builds on Inspect AI, a model evaluation library, which provides standardized interfaces for different LLMs and agent architectures.
The evaluation comes in two flavors: 'zero_day' and 'one_day'. In zero-day mode, the agent receives only the target URL and must discover vulnerabilities through reconnaissance. In one-day mode, the agent gets the CVE identifier and description, simulating a scenario where vulnerability details are public but patches haven't been deployed. This distinction matters because it separates vulnerability discovery capability from exploitation capability.
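To make the difference between the two variants concrete, here is a minimal sketch of how a task prompt could be assembled for each one. The Target class, the build_prompt function, and the prompt wording are all hypothetical illustrations, not CVE-Bench's actual prompts or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical description of one benchmark target."""
    url: str
    cve_id: str
    description: str


def build_prompt(target: Target, variant: str) -> str:
    """Return an illustrative task prompt for the given variant."""
    if variant == "zero_day":
        # Zero-day: the agent only learns where the application lives.
        return f"Find and exploit a vulnerability in the application at {target.url}."
    if variant == "one_day":
        # One-day: the agent also gets the public CVE identifier and description.
        return (
            f"Exploit {target.cve_id} in the application at {target.url}.\n"
            f"Advisory: {target.description}"
        )
    raise ValueError(f"unknown variant: {variant!r}")
```

The only thing that changes between the two branches is how much the agent is told up front, which is what lets the benchmark separate discovery from exploitation.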
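Here's roughly how you'd launch a single-CVE evaluation through Inspect's Python API. Treat the task name and the "variant" task argument as illustrative, since the exact identifiers depend on the CVE-Bench release you install:
from inspect_ai import eval

# Evaluate GPT-4 on CVE-2021-41773 (Apache path traversal).
# The task name and the "variant" task argument are illustrative.
logs = eval(
    "cve_bench/cve_2021_41773",
    model="openai/gpt-4",
    task_args={"variant": "one_day"},  # or "zero_day"
    message_limit=30,  # cap on conversation messages per sample
    sandbox="docker",
)

# eval() returns one EvalLog per task; aggregate scores live under log.results.
for log in logs:
    print(f"Status: {log.status}")
    if log.results is not None:
        for score in log.results.scores:
            print(f"{score.name}: {score.metrics}")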
The grading system defines eight concrete exploitation criteria: denial of service, file access, file creation, database modification, database access, unauthorized administrator login, privilege escalation, and outbound service requests. Each CVE maps to one or more criteria based on its nature. The graders are automated Python functions that verify outcomes, such as whether specific files were read, database records were modified, or a user's privileges were elevated.
What makes this design effective is that the graders do not depend on a reference exploit. Instead of comparing the agent's actions against a known solution, they verify postconditions. For a SQL injection vulnerability, the grader doesn't check whether the agent used a particular payload; it checks whether the database contains evidence of successful injection. For RCE, it verifies whether a specific file was created or command output was captured. This approach reduces the risk of data contamination (there are no published solutions for a model to memorize) while remaining robust to different exploitation paths.
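The postcondition idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not CVE-Bench's grader code: the users table, the marker file path, and the rule that a run passes when any applicable criterion is met are all hypothetical choices made for the sketch, and an in-memory SQLite database stands in for the application's real database.

```python
import os
import sqlite3
import tempfile


def grade_file_creation(path: str) -> bool:
    """Pass if the marker file exists, regardless of how it was created."""
    return os.path.isfile(path)


def grade_database_modification(conn: sqlite3.Connection, baseline: int) -> bool:
    """Pass if the row count of the (hypothetical) users table changed."""
    (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
    return count != baseline


def grade(checks: dict) -> float:
    """Assumed scoring rule: 1.0 if any applicable criterion is met, else 0.0."""
    return 1.0 if any(checks.values()) else 0.0


def demo() -> tuple:
    """Grade a simulated environment before and after a simulated attack."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin')")
    marker = os.path.join(tempfile.mkdtemp(), "pwned")

    def run_checks() -> float:
        return grade({
            "file_creation": grade_file_creation(marker),
            "database_modification": grade_database_modification(conn, baseline=1),
        })

    before = run_checks()
    # Stand-in for whatever payload the agent used; the grader never sees it.
    conn.execute("INSERT INTO users VALUES ('attacker')")
    after = run_checks()
    conn.close()
    return before, after
```

Nothing in the graders inspects the agent's requests. Any exploitation path that leaves the same observable state behind would receive the same score.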
The Docker isolation is critical for safety. Each environment is ephemeral and network-isolated except for controlled outbound servers. Agents interact purely through HTTP requests, which the framework logs comprehensively. Here's a simplified view of how an agent might interact with a vulnerable application:
import requests

# Agent attempting SQL injection on a login form
payload = "admin' OR '1'='1' -- "
response = requests.post(
    "http://vulnerable-app:8080/login",
    data={"username": payload, "password": "anything"},
    timeout=10,
)

# Simplified stand-in for the unauthorized-login check; the real
# graders verify postconditions rather than matching response text.
score = 0.0
if "Welcome, admin" in response.text:
    # Unauthorized login criterion satisfied
    score = 1.0
The framework also captures full agent traces—every HTTP request, response, and reasoning step. This transparency is essential for understanding how agents approach exploitation, what techniques they attempt, and where they fail. Researchers can replay traces to debug both agent behavior and grader logic.
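As a rough illustration of what a replayable trace involves, the sketch below records events in order and serializes them as JSON lines. It is a hypothetical structure for explanation only: the TraceEvent and Trace classes and their field names are assumptions, and this is not the log format that Inspect AI or CVE-Bench actually writes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: int      # position of the event in the run
    kind: str      # "reasoning", "request", or "response"
    payload: dict  # free-form details, e.g. URL, status code, or text


@dataclass
class Trace:
    events: list = field(default_factory=list)

    def record(self, kind: str, payload: dict) -> None:
        """Append an event, numbering it by its position in the trace."""
        self.events.append(TraceEvent(len(self.events), kind, payload))

    def dumps(self) -> str:
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    @classmethod
    def loads(cls, text: str) -> "Trace":
        """Rebuild a trace from its JSON-lines form for replay."""
        trace = cls()
        for line in text.splitlines():
            trace.events.append(TraceEvent(**json.loads(line)))
        return trace

    def of_kind(self, kind: str) -> list:
        """Return only the events of one kind, in original order."""
        return [e for e in self.events if e.kind == kind]
```

Filtering a reloaded trace by kind, for example with of_kind("request"), is one simple way to see which techniques an agent tried and in what order.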
One architectural choice worth noting: CVE-Bench deliberately doesn't provide the vulnerable source code to agents. This mirrors real-world penetration testing where attackers have black-box or limited gray-box access. Agents must infer application behavior from HTTP responses, error messages, and side channels. This significantly raises the difficulty bar compared to code-analysis benchmarks where the agent can directly inspect vulnerable functions.
Gotcha
The biggest limitation is architecture compatibility. CVE-Bench officially supports only amd64 systems, with experimental and often problematic arm64 support. If you're running Apple Silicon or ARM-based cloud instances, expect to spend time wrestling with Docker emulation, and some container images may simply fail to build. For a growing share of developers, that is a real barrier to using the benchmark rather than a minor inconvenience.
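If you want to check for this problem before starting a long run, a small preflight along the following lines may help. This is a sketch, not a workaround documented by CVE-Bench: the helper names are hypothetical, and while DOCKER_DEFAULT_PLATFORM is a standard Docker CLI environment variable that requests amd64 images, emulated containers can be slow and some images may still fail to build or run.

```python
import os
import platform

# Machine names commonly reported on ARM hosts (macOS and Linux respectively).
ARM_MACHINES = {"arm64", "aarch64"}


def needs_emulation(machine: str) -> bool:
    """Return True if the host architecture is ARM rather than amd64."""
    return machine.lower() in ARM_MACHINES


def docker_env(machine: str, base_env: dict) -> dict:
    """Copy base_env, requesting amd64 images when the host is ARM."""
    env = dict(base_env)
    if needs_emulation(machine):
        env["DOCKER_DEFAULT_PLATFORM"] = "linux/amd64"
    return env


def env_for_this_host() -> dict:
    """Build the environment for the machine this script is running on."""
    return docker_env(platform.machine(), dict(os.environ))
```

The resulting dictionary can be passed as the env argument to subprocess.run when invoking Docker, which keeps the override scoped to the benchmark instead of changing your shell globally.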
The lack of released exploit solutions is both a feature and a frustration. While withholding manual exploits prevents data contamination and keeps the benchmark valid as LLMs train on increasingly broad datasets, it makes verification difficult. If your agent fails to exploit a CVE, you can't easily compare against a reference implementation to understand whether the failure stems from agent limitations or potential grader bugs. The automated graders are well-designed, but without ground truth exploits, debugging edge cases becomes speculative. Educational use cases also suffer—security researchers who want to understand exploitation techniques won't find them here.
Finally, the web-only focus is a significant scope limitation. Real-world penetration testing involves privilege escalation on operating systems, exploiting binary vulnerabilities, cryptographic attacks, social engineering vectors, and infrastructure misconfigurations. CVE-Bench's exclusive focus on web application CVEs means you're only evaluating one dimension of offensive security capability. An agent that scores well here might still fail completely at binary exploitation or network penetration scenarios.
Verdict
Use CVE-Bench if you're researching AI safety in offensive security contexts, building autonomous penetration testing tools, or evaluating whether language models pose genuine exploitation risks. The real-world CVE basis makes it the gold standard for authentic vulnerability exploitation assessment, and the Docker isolation provides safe, reproducible evaluation infrastructure. It's particularly valuable if you're publishing research on AI agent capabilities and need a credible, non-synthetic benchmark that the security community will respect.
Skip it if you're on ARM architecture without patience for emulation hassles, need transparent exploit solutions for debugging or education, or want to evaluate agent capabilities beyond web application security. Also skip it if you're building general-purpose coding agents: this benchmark tests a very specific, adversarial skill set that doesn't generalize to normal software development tasks.