CVE-Bench: Testing Whether AI Agents Can Exploit Real-World Vulnerabilities

Hook

What happens when you give GPT-4 a vulnerable web application and ask it to exploit a real CVE? CVE-Bench provides a comprehensive answer, and the results have implications for both AI safety and offensive security automation.

Context

The AI safety community has a problem: we don’t have good ways to measure whether language models can autonomously exploit security vulnerabilities. Most existing benchmarks use synthetic vulnerabilities, simplified CTF-style challenges, or educational platforms like WebGoat that bear little resemblance to real-world exploitation. Meanwhile, researchers building AI agents for penetration testing have no standardized way to evaluate their systems’ capabilities.

CVE-Bench, developed by the UIUC Kang Lab, fills this gap by creating an evaluation framework around 40 critical-severity CVEs from the National Vulnerability Database. Instead of testing whether agents can answer security trivia or solve artificial challenges, it asks: can an AI agent actually exploit a real vulnerability that affected production systems? The benchmark has already won recognition, taking second place in Berkeley RDI’s AgentX Competition (AI Safety & Alignment Research Track) and winning second prize in the SafeBench competition for ML Safety benchmarks. It was also accepted as a spotlight paper at ICML 2025. This isn’t academic theater—it’s a rigorous attempt to measure a capability that matters deeply for AI safety.

Technical Insight

CVE-Bench is built on the Inspect AI framework and uses Docker to orchestrate isolated environments for each vulnerability. Every CVE gets its own containerized web application in a vulnerable state, and the framework evaluates agents across eight distinct attack outcomes: denial of service, file access, remote code execution, database modification, database access, unauthorized administrator login, privilege escalation, and outbound service requests. This multi-dimensional approach means an agent isn’t just scored on “success” or “failure”—you can see exactly what types of attacks it can execute.

The evaluation workflow is straightforward. After installing dependencies with uv sync --dev, you can run evaluations against any model that Inspect supports:

./run eval --model=openai/gpt-4o-2024-11-20

This command spins up Docker containers for all 40 CVEs and evaluates the model across both “zero-day” and “one-day” variants. The zero-day variant provides minimal information—essentially what an attacker would know when first discovering a vulnerability. The one-day variant includes CVE details, simulating the scenario where vulnerability information has been publicly disclosed but patches haven’t been widely deployed. This distinction matters because it tests different threat models: sophisticated discovery versus rapid exploitation of known vulnerabilities.
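Selecting one variant at a time is presumably done through an Inspect task parameter, the same `-T` mechanism the benchmark uses for other options. The parameter name `variant` below is an assumption on my part; check the repository's README for the exact interface:

```shell
# Hypothetical: evaluate only the one-day variant (the `variant` task
# parameter name is an assumption -- consult the CVE-Bench README).
./run eval --model=openai/gpt-4o-2024-11-20 -T variant=one-day
```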

Each challenge includes automated graders that verify exploitation success without manual validation. For instance, if the attack outcome is “remote code execution,” the grader checks whether the agent successfully executed the file at /tmp/pwn. For database modification, it verifies that data was actually changed. This automation is crucial for reproducible benchmarking—you can’t build a reliable benchmark if every evaluation requires a human security expert to manually verify whether exploitation succeeded.
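A minimal sketch of what such a grader check might look like, assuming (hypothetically, this is not CVE-Bench's actual grader code) that executing `/tmp/pwn` inside the target drops a marker file the grader can look for:

```shell
#!/usr/bin/env sh
# Hypothetical RCE grader sketch -- not the benchmark's real implementation.
# Assumption: a successful run of /tmp/pwn leaves a marker file behind.
MARKER="${MARKER:-/tmp/pwn_executed}"   # hypothetical marker path

grade_rce() {
  # Pass/fail is decided purely from observable system state,
  # so no human needs to inspect the agent's transcript.
  if [ -f "$MARKER" ]; then
    echo "rce: success"
  else
    echo "rce: failure"
  fi
}

grade_rce
```

The real graders are open-sourced in the repository, so the authoritative logic for each of the eight outcomes can be read there.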

The Docker-based architecture serves multiple purposes beyond convenience. First, it provides security isolation—you’re running actual exploitation attempts, and you need containment. Second, it ensures reproducibility. The vulnerable applications are frozen in known states, so evaluations are consistent across runs. Third, it makes the benchmark accessible. Researchers don’t need to set up complex vulnerable environments manually; they can pull pre-built images and start evaluating immediately.

Version 2.1.0 introduced a significant change: arbitrary file upload was removed as an evaluation criterion and replaced with remote code execution. According to the release notes, this change was informed by the Agentic Benchmark Checklist (ABC) and reflects a more meaningful security assessment—uploading a file is often just a step toward achieving RCE, which is the actual high-impact outcome attackers care about. The benchmark is actively evolving based on real-world security priorities rather than staying static.

For developers working on specific CVEs, the framework provides granular control. You can evaluate particular challenges:

./run eval --model=openai/gpt-4o-2024-11-20 -T challenges=CVE-2023-37999,CVE-2024-2771

This flexibility is essential for debugging agent behavior or focusing research on specific vulnerability classes. The repository also includes developer commands like ./run up TASK to start containers for manual testing, ./run sql-dump TASK to inspect database state, and ./run test-solution to verify that reference exploits still work against updated containers.

Gotcha

CVE-Bench’s limitations are substantial and directly tied to its design philosophy. The biggest constraint is architectural: the README explicitly recommends running on amd64 machines, with arm64 support marked as experimental. If you’re running Apple Silicon or another ARM-based system, you’re in for potential compatibility headaches. This isn’t a casual inconvenience—Docker images for vulnerable applications often depend on specific binaries and dependencies that may not have ARM equivalents or may behave differently.

The team deliberately chose not to release manual exploit solutions to prevent data contamination—they don’t want benchmark tasks leaking into training data for future models. While this makes sense from a benchmark integrity perspective, it creates friction for researchers. If your evaluation fails, you can’t easily compare your agent’s approach to a known working exploit. There’s one example solution for CVE-2024-2624 in the repository, but for the other 39 CVEs, you’re debugging blind. The README notes that “open-sourced graders are sufficient for evaluating models or agents,” but this asymmetry still favors researchers who already have strong offensive security expertise and can independently verify whether failures are due to agent limitations or environmental issues.

The infrastructure requirements are non-trivial. You need Docker expertise, enough system resources to run multiple containers simultaneously, and comfort with container orchestration. The README assumes you can navigate post-installation steps for running Docker as a non-root user on Linux (explicitly recommended), understand image building with Docker Buildx Bake, and debug container networking issues when they inevitably arise. This isn’t a benchmark you can casually run in a Jupyter notebook—it’s infrastructure-heavy evaluation that requires DevOps competence alongside AI research skills.
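The non-root setup the README points to is Docker's standard Linux post-installation procedure; as a setup fragment it looks like this (run once, then log out and back in or use `newgrp`):

```shell
# Docker's documented post-install steps for non-root use on Linux.
sudo groupadd docker            # create the docker group if it doesn't exist
sudo usermod -aG docker $USER   # add the current user to the group
newgrp docker                   # apply the new group in the current shell
docker run hello-world          # verify non-root access works
```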

Verdict

Use CVE-Bench if you’re researching AI safety and need ground truth on whether language models can autonomously exploit real vulnerabilities, or if you’re building AI-powered penetration testing tools and need rigorous evaluation beyond anecdotal evidence. This is a comprehensive benchmark for measuring practical offensive security capabilities of AI agents, and the distinction between zero-day and one-day variants lets you test different threat scenarios systematically. The eight attack outcome categories give you nuanced insight into what types of exploitation your agent can achieve.

Skip it if you’re doing general capability benchmarking unrelated to security, lack Docker infrastructure and expertise, or need fast-running evaluations for rapid iteration. The setup complexity is substantial, the amd64 requirement is restrictive (with arm64 support only experimental), and the absence of published exploit solutions makes debugging difficult unless you already have strong offensive security skills. This is a specialized tool for a critical but narrow research domain—evaluate whether you’re actually in that domain before investing the setup effort.
