SEC-bench: Automated Benchmarking for LLM Agents on Real-World Security Vulnerabilities

Hook

Most LLM coding benchmarks test whether AI can solve GitHub issues. SEC-bench asks a harder question: can your AI agent actually patch real-world memory safety vulnerabilities like those found in OpenJPEG?

Context

The explosion of LLM-powered coding agents has created a measurement problem. While benchmarks like SWE-bench evaluate agents on general software engineering tasks, they don’t specifically test security expertise—arguably the most critical domain where mistakes have real-world consequences. A coding agent that excels at general programming tasks might completely fail at understanding use-after-free vulnerabilities or crafting proof-of-concept exploits.

SEC-bench, accepted at NeurIPS 2025, addresses this gap with a framework that automates the entire pipeline from raw vulnerability data to reproducible security benchmarks. Instead of manually curating test cases, it mines the OSV database and CVE records, extracts vulnerability reports from GitHub and GitLab, generates Docker-based reproduction environments, and evaluates whether agents can both generate proof-of-concept exploits and patch the underlying bugs. The research team from UIUC and Purdue University focused on real-world vulnerabilities from OSS-Fuzz projects, creating a benchmark grounded in actual security incidents rather than synthetic problems.

Technical Insight

SEC-bench’s architecture solves a deceptively hard problem: how do you automatically convert a CVE identifier into a reproducible test environment? The framework uses a three-stage preprocessor pipeline that transforms raw vulnerability metadata into containerized instances. The first stage parses OSV database files to extract seed data, filtering by language, vulnerability type, and project whitelist. The second stage follows reference URLs to extract actual bug reports from issue trackers. The third stage generates project configurations that specify how to build vulnerable and patched versions of the code.

The build process creates layered Docker images. Base images contain the common dependencies and build toolchains for each project. Instance-specific images layer on top, checking out the exact vulnerable commit and configuring sanitizers (like AddressSanitizer for memory bugs). This separation means you build the expensive base images once and can quickly generate hundreds of vulnerability instances. The framework currently focuses on C/C++ vulnerabilities from OSS-Fuzz, where sanitizer instrumentation provides ground truth about whether a vulnerability exists.
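A hypothetical sketch of that layering (image names, the commit placeholder, and flags are illustrative; SEC-bench generates its Dockerfiles from project configurations):

```dockerfile
# Base image: shared toolchain and dependencies, built once per project
FROM ubuntu:22.04 AS secb-base-openjpeg
RUN apt-get update && apt-get install -y clang cmake git

# Instance image: layers the exact vulnerable commit on top of the base,
# compiled with AddressSanitizer so crashes become objective signals
FROM secb-base-openjpeg
RUN git clone https://github.com/uclouvain/openjpeg /src \
    && cd /src && git checkout <vulnerable-commit>
ENV CFLAGS="-fsanitize=address -g" CXXFLAGS="-fsanitize=address -g"
RUN cd /src && cmake . && make
```

Because the first `FROM` stage never changes between instances, Docker's layer cache makes each additional instance build cheap relative to the base.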

Here’s what the basic workflow looks like in practice:

# Download OSV database and extract seeds for C/C++ CVEs
./run_preprocessor.sh seed \
    --input-dir ./output/osv \
    --output-file ./output/seed.jsonl

# Extract bug reports from OSS-Fuzz projects
./run_preprocessor.sh report \
    --input-file ./output/seed.jsonl \
    --type CVE \
    --oss-fuzz \
    --lang C,C++

# Generate minimal project configurations with sanitizers
./run_preprocessor.sh project \
    --input-file ./output/report-cve-oss-c-cpp.jsonl \
    --sanitizer-only \
    --minimal

# Build base images for all projects
python -m secb.preprocessor.build_base_images

# Build specific vulnerability instance
python -m secb.preprocessor.build_instance_images \
    --input-file ./output/project-cve-oss-c-cpp-sanitizer-minimal.jsonl \
    --ids openjpeg.cve-2024-56827

The evaluation harness supports multiple agent frameworks—SWE-agent, OpenHands, Aider, and smolagents—through a standardized interface. Each agent receives the vulnerable codebase and one of two tasks: generate a PoC that triggers the bug, or patch the vulnerability. The containerized environment provides isolation and reproducibility: every agent run starts from the same filesystem state, and sanitizer output provides objective verification of whether the vulnerability was triggered or fixed.
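The "objective verification" step boils down to inspecting the program's output for a sanitizer report. A minimal sketch of that check (the function name is illustrative, not SEC-bench's actual API; the report format is AddressSanitizer's standard one):

```python
import re

def triggered_sanitizer(run_output: str) -> bool:
    """Return True if AddressSanitizer reported an error in the program output.

    ASan reports begin with 'ERROR: AddressSanitizer:' followed by the bug
    class, e.g. heap-use-after-free or heap-buffer-overflow.
    """
    return re.search(r"ERROR: AddressSanitizer: [\w-]+", run_output) is not None

# Fragment of a typical ASan report versus a clean run
crash = "==1234==ERROR: AddressSanitizer: heap-use-after-free on address 0x602..."
print(triggered_sanitizer(crash))        # True
print(triggered_sanitizer("exit code 0"))  # False
```

Grepping for the sanitizer banner is far more reliable than checking exit codes alone, since a PoC may exit nonzero for unrelated reasons.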

What makes this architecture powerful is the verification layer implemented through the separate SecVerifier repository. Before an instance enters the benchmark, it must pass automated checks confirming that the vulnerable version actually triggers the sanitizer and the patched version doesn’t. This quality gate prevents broken instances from contaminating evaluation results—a critical concern when dealing with complex build systems and compiler toolchains.
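The gate's logic can be sketched as a pair of runs and a simple predicate (names are illustrative, not SecVerifier's actual interface):

```python
from dataclasses import dataclass

@dataclass
class InstanceCheck:
    vulnerable_crashed: bool  # sanitizer fired on the vulnerable build
    patched_crashed: bool     # sanitizer fired on the patched build

def passes_quality_gate(check: InstanceCheck) -> bool:
    # Admit an instance only if the PoC triggers the sanitizer on the
    # vulnerable build AND the patched build runs clean.
    return check.vulnerable_crashed and not check.patched_crashed

print(passes_quality_gate(InstanceCheck(True, False)))  # True: valid instance
print(passes_quality_gate(InstanceCheck(True, True)))   # False: patch ineffective
print(passes_quality_gate(InstanceCheck(False, False))) # False: PoC never fires
```

Both failure modes matter: an instance whose PoC never fires would award free points for "patching", while one whose patch still crashes would penalize correct fixes.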

The multi-agentic system for automated instance creation deserves special attention. Rather than requiring security researchers to manually curate each vulnerability, SEC-bench uses LLMs themselves to parse bug reports, identify relevant commits, and generate build configurations. This meta-level application of LLMs to create LLM benchmarks significantly reduces the manual labor required to maintain a growing benchmark as new vulnerabilities are disclosed.

Gotcha

SEC-bench demands significant infrastructure commitment. The documentation recommends over 200GB of disk space, and building a comprehensive set of instances can take hours even on powerful hardware. Each Docker image contains a full build environment with source code, dependencies, and compiled artifacts. If you’re working on a laptop or in a resource-constrained environment, you’ll need to be selective about which instances you build.

The current focus on C/C++ vulnerabilities from OSS-Fuzz projects limits applicability. If you’re developing agents for web application security, memory-safe languages, or other domains, SEC-bench won’t directly measure what you care about. The framework is extensible in principle—the preprocessor pipeline could handle other languages and vulnerability sources—but the existing dataset and tooling are optimized for memory safety bugs caught by sanitizers.

Additionally, while the README mentions support for multiple agent frameworks, getting each one configured and integrated requires navigating their respective documentation and APIs. The evaluation harness provides standardization, but you’ll still need to understand how each agent framework expects to receive tasks and return results. The setup complexity makes SEC-bench better suited for dedicated research projects than quick experiments.

Verdict

Use SEC-bench if you’re doing serious research on LLM agents for security tasks, need reproducible evaluation of vulnerability patching capabilities, or want to compare agent frameworks on real-world CVEs rather than synthetic problems. It’s the right choice when you have the computational resources, Docker expertise, and time to invest in proper setup. The automated benchmark generation and verification pipeline pays dividends if you plan to continuously evaluate agents as new vulnerabilities emerge. Skip it if you need rapid prototyping, work outside C/C++ ecosystems, lack Docker infrastructure, or want general coding benchmarks rather than security-specific evaluation. The 200GB+ disk requirement and multi-hour build times make it impractical for casual experimentation. Also skip if you’re focused on security domains beyond memory safety vulnerabilities—web security, cryptographic bugs, or logic flaws aren’t currently covered.
