SEC-bench: Automated Benchmarking for LLM Security Agents in Real-World Vulnerability Scenarios
Hook
While researchers rush to claim their LLM agents can write code like senior developers, there’s a glaring blind spot: can these agents actually find and fix the security vulnerabilities that cost companies millions? SEC-bench is the first automated framework that answers this question with real CVEs instead of toy examples.
Context
The landscape of LLM-powered coding agents has exploded in the past two years, with tools like GitHub Copilot, Cursor, and various autonomous agents claiming to accelerate software development. Yet the security implications remain largely unexplored territory. Existing benchmarks like SWE-bench focus on general software engineering tasks—implementing features, fixing bugs, passing unit tests—but completely sidestep the specialized domain of security vulnerabilities. This gap is dangerous: a coding agent that can implement a REST API but misses SQL injection vulnerabilities isn’t just unhelpful; it’s actively harmful.
SEC-bench emerged from NeurIPS 2025 research addressing this critical evaluation gap. The framework tackles two distinct security challenges: offensive capabilities (generating proof-of-concept exploits for known vulnerabilities) and defensive capabilities (patching those vulnerabilities). By automating the entire pipeline from vulnerability database ingestion through containerized environment creation to agent evaluation, SEC-bench transforms what was previously a manual, inconsistent process into a reproducible benchmark. It pulls real vulnerabilities from the OSV database and OSS-Fuzz projects, creating Docker-based instances where agents can be tested against actual security flaws that affected production codebases. This isn’t academic theater—these are the same vulnerabilities that security researchers and attackers discover in the wild.
Technical Insight
SEC-bench’s architecture revolves around three interconnected stages that transform raw vulnerability data into executable benchmark environments. The preprocessor stage is where the magic begins: it queries vulnerability databases (OSV, CVE), extracts structured information about each vulnerability, and—critically—attempts to harvest bug reports from reference URLs. This step is more sophisticated than simple web scraping; the framework uses a multi-agent system to parse various bug report formats, identify relevant technical details, and generate project-specific configurations. The output is a standardized benchmark definition that can be fed into subsequent stages.
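The preprocessing step can be sketched in a few lines. The OSV JSON schema and its public query endpoint are real; the helper functions, field choices, and the trimmed sample record below are illustrative assumptions, not SEC-bench's actual code.

```python
# Sketch of the kind of preprocessing SEC-bench performs: pulling a record
# from the OSV database and extracting the fields a benchmark instance needs.
# Helper names are hypothetical; only the OSV JSON schema and API are real.
import json
from urllib.request import urlopen


def fetch_osv_record(vuln_id: str) -> dict:
    """Fetch a raw vulnerability record from the public OSV API."""
    with urlopen(f"https://api.osv.dev/v1/vulns/{vuln_id}") as resp:
        return json.load(resp)


def extract_instance_fields(record: dict) -> dict:
    """Pull out what a benchmark instance needs: the introducing/fixing
    commits and candidate bug-report URLs for the multi-agent parser."""
    events = [
        ev
        for affected in record.get("affected", [])
        for rng in affected.get("ranges", [])
        if rng.get("type") == "GIT"
        for ev in rng.get("events", [])
    ]
    return {
        "id": record["id"],
        "introduced": next((e["introduced"] for e in events if "introduced" in e), None),
        "fixed": next((e["fixed"] for e in events if "fixed" in e), None),
        "report_urls": [r["url"] for r in record.get("references", [])],
    }


# A trimmed record in OSV's schema, standing in for a live API response
sample = {
    "id": "OSV-2021-777",
    "affected": [{"ranges": [{"type": "GIT", "events": [
        {"introduced": "abc123"}, {"fixed": "def456"},
    ]}]}],
    "references": [{"type": "REPORT", "url": "https://bugs.example.org/1234"}],
}
fields = extract_instance_fields(sample)
print(fields["introduced"], fields["fixed"])  # abc123 def456
```

The reference URLs are what feeds the bug-report harvesting step: each one is a candidate page for the multi-agent parser to mine for technical details.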
The instance builder stage is where SEC-bench demonstrates its engineering sophistication. Each vulnerability gets its own Docker container with the vulnerable codebase at the exact commit where the vulnerability existed. But here’s the clever part: the framework doesn’t just copy code into containers. It integrates sanitizers (AddressSanitizer, UndefinedBehaviorSanitizer, MemorySanitizer) to provide runtime feedback about memory safety violations and undefined behavior. This means when an agent generates a PoC exploit, the framework can automatically verify whether it actually triggers the vulnerability through sanitizer output, rather than relying on brittle pattern matching or manual inspection. The containerization ensures reproducibility—the same vulnerability instance will behave identically whether you’re running it on a laptop or a compute cluster.
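Concretely, the builder's job reduces to two moves: pin the exact vulnerable commit, and make sure the project compiles with sanitizer instrumentation. Here is a minimal sketch under those assumptions; the compiler flags are the real Clang/GCC ones, but the function name, data shapes, and repository URL are hypothetical.

```python
# Minimal sketch of what an instance builder has to assemble: check out the
# vulnerable commit and compile with sanitizer instrumentation inside a
# container. Flag names are real Clang/GCC flags; everything else is
# illustrative, not SEC-bench's actual builder.

SANITIZER_FLAGS = {
    "asan": "-fsanitize=address",
    "ubsan": "-fsanitize=undefined",
    "msan": "-fsanitize=memory",
}


def build_plan(repo_url: str, vulnerable_commit: str, sanitizers: list[str]) -> dict:
    """Return the steps a builder would run inside the container."""
    cflags = " ".join(SANITIZER_FLAGS[s] for s in sanitizers)
    # -g and frame pointers keep sanitizer stack traces readable
    cflags += " -g -fno-omit-frame-pointer"
    return {
        "clone": ["git", "clone", repo_url, "/src"],
        # Pin the exact commit where the vulnerability existed
        "checkout": ["git", "-C", "/src", "checkout", vulnerable_commit],
        # Export flags so the project's own build system picks them up
        "env": {"CFLAGS": cflags, "CXXFLAGS": cflags},
    }


plan = build_plan("https://github.com/example/libfoo.git", "deadbeef", ["asan", "ubsan"])
print(plan["env"]["CFLAGS"])  # -fsanitize=address -fsanitize=undefined -g -fno-omit-frame-pointer
```

Exporting the flags through `CFLAGS`/`CXXFLAGS` rather than patching build scripts is what lets the same recipe work across projects with different build systems.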
The evaluator stage is where SEC-bench interfaces with various agent frameworks through a plugin architecture. Here’s a simplified, illustrative example of how an agent might be invoked for a patching task (the exact class and parameter names are a sketch, not the verbatim API):
```python
# Simplified SEC-bench agent evaluation interface
from secbench.evaluator import AgentRunner
from secbench.tasks import PatchingTask

# Load a vulnerability instance
task = PatchingTask.from_config(
    vulnerability_id="CVE-2023-12345",
    container_image="secbench/instance-12345",
    sanitizers=["asan", "ubsan"],
)

# Initialize agent (supports multiple frameworks)
runner = AgentRunner(
    agent_type="sweagent",  # or "openhands", "aider", etc.
    model="gpt-4",
    max_iterations=20,
)

# Run the agent on the patching task
result = runner.evaluate(
    task=task,
    mode="patch",
    timeout_minutes=30,
)

# Check if the patch successfully fixes the vulnerability
if result.patch_applied:
    # Run the sanitizer-instrumented test suite
    verification = task.verify_patch(
        patch=result.patch_content,
        run_poc=True,  # ensure the PoC no longer triggers the vulnerability
    )
    print(f"Patch valid: {verification.is_valid}")
    print(f"Sanitizer errors: {verification.sanitizer_output}")
```
What makes this architecture particularly powerful is the dual-task evaluation model. In PoC generation mode, the agent receives the vulnerability description and codebase, then attempts to write an exploit that triggers the bug. The sanitizers provide ground truth: if AddressSanitizer detects a heap buffer overflow when running the generated PoC, the agent successfully understood and exploited the vulnerability. In patching mode, the agent must both apply a fix and ensure that fix doesn’t break existing functionality while eliminating the vulnerability. This is verified by running both the project’s test suite and the PoC—the tests must pass, and the PoC must no longer trigger sanitizer errors.
The plugin architecture deserves special attention because it reveals SEC-bench’s philosophy about agent evaluation. Rather than coupling tightly to one agent framework, the system defines a minimal interface that any agent must satisfy: accept a task description, interact with a containerized environment, and produce either a PoC or a patch. This allows researchers to compare fundamentally different agent architectures (ReAct-style agents, chain-of-thought agents, multi-agent systems) on identical vulnerability instances. The framework handles all the complexity of container lifecycle management, sanitizer configuration, and result verification, letting researchers focus on agent design rather than benchmark infrastructure.
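That minimal agent contract might look like the following structural interface. This is a sketch of the idea, not SEC-bench's actual plugin API: all names here are hypothetical.

```python
# Sketch of a minimal agent contract: accept a task description, act inside
# a container, return an artifact (PoC or patch). Names are hypothetical.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AgentResult:
    success: bool
    artifact: str  # a PoC script or a unified diff, depending on the mode


class SecurityAgent(Protocol):
    def run(self, task_description: str, container_id: str, mode: str) -> AgentResult:
        """mode is 'poc' or 'patch'; the agent interacts with the container."""
        ...


class EchoAgent:
    """Trivial stand-in showing that any class matching the Protocol's
    signature plugs in, without inheriting from a framework base class."""

    def run(self, task_description: str, container_id: str, mode: str) -> AgentResult:
        return AgentResult(success=False, artifact=f"# no-op {mode} for {container_id}")


def evaluate(agent: SecurityAgent, task_description: str, container_id: str, mode: str) -> AgentResult:
    # The harness owns container lifecycle and verification; the agent only acts
    return agent.run(task_description, container_id, mode)


result = evaluate(EchoAgent(), "CVE-2023-12345: heap overflow in parser", "instance-12345", "patch")
print(result.artifact)  # # no-op patch for instance-12345
```

A structural (duck-typed) contract like this is what makes ReAct agents, chain-of-thought agents, and multi-agent systems interchangeable from the harness's point of view.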
One subtle but important design choice is how SEC-bench handles temporal consistency. Vulnerabilities are pinned to specific commits—the vulnerable commit and the patching commit. This creates a clean before/after comparison and prevents agents from accidentally using post-fix code. The framework maintains a knowledge cutoff boundary: agents should only access information that would have been available to a developer at the time of the vulnerable commit. This prevents trivial solutions where agents simply retrieve known patches from their training data.
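The knowledge-cutoff boundary can be sketched as a filter over repository history: given the pinned vulnerable commit, the agent may only see commits up to that commit's timestamp. The data shapes and function name below are illustrative assumptions.

```python
# Sketch of a temporal-consistency filter: the agent's view of the repository
# is truncated at the pinned vulnerable commit, so the later upstream fix is
# never visible. Data shapes and names are hypothetical.
from datetime import datetime, timezone


def visible_history(commits: list[dict], vulnerable_sha: str) -> list[dict]:
    """Commits the agent is allowed to see: everything up to and including
    the vulnerable commit, never the later fix."""
    cutoff = next(c["date"] for c in commits if c["sha"] == vulnerable_sha)
    return [c for c in commits if c["date"] <= cutoff]


history = [
    {"sha": "aaa111", "date": datetime(2023, 1, 5, tzinfo=timezone.utc)},   # earlier work
    {"sha": "bbb222", "date": datetime(2023, 2, 10, tzinfo=timezone.utc)},  # vulnerable commit
    {"sha": "ccc333", "date": datetime(2023, 3, 1, tzinfo=timezone.utc)},   # upstream fix
]
allowed = visible_history(history, "bbb222")
print([c["sha"] for c in allowed])  # ['aaa111', 'bbb222']
```

Note that this guards only what the environment exposes; it cannot remove a known patch from the model's training data, which is why instance curation still matters.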
Gotcha
The infrastructure requirements for SEC-bench are substantial and non-negotiable. You’ll need over 200GB of disk space for vulnerability instances, a properly configured Docker environment with sufficient memory (8GB+ recommended per container), API keys for your LLM provider, and possibly a GitHub token to avoid rate limits. This isn’t something you’ll casually spin up on a laptop for a weekend experiment. The documentation suggests a dedicated Linux machine or cloud instance, and it isn’t exaggerating. Budget at least a full day for initial setup, longer if you hit Docker networking issues or API rate limits during benchmark generation.
The language coverage limitation is more subtle but equally important. While the framework theoretically supports multiple languages, the examples and documented instances heavily favor C/C++ projects, likely because sanitizers are most mature in that ecosystem. If your research focuses on Python, JavaScript, or Rust vulnerabilities, you’ll be pioneering new territory within SEC-bench rather than leveraging existing instances. The sanitizer integration that makes automatic verification so powerful in C/C++ doesn’t translate cleanly to memory-safe languages where vulnerability classes differ dramatically. You might find yourself writing custom verification logic, which defeats much of the automation advantage. Additionally, the dependency on the separate SecVerifier repository for validation adds another moving part—version mismatches or breaking changes between SEC-bench and SecVerifier can cause frustrating debugging sessions where benchmark results become unreliable.
Verdict
Use SEC-bench if you’re conducting academic research on LLM agent capabilities in security domains, need reproducible benchmarks for comparing agent architectures on real-world vulnerabilities, or are building security-focused coding assistants that need rigorous evaluation beyond ‘does it compile.’ The automated pipeline and sanitizer integration provide evaluation rigor that’s nearly impossible to achieve manually. Skip it if you’re working on lightweight security tooling without heavy infrastructure, focusing on languages outside the C/C++ ecosystem, need immediate results without extensive setup time, or prefer curated vulnerability datasets over automated generation. Individual developers exploring security agent concepts will find the infrastructure overhead overwhelming—this is a framework built for research labs and teams with dedicated compute resources, not hobbyist experiments.