SEC-bench: A NeurIPS Framework for Benchmarking LLM Agents Against Real Security Vulnerabilities

Hook

When researchers tested leading LLM agents on real-world security vulnerabilities, they discovered these AI systems couldn't even reproduce most CVE exploits—let alone patch them. SEC-bench exposes exactly how wide this gap is.

Context

The explosion of LLM-powered coding agents like GitHub Copilot, Cursor, and Aider has created a fundamental question: can these systems handle security-critical code? While benchmarks like SWE-bench evaluate general software engineering tasks, they don't specifically test whether LLMs can reason about memory corruption, race conditions, or integer overflows—the bread and butter of real-world vulnerabilities.

This gap matters because security work differs fundamentally from feature development. A successful vulnerability patch requires understanding exploit mechanics, identifying root causes across complex codebases, and verifying that fixes don't introduce new attack surfaces. Meanwhile, Proof-of-Concept (PoC) generation demands creative adversarial thinking to trigger edge cases. SEC-bench, accepted at NeurIPS 2025, addresses this void by creating the first comprehensive benchmark that evaluates LLM agents specifically on real CVE instances, complete with reproducible Docker environments and automated verification.

Technical Insight

SEC-bench's architecture operates as a four-stage pipeline that transforms raw vulnerability data into executable benchmarks. The journey begins with data collection, where a multi-agentic system scrapes OSV and CVE databases to extract vulnerability metadata. This isn't simple web scraping—the system must navigate GitHub issues, GitLab merge requests, and project-specific bug trackers, each with different formats and information density. The collectors extract commit hashes, affected files, build configurations, and human-written descriptions that provide context about the vulnerability's nature.
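To make the metadata-extraction step concrete, here is a minimal sketch of pulling commit information out of a record in the public OSV schema, which stores Git ranges under `affected[].ranges[]` with single-key `introduced` and `fixed` events. The function name `extract_git_fixes` and the sample record are hypothetical, and this is not SEC-bench's own collector code. It also keeps only the last `introduced` and `fixed` event per range, which is a simplification.

```python
from typing import Any


def extract_git_fixes(osv_record: dict[str, Any]) -> list[dict[str, str]]:
    """Collect repo, introduced, and fixed commits from GIT ranges in an OSV record."""
    fixes = []
    for affected in osv_record.get("affected", []):
        for rng in affected.get("ranges", []):
            # Only GIT ranges carry commit hashes; skip SEMVER and ECOSYSTEM ranges.
            if rng.get("type") != "GIT":
                continue
            entry = {"repo": rng.get("repo", "")}
            for event in rng.get("events", []):
                # Each event is a single-key object such as {"fixed": "<sha>"}.
                for key in ("introduced", "fixed"):
                    if key in event:
                        entry[key] = event[key]
            fixes.append(entry)
    return fixes


# Hypothetical record, trimmed to the fields used above.
sample = {
    "id": "OSV-0000-0000",
    "affected": [
        {
            "ranges": [
                {
                    "type": "GIT",
                    "repo": "https://example.org/demo.git",
                    "events": [{"introduced": "0"}, {"fixed": "deadbeef"}],
                },
                {"type": "SEMVER", "events": [{"introduced": "1.0.0"}]},
            ]
        }
    ],
}

print(extract_git_fixes(sample))
```

A real collector would still need to reconcile this with issue trackers and merge requests, since many records lack a Git range entirely.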

The instance building phase is where SEC-bench shows its technical sophistication. For each vulnerability, the framework constructs a Docker container that recreates the exact environment where the bug can be reproduced. This involves checking out specific git commits, installing precise dependency versions, and configuring build systems (CMake, Autotools, custom Makefiles) to compile vulnerable code. The framework supports both sanitizer-enabled builds (AddressSanitizer, UndefinedBehaviorSanitizer) for catching memory errors and standard builds for functional testing. Here's what a typical instance configuration looks like:

{
  "vulnerability_id": "CVE-2023-XXXXX",
  "project": "libxml2",
  "commit_hash": "a1b2c3d4e5f6",
  "build_type": "autotools",
  "sanitizers": ["address", "undefined"],
  "test_command": "./xmllint --valid test_input.xml",
  "expected_crash": true,
  "crash_signature": "heap-buffer-overflow"
}
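As a rough illustration of how such a configuration could drive a build and a crash check, the sketch below derives a `-fsanitize=` compiler flag from the `sanitizers` list and compares an AddressSanitizer report against `crash_signature`. The field names come from the example above; the helper names, the matching logic, and the sample stderr text are assumptions, not SEC-bench's actual verifier.

```python
import json
import re

CONFIG_JSON = """
{
  "vulnerability_id": "CVE-2023-XXXXX",
  "project": "libxml2",
  "sanitizers": ["address", "undefined"],
  "expected_crash": true,
  "crash_signature": "heap-buffer-overflow"
}
"""

# AddressSanitizer reports start with a line like
# "==PID==ERROR: AddressSanitizer: <bug-type> on address ...".
ASAN_ERROR = re.compile(r"ERROR: AddressSanitizer: ([\w-]+)")


def sanitizer_cflags(config: dict) -> str:
    """Turn the sanitizer list into compiler flags (empty string if none)."""
    names = config.get("sanitizers", [])
    return f"-fsanitize={','.join(names)} -g" if names else ""


def crash_matches(config: dict, stderr_text: str) -> bool:
    """Return True when the observed behavior agrees with the config."""
    match = ASAN_ERROR.search(stderr_text)
    crashed = match is not None
    if not config["expected_crash"]:
        return not crashed
    return crashed and match.group(1) == config["crash_signature"]


config = json.loads(CONFIG_JSON)

# Hypothetical sanitizer output, shortened to the first report line.
SAMPLE_STDERR = "==4242==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000011"

print(sanitizer_cflags(config))
print(crash_matches(config, SAMPLE_STDERR))
```

Matching on the bug type alone is deliberately coarse; a stricter check might also compare the top stack frames.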

The evaluation phase integrates multiple agent frameworks—SWE-agent, OpenHands, Aider, and smolagents—through a unified interface. Each framework receives the same task in two flavors: PoC generation (given varying levels of context about the vulnerability) or vulnerability patching (given the buggy code and asked to fix it). The agents operate entirely within Docker containers, preventing them from accessing external information or making uncontrolled system modifications. This isolation is critical for reproducibility.

What makes SEC-bench particularly valuable is its tiered evaluation approach. For PoC generation tasks, the system provides three context levels: minimal (just the CVE description), moderate (including affected files), and maximal (with the actual patch). This graduated disclosure tests whether agents can perform security research with limited information, mimicking real-world scenarios where attackers work from sparse vulnerability disclosures. The verification step uses SecVerifier to confirm that generated PoCs actually trigger the vulnerability and that patches genuinely prevent exploitation without breaking functionality.
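The three context levels can be pictured as progressively larger task prompts. The sketch below is one way to assemble them; the `ContextLevel` enum, the `Instance` fields, and the prompt wording are all hypothetical and only illustrate the idea of graduated disclosure.

```python
from dataclasses import dataclass
from enum import Enum


class ContextLevel(Enum):
    MINIMAL = 1   # CVE description only
    MODERATE = 2  # description plus affected files
    MAXIMAL = 3   # description, affected files, and the fix patch


@dataclass
class Instance:
    description: str
    affected_files: list[str]
    patch: str


def build_poc_task(instance: Instance, level: ContextLevel) -> str:
    """Assemble a PoC-generation prompt that reveals more at higher levels."""
    sections = [f"Vulnerability description:\n{instance.description}"]
    if level.value >= ContextLevel.MODERATE.value:
        sections.append("Affected files:\n" + "\n".join(instance.affected_files))
    if level.value >= ContextLevel.MAXIMAL.value:
        sections.append(f"Fix patch:\n{instance.patch}")
    sections.append("Task: produce an input that triggers the vulnerability.")
    return "\n\n".join(sections)


# Hypothetical instance used only to show the three prompt sizes.
demo = Instance(
    description="Out-of-bounds read when parsing a malformed header.",
    affected_files=["src/parser.c"],
    patch="--- a/src/parser.c\n+++ b/src/parser.c\n@@ bounds check added @@",
)

for level in ContextLevel:
    print(level.name, len(build_poc_task(demo, level)))
```

Running the same instance at all three levels makes it possible to see how much of an agent's success depends on being shown the patch.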

The scoring mechanism combines multiple signals: whether the agent produced a working PoC, whether patches pass existing tests, whether sanitizers still detect issues, and whether the solution introduces regressions. This scoring reveals that current LLM agents struggle particularly with memory-safety vulnerabilities that require a deep understanding of pointer semantics and memory layouts, where exact reasoning about program state matters more than recognizing familiar code patterns.
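One simple way to combine the patch-related signals is an all-or-nothing rule: a patch counts as resolved only if existing tests pass, the sanitizer no longer fires, and nothing regresses. The sketch below assumes that rule; the `PatchOutcome` type and the sample outcomes are hypothetical, and the benchmark's actual scoring may weigh signals differently.

```python
from dataclasses import dataclass


@dataclass
class PatchOutcome:
    tests_pass: bool        # existing test suite still passes
    sanitizer_clean: bool   # PoC no longer triggers a sanitizer report
    no_regressions: bool    # no new failures introduced by the patch

    @property
    def resolved(self) -> bool:
        """A patch is resolved only when every signal is positive."""
        return self.tests_pass and self.sanitizer_clean and self.no_regressions


def success_rate(outcomes: list[PatchOutcome]) -> float:
    """Fraction of instances whose patch is fully resolved."""
    if not outcomes:
        return 0.0
    return sum(o.resolved for o in outcomes) / len(outcomes)


# Hypothetical outcomes for three instances.
outcomes = [
    PatchOutcome(tests_pass=True, sanitizer_clean=True, no_regressions=True),
    PatchOutcome(tests_pass=True, sanitizer_clean=False, no_regressions=True),
    PatchOutcome(tests_pass=False, sanitizer_clean=True, no_regressions=False),
]

print(f"{success_rate(outcomes):.2f}")
```

The conjunction is strict on purpose: a patch that silences the sanitizer by breaking functionality should not count as a fix.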

The framework's Docker-centric design solves a persistent problem in security research: reproducibility. By containerizing everything from compiler versions to library dependencies, SEC-bench ensures that a vulnerability instance behaves identically whether tested today or three years from now. Each container includes a complete toolchain, source code at the vulnerable commit, test harnesses, and verification scripts. Researchers can pull pre-built images from the registry or rebuild from scratch using provided Dockerfiles, making the benchmark accessible despite its complexity.

Gotcha

SEC-bench's power comes with substantial operational overhead. The framework demands over 200GB of disk space for Docker images, multiple API tokens (GitHub, GitLab, OpenAI, Anthropic), and significant computational resources—building vulnerable instances from source can take hours for large projects. Setup complexity is non-trivial; you'll need comfort with Docker networking, understanding of how different build systems work, and patience for debugging containerization issues when projects have unusual dependencies. The documentation assumes familiarity with CVE databases and vulnerability research, making the learning curve steep for developers without security backgrounds.

The current dataset skews heavily toward C/C++ vulnerabilities from OSS-Fuzz, which makes sense given that memory-safety issues dominate security research, but limits applicability for teams working in JavaScript, Python, or Rust ecosystems. While the framework theoretically supports any language, the examples, pre-built instances, and verification tooling are optimized for compiled languages with sanitizer support. Additionally, the verification step depends on an external SecVerifier repository, meaning the pipeline isn't truly self-contained—you'll be managing dependencies across multiple repos and hoping their versions stay compatible.

Verdict

Use SEC-bench if you're researching LLM capabilities on security-critical tasks, benchmarking agent frameworks for vulnerability management workflows, or need rigorous evaluation of AI systems before deploying them in security contexts. The automated CVE-to-Docker pipeline alone justifies adoption for security teams wanting reproducible vulnerability testing environments. It's particularly valuable for academic research comparing different agent architectures on adversarial reasoning tasks.

Skip it if you need quick prototyping (the setup investment is measured in days, not hours), work primarily outside C/C++ ecosystems, lack dedicated infrastructure for running Docker workloads at scale, or just want general coding assistance benchmarks; SWE-bench covers general software engineering tasks with simpler setup. This is a research-grade tool for serious security evaluation, not a lightweight testing framework.
