
BountyBench: Testing Whether AI Can Actually Hunt Vulnerabilities



Hook

What if we could measure whether GPT-4 or Claude can find security vulnerabilities as well as a bug bounty hunter? BountyBench turns this question into reproducible experiments, using real-world CVEs as the proving ground.

Context

The security industry has a fundamental scaling problem: there are more codebases than security researchers, and vulnerabilities are discovered faster than they can be patched. Static analysis tools help, but they produce false positives and miss logic flaws. Dynamic fuzzing finds crashes but struggles with business logic vulnerabilities. Human security researchers remain the gold standard, but they don’t scale.

Enter the latest wave of large language models with impressive code reasoning capabilities. Can Claude or GPT-4 actually find security vulnerabilities in unfamiliar codebases? Can they write working exploits? Can they generate patches that don’t break functionality? BountyBench provides an empirical framework to answer these questions. Unlike generic code generation benchmarks like HumanEval, BountyBench focuses specifically on security tasks using real vulnerability reports from bug bounty programs, creating a testbed that measures whether LLMs can perform the complete vulnerability lifecycle: detection, exploitation, and remediation.

Technical Insight

Multi-Phase Iteration

[System architecture diagram, auto-generated: a CLI entry point hands a workflow type and task (the vulnerable codebase plus validation criteria) to a workflow orchestrator, which spawns an LLM agent with code context and prompts. The agent emits exploration commands, hypotheses, and exploit/patch code; execution results and validation output feed back into the loop until the agent produces a structured report. Test cases live as BountyTasks git submodules. The orchestrator runs in an outer Docker container and spawns inner Docker containers with resource limits for isolated execution, writing results to an output directory.]

BountyBench’s architecture revolves around three distinct workflow types — detect, exploit, and patch — that mirror the actual work of security researchers. Each workflow operates as a multi-phase iterative process where an LLM agent analyzes code, generates hypotheses, writes code (exploits or patches), and validates results in isolated Docker containers. The system stores test cases as git submodules in the bountytasks directory, each containing the vulnerable codebase, expected exploit artifacts, and validation criteria.

The workflow orchestration is surprisingly straightforward. When you invoke the detect workflow, BountyBench spins up a Docker container with the target codebase, provides the LLM with context about the project structure, and iteratively prompts it to identify potential vulnerabilities. Here’s what a typical workflow invocation looks like:

# From the BountyBench CLI
python -m bountybench.cli detect \
  --task bountytasks/example-xss \
  --model anthropic/claude-3-5-sonnet-20241022 \
  --max-iterations 5 \
  --output results/

# The framework internally manages the LLM conversation:
# Phase 1: Code exploration - LLM lists files, reads code
# Phase 2: Vulnerability hypothesis - LLM identifies suspicious patterns
# Phase 3: Validation - LLM writes test cases to confirm the vuln
# Phase 4: Reporting - LLM generates structured vulnerability report

The Docker-in-Docker architecture deserves special attention. When testing the exploit workflow, BountyBench needs to execute potentially malicious code that the LLM generates. Rather than trusting the LLM’s output, each exploit attempt runs in a nested Docker container with strict resource limits and network isolation. The outer container orchestrates the workflow while inner containers provide blast radius containment. This design choice reflects hard-won lessons from the security research community: never trust generated code, especially when that code is explicitly designed to break systems.
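The containment pattern described above can be sketched in a few lines. This is a minimal illustration, not BountyBench's actual code: `build_sandbox_cmd` and `run_exploit_sandboxed` are hypothetical names, and the specific limits are placeholders — but the flags are standard `docker run` options for network isolation and resource caps.

```python
import subprocess

def build_sandbox_cmd(image, exploit_path, memory="512m", cpus="1.0", timeout=60):
    """Build a docker run command that executes an untrusted exploit
    script with no network access and hard resource caps."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no outbound connections
        "--memory", memory,         # cap RAM
        "--cpus", cpus,             # cap CPU
        "--pids-limit", "128",      # blunt fork bombs
        "--read-only",              # immutable root filesystem
        "-v", f"{exploit_path}:/exploit.py:ro",
        image,
        "timeout", str(timeout), "python", "/exploit.py",
    ]

def run_exploit_sandboxed(image, exploit_path):
    """Run the generated exploit inside the sandbox and capture output."""
    cmd = build_sandbox_cmd(image, exploit_path)
    return subprocess.run(cmd, capture_output=True, text=True)
```

Even with these limits, the nesting matters: if the exploit escapes the inner container, it lands in the outer orchestration container rather than on the host.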

What makes BountyBench particularly interesting is its provider-agnostic LLM integration layer. The framework abstracts model providers behind a common interface, allowing researchers to benchmark Claude against GPT-4 against open-source models using identical test cases. The model configuration is straightforward:

# config/models.yaml
models:
  claude-sonnet:
    provider: anthropic
    model: claude-3-5-sonnet-20241022
    temperature: 0.7
    max_tokens: 4096
  
  gpt4-turbo:
    provider: openai
    model: gpt-4-turbo-preview
    temperature: 0.7
    max_tokens: 4096

  # Mock model for testing the framework itself
  mock:
    provider: mock
    responses: fixtures/mock-responses.json

The mock provider is particularly clever for framework development. You can record actual LLM responses, then replay them deterministically during testing, avoiding expensive API calls while debugging workflow logic.
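The replay idea is simple enough to sketch. The class names below (`ModelProvider`, `MockProvider`) are illustrative, not BountyBench's actual interface, but they show the shape of a provider abstraction with a deterministic mock:

```python
import json
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Common interface so workflow code never touches a vendor SDK directly."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class MockProvider(ModelProvider):
    """Replays recorded responses in order -- deterministic and free,
    ideal for debugging workflow logic without live API calls."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def complete(self, prompt: str) -> str:
        return next(self._responses)

def load_mock(fixture_path: str) -> MockProvider:
    """Load a recorded session; the fixture is assumed to be a JSON list
    of response strings (mirroring fixtures/mock-responses.json)."""
    with open(fixture_path) as f:
        return MockProvider(json.load(f))
```

Because the orchestrator only sees the `complete` interface, swapping the mock for a real provider requires no workflow changes.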

The patch workflow introduces additional complexity because it must validate both security (does the patch fix the vulnerability?) and correctness (does the patch break existing functionality?). BountyBench handles this through a two-stage validation process. First, it verifies that the exploit no longer works against the patched code. Second, it runs the project’s existing test suite to catch regressions. This mirrors real-world security patching where you can’t just delete the vulnerable code—you need to maintain functionality while eliminating the flaw.
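The two-stage check reduces to a short predicate. A minimal sketch, assuming each stage is an external command whose exit code signals success (the function name `validate_patch` is hypothetical):

```python
import subprocess

def validate_patch(exploit_cmd, test_suite_cmd):
    """Two-stage patch validation: the exploit must now FAIL against the
    patched code, and the project's own test suite must still PASS."""
    # Stage 1: security -- a successful exploit means the patch didn't work.
    exploit = subprocess.run(exploit_cmd, capture_output=True)
    if exploit.returncode == 0:
        return {"ok": False, "reason": "exploit still succeeds"}

    # Stage 2: correctness -- failing tests mean the patch caused a regression.
    tests = subprocess.run(test_suite_cmd, capture_output=True)
    if tests.returncode != 0:
        return {"ok": False, "reason": "patch broke existing tests"}

    return {"ok": True, "reason": "vulnerability fixed, no regressions"}
```

Note the ordering: checking the exploit first avoids paying for a full test-suite run when the patch plainly failed.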

Each bountytask submodule follows a standardized structure: a task.yaml file describing the vulnerability, a repo/ directory containing the vulnerable code at the specific commit, an exploit/ directory with reference exploits, and a patch/ directory with reference fixes. This structure enables researchers to contribute new test cases by simply adding submodules, growing the benchmark suite organically as new vulnerabilities are disclosed and patched.
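A contributor could sanity-check a new submodule against that layout with a few lines. This is a hypothetical helper, not part of BountyBench, but the required entries come straight from the documented structure:

```python
from pathlib import Path

# Layout documented for each bountytask submodule:
# task.yaml plus repo/, exploit/, and patch/ directories.
REQUIRED_ENTRIES = ["task.yaml", "repo", "exploit", "patch"]

def missing_task_entries(task_dir):
    """Return the entries a bountytask directory is missing, in order."""
    root = Path(task_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```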

Gotcha

BountyBench’s setup complexity will test your patience. You need Docker Desktop (not Docker Engine alone, due to the Docker-in-Docker requirements), Python 3.11 specifically, API keys for whichever LLM providers you want to test, and enough disk space for multiple isolated container environments. The git submodules approach means your initial clone is tiny, but you’ll need to manually initialize each bountytask you want to run. On macOS, Docker’s file sharing permissions can cause cryptic failures where the inner containers can’t access mounted volumes—you’ll need to explicitly add paths to Docker Desktop’s shared directories.

The documentation assumes significant prior knowledge. There’s no explanation of what constitutes a successful detection or exploitation, how the scoring system works, or what the output artifacts mean. You’re expected to read the source code to understand result interpretation. The project also lacks guidance on creating new bountytasks, so contributing additional test cases requires reverse-engineering the expected structure from existing examples. For a research framework, this might be acceptable, but it raises the barrier to entry unnecessarily.

Performance and cost are non-trivial concerns. Running a single workflow through five iterations with GPT-4 can consume thousands of tokens per phase, translating to several dollars per test case. The Docker containers can be resource-intensive, especially when running exploits that intentionally stress system resources. Budget for both API costs and compute time when planning benchmark runs. There’s no built-in rate limiting or cost tracking, so a misconfigured batch run could rack up unexpected charges.
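Since the framework itself won't stop a runaway run, a hard budget guard around provider calls is cheap insurance. A minimal sketch of what you might add yourself (`BudgetGuard` is a hypothetical name; the per-token price is a placeholder you'd set for your provider):

```python
class BudgetGuard:
    """Hard token budget for a batch run -- a stand-in for the rate
    limiting and cost tracking the framework doesn't ship."""
    def __init__(self, max_tokens, usd_per_1k_tokens):
        self.max_tokens = max_tokens
        self.rate = usd_per_1k_tokens
        self.used = 0

    def charge(self, tokens):
        """Record usage; raise before the budget is exceeded, not after."""
        if self.used + tokens > self.max_tokens:
            raise RuntimeError(
                f"budget exceeded: {self.used + tokens} > {self.max_tokens} tokens"
            )
        self.used += tokens

    @property
    def spent_usd(self):
        return self.used / 1000 * self.rate
```

Calling `charge` before each LLM request turns a surprise bill into a loud, early failure.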

Verdict

Use BountyBench if you’re conducting research on LLM capabilities in cybersecurity domains—specifically, if you need empirical data about how well different models perform security-critical reasoning tasks. It’s valuable for academic papers comparing model architectures, for AI safety research exploring whether models can discover novel vulnerabilities, or for security teams evaluating whether LLM-assisted tooling could augment their workflows. The framework shines when you need reproducible, controlled experiments with real-world vulnerability test cases rather than synthetic benchmarks.

Skip BountyBench if you’re looking for production security tooling. This is a research framework, not a product. If you need actual vulnerability scanning for your codebase, use mature static analysis tools like Semgrep or CodeQL that provide consistent results without the unpredictability and cost of LLM inference. Skip it if you lack the infrastructure for Docker-heavy workflows or don’t want to invest time understanding the codebase to interpret results. Also skip it if you’re working in an environment with strict data privacy requirements—the workflows send your code to third-party LLM providers, which may not be acceptable for proprietary or sensitive codebases.
