BountyBench: A Framework for Benchmarking AI Agents on Security Vulnerability Research

Hook

What if you could pit GPT-4 against Claude in a head-to-head competition to find security vulnerabilities in real codebases? BountyBench makes this possible, but it's not the bug bounty automation tool you might expect.

Context

The security research community has a problem: as large language models claim increasingly impressive coding capabilities, we lack standardized ways to evaluate their performance on actual security tasks. SWE-bench revolutionized how we measure AI performance on general software engineering, but finding vulnerabilities, exploiting them, and writing patches require fundamentally different skills than fixing GitHub issues.

BountyBench emerged to fill this gap. Unlike static analysis tools that scan for known vulnerability patterns, or general-purpose coding benchmarks that test algorithm implementation, this framework evaluates whether AI agents can perform the complete workflow of a security researcher: analyzing unfamiliar code to discover novel vulnerabilities, crafting working exploits to prove impact, and developing patches that actually fix the issues. The name suggests bug bounty hunting, but the real purpose is research—creating reproducible experiments to understand what modern LLMs can and cannot do in adversarial security contexts.

Technical Insight

BountyBench's architecture centers on three distinct workflow phases: detect, exploit, and patch. Each phase operates as an isolated pipeline that can run multiple iterations, with the LLM receiving feedback from previous attempts. This iterative approach mirrors how human security researchers actually work—you don't find vulnerabilities in one shot; you probe, fail, adjust your hypothesis, and try again.
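To make the iterate-with-feedback idea concrete, here is a minimal sketch of what one phase loop could look like. Everything in it is hypothetical: `run_phase`, `Attempt`, `PhaseResult`, and the stub agent and validator are illustration names, not BountyBench's actual API, and the real framework would call an LLM and a sandboxed validator where the stubs sit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One iteration: what the agent proposed and what the validator said."""
    proposal: str
    succeeded: bool
    feedback: str


@dataclass
class PhaseResult:
    phase: str
    attempts: List[Attempt] = field(default_factory=list)

    @property
    def succeeded(self) -> bool:
        # The loop stops on the first success, so only the last attempt matters.
        return bool(self.attempts) and self.attempts[-1].succeeded


def run_phase(
    phase: str,
    propose: Callable[[str, List[Attempt]], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 5,
) -> PhaseResult:
    """Run one phase, handing all prior attempts back to the agent each round."""
    result = PhaseResult(phase=phase)
    for _ in range(max_iterations):
        proposal = propose(phase, result.attempts)
        ok, feedback = validate(proposal)
        result.attempts.append(Attempt(proposal, ok, feedback))
        if ok:
            break
    return result


# Hypothetical stand-ins for the LLM call and the validator.
def stub_propose(phase: str, history: List[Attempt]) -> str:
    return f"{phase}-hypothesis-{len(history) + 1}"


def stub_validate(proposal: str) -> Tuple[bool, str]:
    ok = proposal.endswith("-3")  # pretend the third hypothesis is the right one
    return ok, ("confirmed" if ok else "not reproducible")


result = run_phase("detect", stub_propose, stub_validate)
print(result.succeeded, len(result.attempts))
```

The history list is the important design choice: because each call to `propose` sees every earlier proposal and its feedback, the agent can adjust its hypothesis rather than retrying blindly.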

The Docker-in-Docker implementation is particularly clever. Each workflow runs inside a container, but that container itself can spawn additional containers for executing untrusted code. Here's how you'd configure a detection workflow:

# Example workflow configuration for vulnerability detection
workflow_config = {
    "task_id": "auth-bypass-001",
    "workflow_type": "detect",
    "model_provider": "openai",
    "model_name": "gpt-4",
    "max_iterations": 5,
    "docker_config": {
        "image": "bountybench/runner:latest",
        "shared_paths": ["/var/run/docker.sock"],
        "memory_limit": "4g"
    }
}

When the workflow executes, the framework clones the target repository into the container, then prompts the LLM with the codebase context and asks it to identify potential vulnerabilities. The LLM returns structured output—file paths, line numbers, vulnerability descriptions, and hypothesized exploit vectors. BountyBench then attempts to validate these hypotheses automatically, feeding the results back for the next iteration.
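The structured output described above could be modeled roughly as follows. This is a sketch under assumptions: the JSON shape and the field names (`file_path`, `line`, `description`, `exploit_vector`) are invented for illustration and are not BountyBench's documented schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """One hypothesized vulnerability (field names are assumptions)."""
    file_path: str
    line: int
    description: str
    exploit_vector: str


def parse_findings(raw: str) -> List[Finding]:
    """Parse a JSON array of findings, skipping malformed entries."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    findings: List[Finding] = []
    for item in items:
        try:
            findings.append(
                Finding(
                    file_path=str(item["file_path"]),
                    line=int(item["line"]),
                    description=str(item["description"]),
                    exploit_vector=str(item["exploit_vector"]),
                )
            )
        except (KeyError, TypeError, ValueError):
            # Model output is untrusted: drop entries with missing or bad fields.
            continue
    return findings


raw = json.dumps([
    {
        "file_path": "app/auth.py",
        "line": 42,
        "description": "Token comparison skips expiry check",
        "exploit_vector": "Replay an expired token",
    },
    {"file_path": "app/db.py"},  # incomplete entry, will be skipped
])
findings = parse_findings(raw)
print(findings)
```

Parsing defensively matters here because model output is not guaranteed to be well-formed; a validator that crashes on a missing field would end the iteration loop for the wrong reason.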

The multi-provider LLM integration deserves attention. Rather than hardcoding API calls to specific services, BountyBench abstracts provider logic behind a unified interface. You can switch from OpenAI to Anthropic to Google's models by changing configuration values, not code. The HELM (Holistic Evaluation of Language Models) integration takes this further, allowing researchers to run standardized evaluation protocols across different models and track metrics like successful vulnerability detection rate, false positive percentage, and iteration efficiency.
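One common way to get that kind of configuration-driven switching is a small interface plus a registry keyed by provider name. The sketch below shows the pattern only; `ChatProvider`, `EchoProvider`, `PROVIDERS`, and `get_provider` are hypothetical names, and the offline echo class stands in for real API clients so the example runs without credentials.

```python
from typing import Callable, Dict, Protocol


class ChatProvider(Protocol):
    """Minimal interface every provider adapter is assumed to satisfy."""

    def complete(self, model_name: str, prompt: str) -> str: ...


class EchoProvider:
    """Offline stand-in for a real provider client (hypothetical)."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, model_name: str, prompt: str) -> str:
        return f"[{self.label}/{model_name}] {prompt}"


# Registry mapping a config value to a factory; real adapters would go here.
PROVIDERS: Dict[str, Callable[[], ChatProvider]] = {
    "openai": lambda: EchoProvider("openai"),
    "anthropic": lambda: EchoProvider("anthropic"),
}


def get_provider(config: dict) -> ChatProvider:
    """Look up the provider named in the config, failing loudly if unknown."""
    name = config["model_provider"]
    try:
        factory = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown model_provider: {name!r}") from None
    return factory()


config = {"model_provider": "anthropic", "model_name": "example-model"}
reply = get_provider(config).complete(config["model_name"], "List risky files.")
print(reply)
```

With this shape, swapping models means editing `model_provider` and `model_name` in the configuration; the workflow code that calls `complete` stays unchanged.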

What makes this particularly useful for research is the git submodules structure. The bountytasks submodule contains versioned vulnerability test cases—real or synthetic codebases with known security issues. Researchers can contribute new tasks, version them independently, and ensure that different papers or experiments use identical test sets. This reproducibility is critical when comparing model performance across studies.

The web interface adds an exploratory dimension beyond batch CLI execution. You can watch in real time as the LLM analyzes code, see its reasoning in each iteration, and manually intervene to test hypotheses. For example, if GPT-4 identifies a potential SQL injection but fails to craft a working exploit after three iterations, you can adjust the prompt or provide hints, then observe how this changes its approach. This interactive debugging reveals model capabilities that purely automated runs might miss.

Gotcha

The setup process is a minefield of version-specific dependencies and configuration gotchas. BountyBench requires exactly Python 3.11, not 3.10 or 3.12. Dependency installation can take 20-30 minutes on a fresh system, and you must manually initialize the git submodules or the task repository will be empty, leading to cryptic errors when you try to run workflows. None of this is documented prominently in the README.
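A small preflight script can catch both problems before a workflow fails cryptically. The sketch below is not part of BountyBench: `preflight` is a hypothetical helper, and it assumes the task submodule lives in a `bountytasks` directory at the repository root. The suggested fix, `git submodule update --init --recursive`, is the standard git command for populating submodules.

```python
import sys
import tempfile
from pathlib import Path
from typing import List, Sequence

REQUIRED_PYTHON = (3, 11)  # the version requirement as stated in this article


def preflight(
    repo_root: Path,
    version: Sequence[int] = sys.version_info[:2],
    submodule_dir: str = "bountytasks",  # assumed location of the task submodule
) -> List[str]:
    """Return human-readable setup problems; an empty list means OK."""
    problems: List[str] = []
    if tuple(version[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in version[:2])
        problems.append(f"Python 3.11 required, found {found}")
    tasks = repo_root / submodule_dir
    # An uninitialized submodule usually shows up as an empty directory.
    if not tasks.is_dir() or not any(tasks.iterdir()):
        problems.append(
            f"{tasks} is missing or empty; run "
            "'git submodule update --init --recursive'"
        )
    return problems


# Demo against a throwaway directory rather than a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = preflight(root, version=(3, 11))  # submodule directory absent
    (root / "bountytasks").mkdir()
    (root / "bountytasks" / "README.md").write_text("placeholder task\n")
    after = preflight(root, version=(3, 11))  # directory now populated
    wrong_python = preflight(root, version=(3, 12))

print(before, after, wrong_python, sep="\n")
```

In a real checkout you would call `preflight(Path("."))` with the default `version` argument so it inspects the interpreter actually running the script.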

Docker Desktop configuration is worse. You need to explicitly add shared paths in Docker Desktop settings before the Docker-in-Docker mounting works, and the error messages when this fails are unhelpful. On macOS, the default file sharing settings won't include the necessary paths, so your containers will spawn but fail silently when trying to execute vulnerability tests. Debugging this requires understanding both Docker's mounting semantics and BountyBench's specific expectations.

The elephant in the room is documentation. With 84 stars and no repository description, this is clearly early-stage research code. There's no explanation of what vulnerability types the existing bountytasks cover, no published success rates for different models, and no guidance on creating new tasks. You're expected to read the source code and reverse-engineer the task format. For academic researchers replicating experiments, this is manageable. For practitioners hoping to use this in actual security work, it's a dealbreaker.

Verdict

Use if: You're conducting academic research on LLM capabilities in security domains and need a reproducible framework for running controlled experiments across multiple models. The Docker isolation and structured workflows make it excellent for generating quantitative data about model performance on vulnerability-related tasks, and the HELM integration provides standardized evaluation metrics. Also use if you're a security team wanting to experiment with AI-assisted vulnerability research in a safe sandbox environment.

Skip if: You need production-ready tooling or comprehensive documentation, or you can't spare 2-3 hours for environment setup and troubleshooting. This is research infrastructure, not a product. Also skip if you want actual bug bounty automation: despite the name, this evaluates AI capabilities rather than autonomously hunting bugs at scale. If you're looking for immediate value in your security workflow, stick with established tools like Semgrep or CodeQL while monitoring BountyBench's evolution as a research project.