Audit: An 8-Stage LLM Pipeline That Hunts Vulnerabilities Through Adversarial Validation
Hook
A single-pass LLM security scan produces 200 findings. An adversarial validation pipeline built on the same model family produces 3, all exploitable. The difference is disagreement by design.
Context
Static analysis tools like Semgrep and Bandit excel at catching OWASP Top 10 patterns but miss logic bugs and novel vulnerability chains. Manual code review finds these deeper issues but doesn't scale beyond small codebases. When security teams started experimenting with LLMs for vulnerability discovery in 2023-2024, they hit the exhaustive-agent problem: ask Claude or GPT-4 to "find bugs in this repository" and you get hundreds of theoretical findings with no reachability proof—SQL injection in a function that's never called with user input, XSS in admin-only templates, race conditions in single-threaded code. The signal-to-noise ratio makes LLM security scanning unusable for anything beyond educational demos.
Cloudflare's Glasswing research identified the core issue: a single model generates findings and validates them with the same biases, creating confirmation loops. Their solution was adversarial validation—use competing models where one tries to disprove the other's work. Evilsocket's audit tool implements this pattern as an 8-stage state machine that turns vulnerability discovery into a multi-agent debate with reachability gates, cost tracking, and automatic PoC compilation. It's designed for offensive security teams running deep audits on 10KLOC+ codebases where the goal isn't compliance checkboxes but finding actual 0-days that traditional SAST tools miss.
Technical Insight
The architecture is a state machine orchestrator spawning isolated Claude agents for each of eight stages: Setup, Recon, Hunt, Validate, Trace, Feedback, Reproduce, and Report. Each stage runs as a subprocess with a dedicated markdown prompt file, JSON Schema contract, and tool allowlist from the claude-agent-sdk (Read, Grep, Glob, Bash). The orchestrator manages SQLite-backed state transitions and enforces schema compliance through a validation-plus-repair loop—if an agent returns malformed JSON, a second LLM call fixes the structure before the next stage consumes it.
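The validation-plus-repair loop described above can be sketched in a few lines. This is a minimal illustration, not the tool's code: `validate_with_repair`, `missing_keys`, and the `repair` callback are hypothetical names, and a simple required-keys check stands in for full JSON Schema validation.

```python
import json
from typing import Callable

def missing_keys(payload: dict, required: dict[str, type]) -> list[str]:
    """Return required keys that are absent or hold a value of the wrong type."""
    return [k for k, t in required.items() if not isinstance(payload.get(k), t)]

def validate_with_repair(raw: str, required: dict[str, type],
                         repair: Callable[[str, str], str],
                         max_repairs: int = 1) -> dict:
    """Parse agent output; if it is malformed, ask `repair` for a corrected version."""
    for attempt in range(max_repairs + 1):
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                problem = "top-level value is not an object"
            else:
                bad = missing_keys(payload, required)
                if not bad:
                    return payload  # contract satisfied, safe for the next stage
                problem = f"missing or mistyped keys: {bad}"
        except json.JSONDecodeError as exc:
            problem = f"invalid JSON: {exc}"
        if attempt == max_repairs:
            raise ValueError(f"output still invalid after repair: {problem}")
        # Assumption: in the real pipeline this would be the second LLM call.
        raw = repair(raw, problem)
```

Passing the problem description to the repair step gives the second call something concrete to fix, and capping the number of repairs keeps a stubbornly malformed output from burning budget indefinitely.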
The killer feature is adversarial validation between Hunt and Validate stages. Hunt runs on Claude Sonnet, tasked with finding vulnerabilities and generating proof-of-concept exploits in isolated scratch directories where it has full Bash access for compilation and execution. Validate runs on Claude Opus with a single job: disprove Hunt's findings. Different models create genuine disagreement rather than confirmation bias. Here's how the orchestrator structures a Hunt task:
# From orchestrator.py - Hunt stage task generation
# (recon_output is the Recon stage's output; task dispatch is not shown here)
from uuid import uuid4

for vuln_class in recon_output['attack_surfaces']:
    hunt_task = {
        'id': f"hunt_{vuln_class['category']}",
        'target_files': vuln_class['files'],
        'attack_pattern': vuln_class['pattern'],
        'context': vuln_class['git_history'],
        'scratch_dir': f"work/hunt_{uuid4()}",  # throwaway directory per task
        'tools': ['Read', 'Grep', 'Bash'],      # tool allowlist for this agent
        'model': 'claude-sonnet-3-5',
        'budget_usd': 2.00,                     # per-task spending cap
    }
Each Hunt task runs in a throwaway directory with Bash access, so the agent can clone dependencies, compile code, and run actual exploits—this is dynamic validation, not LLM hallucination about exploitability. If Hunt claims an XSS vulnerability, it must provide a working HTML payload that triggers when pasted into the scratch environment.
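To make the Hunt-versus-Validate relationship concrete, here is a minimal sketch of an adversarial filter. It assumes a hypothetical `refute` callable that stands in for the Validate agent and returns a reason when it manages to disprove a finding; `Finding` and `adversarial_filter` are illustrative names, not the tool's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass(frozen=True)
class Finding:
    id: str
    claim: str
    poc_path: str

def adversarial_filter(findings: Iterable[Finding],
                       refute: Callable[[Finding], Optional[str]],
                       hunt_model: str,
                       validate_model: str) -> list[Finding]:
    """Keep only the findings that the validator fails to disprove."""
    # Guard against silently losing the disagreement mechanism.
    if hunt_model == validate_model:
        raise ValueError("Hunt and Validate must use different models")
    survivors = []
    for finding in findings:
        reason = refute(finding)  # None means the validator could not disprove it
        if reason is None:
            survivors.append(finding)
    return survivors
```

The default outcome is rejection by argument: a finding survives only when the opposing agent runs out of objections, which is the inverse of a single model grading its own work.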
The Trace stage acts as the reachability gate that kills 90% of false positives. For every Hunt finding that survives Validate, Trace performs static dataflow analysis: can attacker-controlled input actually reach this sink? It searches for call chains from HTTP handlers, CLI arguments, environment variables, or file uploads to the vulnerable function. Most "potential bugs" die here when they can't prove a path from external input to the exploitable code. This is the crucial filter that makes the pipeline practical—without it, you'd drown in findings about vulnerabilities in unused code paths.
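The reachability question reduces to a path search over a call graph. The sketch below is a deliberately simplified stand-in: in the tool the search is performed by an LLM agent reading code, whereas here the call graph is assumed to be already available as a dictionary, and `find_path` is a hypothetical helper.

```python
from collections import deque
from typing import Optional

def find_path(call_graph: dict[str, list[str]],
              entry_points: list[str],
              sink: str) -> Optional[list[str]]:
    """Breadth-first search for a call chain from any entry point to the sink."""
    queue = deque([ep] for ep in entry_points)
    seen = set(entry_points)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            return path  # shortest chain from attacker-facing code to the sink
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None  # no chain found: the finding fails the reachability gate
```

A finding whose sink returns `None` for every HTTP handler, CLI argument parser, and upload handler is the kind of "potential bug" that dies at this stage.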
The Recon stage includes git-mining that's genuinely clever for variant hunting. It greps commit history for security patches using patterns like 'CVE-', 'XSS', 'sanitize', 'escape', 'validate', then generates Hunt tasks targeting sibling code with the same structure:
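# Simplified from Recon stage logic. grep_git_log, parse_diff,
# extract_vuln_pattern, find_similar_code, and HuntTask are defined elsewhere
# and not shown; the wrapper name mine_git_history is illustrative.
def mine_git_history(repo):
    """Yield Hunt tasks for code that resembles previously patched files."""
    security_commits = grep_git_log(repo, patterns=['CVE', 'sanitize', 'fix.*injection'])
    for commit in security_commits:
        patched_files = parse_diff(commit)
        pattern = extract_vuln_pattern(commit.message, patched_files)
        similar_files = find_similar_code(repo, patched_files, min_similarity=0.7)
        yield HuntTask(
            category=pattern.vuln_class,
            files=similar_files,
            hypothesis=f"Similar to {commit.sha[:8]}, check for {pattern.description}",
        )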
If auth.py was patched for SQL injection in a WHERE clause six months ago, Recon automatically hunts auth_v2.py, oauth.py, and any other modules with similar database query patterns. This turns security history into attack surface mapping.
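How "similar" is measured is not specified in the material above. One plausible approach, shown purely as an assumption, is Jaccard similarity over the identifiers in each file, with the same 0.7 threshold. `identifier_set`, `jaccard`, and `find_similar` are hypothetical names.

```python
import re

def identifier_set(source: str) -> set[str]:
    """Collect identifier-like tokens (names, keywords, SQL words) from source text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_similar(patched_source: str,
                 candidates: dict[str, str],
                 min_similarity: float = 0.7) -> list[str]:
    """Return candidate file names whose token overlap with the patched code is high."""
    reference = identifier_set(patched_source)
    return sorted(
        name for name, source in candidates.items()
        if jaccard(reference, identifier_set(source)) >= min_similarity
    )
```

Token overlap is crude, since it ignores structure and ordering, but it is cheap enough to run across a whole repository before spending model budget on the candidates.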
The Feedback stage (stage 6) closes the loop by converting validated traces back into new Hunt tasks. When Trace proves a vulnerability is reachable, Feedback extracts the root cause pattern and spawns searches for the same bug class in other subsystems. One real XSS finding in a template renderer triggers automatic hunts for XSS in all other template systems, creating a snowball effect where each confirmed bug multiplies the search space.
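A feedback loop like this needs deduplication and a cap, or it can re-queue the same task indefinitely. The following is a minimal sketch under assumed names: `run_feedback_loop` is hypothetical, tasks are reduced to `(vuln_class, subsystem)` pairs, and `hunt` stands in for the full Hunt, Validate, and Trace sequence.

```python
from collections import deque
from typing import Callable

Task = tuple[str, str]  # (vulnerability class, subsystem)

def run_feedback_loop(seed_tasks: list[Task],
                      hunt: Callable[[str, str], bool],
                      subsystems: list[str],
                      max_tasks: int = 100) -> list[Task]:
    """Process tasks; each confirmed finding spawns the same hunt in other subsystems."""
    queue = deque(seed_tasks)
    seen = set(seed_tasks)      # every task ever queued, to avoid repeats
    confirmed = []
    processed = 0
    while queue and processed < max_tasks:
        vuln_class, subsystem = queue.popleft()
        processed += 1
        if not hunt(vuln_class, subsystem):
            continue            # unconfirmed findings spawn nothing
        confirmed.append((vuln_class, subsystem))
        for other in subsystems:
            follow_up = (vuln_class, other)
            if follow_up not in seen:
                seen.add(follow_up)
                queue.append(follow_up)
    return confirmed
```

Only confirmed findings fan out, so the search space grows in proportion to real bugs rather than to speculation.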
Cost control is baked into the state store with cooperative abort checks. Each agent polls for budget exhaustion between tool calls, and the orchestrator enforces per-stage limits. The auth module includes a clever billing-protection mechanism: it scrubs ANTHROPIC_API_KEY from the environment while preserving ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN, forcing OAuth-based subscription billing (Claude Pro credits) instead of metered API routing. This prevents accidental metered-API overruns, but it also means runs draw down the user's $20/month subscription quota rather than paying per token.
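Both mechanisms are simple to illustrate. The sketch below assumes hypothetical names (`subscription_env`, `Budget`, `BudgetExceeded`); only the three environment variable names come from the description above.

```python
import os
from typing import Mapping, Optional

def subscription_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment without ANTHROPIC_API_KEY, leaving other variables intact."""
    env = dict(os.environ if base is None else base)
    env.pop("ANTHROPIC_API_KEY", None)  # ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN survive
    return env

class BudgetExceeded(RuntimeError):
    """Raised when an agent should stop because its budget is spent."""

class Budget:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record the cost of a completed model or tool call."""
        self.spent_usd += cost_usd

    def check(self) -> None:
        """Cooperative abort point: call between tool calls."""
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}")
```

The scrubbed mapping would be passed as the `env` argument when spawning an agent subprocess. Because the abort is cooperative, a single expensive call can still overshoot the limit before the next check runs.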
Gotcha
The biggest landmine is the complete absence of sandboxing. Hunt agents execute arbitrary Bash commands in work/ subdirectories on your host machine. If you point this at a malicious repository with trojaned build scripts or Makefiles, those scripts run with your user privileges during PoC compilation. There's no containerization, no seccomp filters, no network restrictions during Hunt—just raw subprocess execution. You need disposable VMs or Docker containers with snapshot-revert for any audit of untrusted code.
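As one possible mitigation, PoC commands could be wrapped in a locked-down container invocation. The sketch below only builds the argument list; the function name, image name, and resource limits are assumptions, and nothing like this ships with the tool as described. The `docker run` flags themselves are standard.

```python
import os

def sandboxed_command(scratch_dir: str, image: str, command: list[str]) -> list[str]:
    """Build a `docker run` argv that confines a PoC build to one scratch directory."""
    mount = f"{os.path.abspath(scratch_dir)}:/work"  # bind mounts need absolute paths
    return [
        "docker", "run", "--rm",
        "--network", "none",                     # no network access during the run
        "--cap-drop", "ALL",                     # drop all Linux capabilities
        "--security-opt", "no-new-privileges",   # block setuid privilege escalation
        "--pids-limit", "256",                   # contain fork bombs
        "--memory", "2g",                        # cap memory use
        "-v", mount,
        "-w", "/work",
        image, *command,
    ]
```

The resulting list can be handed to `subprocess.run`. Disabling the network also prevents the agent from cloning dependencies, so those would have to be fetched before the sandboxed step. A container shares the host kernel, so for hostile repositories a disposable VM remains the stronger boundary.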
Cost explosion is real despite the budget guards. A 50KLOC repository can generate 40+ Hunt tasks at $0.50-2.00 each, then 30+ Validate rounds (Opus is expensive), then Trace calls for every surviving finding. The $30 budget flag in the README is a floor, not a ceiling; most production audits hit $100-300 in Claude API credits before surfacing 2-3 high-severity findings. The concurrency limits just spread the cost over more time.
The adversarial validation pattern only works if Hunt and Validate use genuinely different models. If you route everything through OpenRouter to a single backend or use the same model twice, you lose the disagreement mechanism, and the code doesn't warn you.
Trace stage reachability analysis is static dataflow done by an LLM, so it misses reflection-based sinks, dynamic property access (obj[user_input]()), and framework magic that a real taint tracker like CodeQL would catch.
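The dynamic-dispatch blind spot is easy to reproduce. In this contrived example (all names hypothetical), no call site mentions `export_shell` by name, so a text-based search for callers finds nothing, yet a request parameter selects it at runtime.

```python
class ReportExporter:
    def export_csv(self, rows: list[str]) -> str:
        return ",".join(rows)

    def export_shell(self, rows: list[str]) -> str:
        # Stand-in for a dangerous sink: builds a shell command from the rows.
        # It only returns a string here; nothing is executed.
        return "sh -c " + " ".join(rows)

def handle_request(params: dict[str, str], rows: list[str]) -> str:
    exporter = ReportExporter()
    # The method name is assembled from user input, so grepping for
    # "export_shell(" finds no caller even though the sink is reachable.
    handler = getattr(exporter, "export_" + params["format"])
    return handler(rows)
```

A reachability pass that follows only explicit call chains would report `export_shell` as unreachable and discard a real finding, which is the false-negative risk to weigh against the false positives the gate removes.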
Verdict
Use if: You're an offensive security team with Claude Pro subscriptions auditing 10KLOC+ mature codebases where finding 2-3 real 0-days justifies $100-300 in API costs, you have disposable VMs for the unsandboxed execution, and your target has rich git security history for the mining stage to exploit. The adversarial validation and reachability gating genuinely surface novel logic bugs that Semgrep misses.
Skip if: You're doing compliance-checkbox AppSec (run Bandit instead for free), you're cost-sensitive or working with codebases under 5KLOC (ROI is negative), you don't have sandbox infrastructure for Bash execution, or you need results in minutes rather than hours. This is a deep-audit tool, not a CI/CD gate.