RAPTOR: Building an Autonomous Security Agent from Claude Code and Adversarial Thinking
Hook
What if you could reduce 10,000 static analysis findings to 50 exploitable vulnerabilities without manually reviewing 9,950 false positives? RAPTOR chains LLMs with formal methods to make autonomous security validation actually work.
Context
Static analysis tools like Semgrep and CodeQL are phenomenally good at finding potential vulnerabilities—and therein lies the problem. A medium-sized codebase can generate thousands of SAST findings, with false positive rates approaching 95% depending on rule strictness. Security teams face an impossible choice: tune rules conservatively and miss real bugs, or drown analysts in manual triage.
Meanwhile, the AI agent explosion has produced dozens of frameworks that can write code or answer questions, but few that encode genuine offensive security expertise. Bug bounty hunters and red teams still manually chain tools, correlate findings, and apply adversarial thinking to separate interesting attack surfaces from noise. RAPTOR, built by security veterans Gadi Evron, Halvar Flake, and Daniel Cuthbert, attacks both problems simultaneously: it wraps Claude Code with security-specific orchestration while implementing a multi-stage LLM validation pipeline that dramatically compresses the false positive problem before human eyes see results.
Technical Insight
RAPTOR's architecture is fundamentally about staged validation, with each stage more expensive than the last. The framework doesn't try to replace static analysis; it multiplies its effectiveness through intelligent filtering. The pipeline starts with Semgrep and CodeQL performing traditional pattern matching and dataflow analysis. Then Stages A through D apply increasingly sophisticated validation.
Stage A performs basic LLM-powered filtering: given a finding, can the model explain the dataflow and potential exploit? Stage B attempts exploitability assessment: is this theoretically exploitable given real-world constraints? Stage C generates proof-of-concept exploits, and Stage D proposes patches. Each stage acts as a filter, with only the most promising findings advancing. The insight is that spending 30 seconds of LLM reasoning per finding is vastly cheaper than 10 minutes of human analyst time, and the cumulative false positive reduction compounds across stages.
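The funnel idea can be sketched in a few lines. This is a minimal illustration, not RAPTOR's implementation: the `Finding` fields, the stage predicates, and `run_funnel` are hypothetical stand-ins for the LLM calls each stage would make. The cost arithmetic at the end reuses the article's own figures (10,000 findings, 30 seconds of LLM time or 10 minutes of analyst time each) and assumes a single pass over every finding.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    rule_id: str
    location: str
    reachable: bool    # stand-in for "Stage A could explain the dataflow"
    exploitable: bool  # stand-in for "Stage B judged it exploitable"


# A stage is a name plus a predicate deciding whether a finding advances.
Stage = tuple[str, Callable[[Finding], bool]]


def run_funnel(findings: list[Finding], stages: list[Stage]):
    """Apply each stage in order; return survivors and per-stage counts."""
    counts = [("input", len(findings))]
    survivors = list(findings)
    for name, keep in stages:
        survivors = [f for f in survivors if keep(f)]
        counts.append((name, len(survivors)))
    return survivors, counts


# Hypothetical predicates; a real pipeline would call an LLM here.
stages: list[Stage] = [
    ("stage_a", lambda f: f.reachable),
    ("stage_b", lambda f: f.exploitable),
]

findings = [
    Finding("sqli", "db.py:10", reachable=True, exploitable=True),
    Finding("sqli", "db.py:44", reachable=True, exploitable=False),
    Finding("xss", "views.py:7", reachable=False, exploitable=False),
    Finding("cmdi", "run.py:3", reachable=True, exploitable=True),
]

survivors, counts = run_funnel(findings, stages)

# Cost comparison using the article's figures, converted to hours.
llm_hours = 10_000 * 30 / 3600         # 30 s of LLM reasoning per finding
human_hours = 10_000 * 10 * 60 / 3600  # 10 min of analyst time per finding
```

Because later stages only see what earlier stages let through, the expensive steps (PoC generation, patching) run on a small fraction of the input, which is where the compounding comes from.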
The Z3 integration showcases how RAPTOR blends formal methods with LLM reasoning. CodeQL can identify thousands of potential paths from user input to dangerous sinks, but many are infeasible due to conflicting constraints. Consider this example from the codebase:
# Simplified Z3 path validation logic
from z3 import Solver, sat


def validate_path_feasibility(path_constraints):
    """Return (feasible, model) for a list of path constraints."""
    solver = Solver()
    # Convert CodeQL path constraints to Z3 expressions
    for constraint in path_constraints:
        solver.add(constraint.to_z3())
    # Check satisfiability
    if solver.check() == sat:
        return True, solver.model()  # Path feasible, return concrete values
    return False, None  # Path impossible, prune from analysis


# Used to pre-filter CodeQL results before LLM validation
# (codeql_results is assumed to come from an earlier CodeQL step)
feasible_paths = [p for p in codeql_results
                  if validate_path_feasibility(p.constraints)[0]]
This pre-screening eliminates impossible paths before burning LLM tokens, but the same Z3 engine does something more interesting for binary exploitation: it ranks ROP gadgets by constraint satisfaction rather than naive heuristics. A gadget that looks promising syntactically might require impossible register states to reach; Z3 can prove this algebraically.
The CLAUDE.md configuration system is how RAPTOR transforms general-purpose Claude Code into a security agent. These markdown files define sub-agents, rules, and skills using structured prompts. A simplified security agent definition might look like:
# Security Analysis Agent
## Identity
You are an offensive security researcher specializing in vulnerability discovery.
Think like an attacker. Assume code is hostile until proven otherwise.
## Skills
- semgrep: Run Semgrep with specified rulesets
- codeql: Execute CodeQL queries on codebase
- z3_validate: Check path feasibility with constraint solving
- generate_poc: Create proof-of-concept exploit code
## Rules
1. Always validate findings before reporting
2. Correlate findings to identify attack chains
3. Prioritize by exploitability, not just presence
4. Generate concrete PoCs, not theoretical descriptions
## Workflow
1. Run static analysis tools on target codebase
2. Apply Stage A-D validation pipeline
3. Cross-correlate findings for attack chain discovery
4. Generate ranked report with PoCs
The framework orchestrates tool execution through this declarative configuration. When you invoke RAPTOR on a project, it parses the CLAUDE.md files, initializes sub-agents with their specialized prompts, and coordinates tool invocation based on the defined workflow. The project-based system maintains state across runs, so findings from previous analyses inform current correlation.
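To make the "declarative configuration" idea concrete, here is a minimal sketch of how a markdown agent definition like the one above could be split into sections. It is an assumption for illustration only: `parse_agent_definition` and `parse_skills` are hypothetical helpers, and in practice Claude Code reads these files as prompt text rather than through a parser like this.

```python
import re


def parse_agent_definition(text: str) -> dict:
    """Split a markdown agent definition into its title and '## ' sections."""
    agent = {"title": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):
            # Start a new section, keyed by its lowercased heading.
            current = line[3:].strip().lower()
            agent["sections"][current] = []
        elif line.startswith("# "):
            agent["title"] = line[2:].strip()
        elif current is not None:
            # Drop bullet ("- ", "* ") and numbered ("1. ") list markers.
            item = re.sub(r"^(?:[-*]|\d+\.)\s+", "", line)
            agent["sections"][current].append(item)
    return agent


def parse_skills(items: list[str]) -> dict:
    """Turn 'name: description' items into a name -> description mapping."""
    skills = {}
    for item in items:
        name, _, description = item.partition(":")
        skills[name.strip()] = description.strip()
    return skills


EXAMPLE = """\
# Security Analysis Agent
## Identity
You are an offensive security researcher.
## Skills
- semgrep: Run Semgrep with specified rulesets
- z3_validate: Check path feasibility with constraint solving
## Workflow
1. Run static analysis tools on target codebase
2. Apply Stage A-D validation pipeline
"""

agent = parse_agent_definition(EXAMPLE)
skills = parse_skills(agent["sections"]["skills"])
```

Here `agent["sections"]["workflow"]` holds the ordered steps and `skills` maps each tool name to its description, which is roughly the information an orchestrator needs to decide what to run and in what order.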
The offline capability deserves emphasis: RAPTOR bundles the entire Semgrep registry as a cached ruleset. This is more than a convenience; it is architecturally significant. Most AI agent frameworks assume constant internet connectivity for API calls, model updates, and external tool registries. RAPTOR's scanning tools need no network access, which makes it viable for CI/CD pipelines handling sensitive code that cannot touch external networks. The one exception is the Claude API: the devcontainer packages everything else, and those calls can be proxied or replaced with self-hosted models given sufficient expertise, which a fully air-gapped deployment would require.
The fuzzing integration shows how traditional security tools compose with LLM reasoning. RAPTOR can invoke AFL++ for coverage-guided fuzzing, analyze crashes with the LLM agent to determine exploitability, and correlate crash signatures with static analysis findings. A crash that AFL++ discovers in isolation might seem interesting; when the LLM agent notices it corresponds to a dataflow path that Semgrep flagged and Z3 proved feasible, it jumps to high priority. This cross-tool correlation is where autonomous agents add genuine value beyond simple automation.
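The correlation step can be sketched as a simple join between fuzzer crashes and static findings. This is an illustrative assumption, not RAPTOR's code: the `Crash` and `StaticFinding` records, the matching on function name, and the three priority levels are all hypothetical simplifications of what the LLM agent reasons about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crash:
    function: str  # function where the fuzzer-triggered crash occurred
    signal: str


@dataclass(frozen=True)
class StaticFinding:
    function: str
    rule_id: str
    path_feasible: bool  # e.g. the result of a constraint-solver check


def prioritize(crashes: list[Crash], findings: list[StaticFinding]):
    """Rank crashes by how strongly static analysis corroborates them."""
    by_function: dict[str, list[StaticFinding]] = {}
    for finding in findings:
        by_function.setdefault(finding.function, []).append(finding)

    ranked = []
    for crash in crashes:
        matches = by_function.get(crash.function, [])
        if any(m.path_feasible for m in matches):
            priority = "high"    # crash + flagged path + proven feasible
        elif matches:
            priority = "medium"  # crash + flagged path, feasibility unproven
        else:
            priority = "low"     # crash with no static corroboration
        ranked.append((priority, crash))

    order = {"high": 0, "medium": 1, "low": 2}
    ranked.sort(key=lambda pair: order[pair[0]])  # stable sort keeps input order
    return ranked


crashes = [
    Crash("parse_header", "SIGSEGV"),
    Crash("log_msg", "SIGABRT"),
    Crash("copy_field", "SIGSEGV"),
]
findings = [
    StaticFinding("copy_field", "buffer-overflow", path_feasible=True),
    StaticFinding("log_msg", "format-string", path_feasible=False),
]

ranked = prioritize(crashes, findings)
```

In this toy input, the `copy_field` crash rises to the top because a static finding on the same function has a feasible path, while the uncorroborated `parse_header` crash falls to the bottom.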
Gotcha
The authors are admirably honest: this is experimental software held together with 'enthusiasm and duct tape.' Expect incomplete features, minimal documentation beyond command-line help, and the need to read source code to understand behavior. The web scanning functionality is explicitly marked as alpha/stub status. You're not getting a polished product with GitHub issues triaged and a support team standing by—you're getting a research artifact from industry veterans that works for their workflows.
The Claude Code dependency is both a strength and a constraint. The tight coupling to Anthropic's ecosystem means you're betting on their API availability, pricing, and capabilities. While the architecture claims to be pluggable, and technically you could swap the LLM backend, the prompt engineering is optimized for Claude's specific reasoning patterns. Porting to GPT-4 or open models would require significant prompt rewriting, plus validation that the staged pipeline still achieves comparable false positive reduction. The CodeQL licensing issue is more problematic: commercial use is restricted, which creates legal uncertainty for enterprise security teams that might want to deploy this on proprietary codebases. The 6GB devcontainer and privileged Docker requirement also limit deployment scenarios, since many locked-down corporate environments won't permit privileged containers or that much local storage.
Verdict
Use if: You're on a red team, doing security research, or hunting bugs at scale and spend more time triaging false positives than finding real vulnerabilities. The multi-stage LLM validation alone can save dozens of hours on medium-sized audits, and the Z3 path pruning prevents wasted effort on impossible exploitation scenarios. Use if you're comfortable reading Python source to understand behavior, working with experimental tooling, and have the offensive security background to evaluate whether the agent's reasoning makes sense. The cross-finding correlation for attack chain discovery is genuinely novel and valuable for complex codebases.
Skip if: You need production-ready software with documentation, support, and stable APIs. Skip if commercial licensing uncertainty around CodeQL is a blocker for your use case, or if your environment cannot accommodate privileged Docker containers and large images. Also skip if you're uncomfortable with autonomous agents making security decisions: the staged pipeline reduces but doesn't eliminate the need for expert human validation. For organizations needing compliance-ready SAST with vendor support, stick with GitHub Advanced Security or commercial Semgrep offerings rather than experimental agent frameworks.