
RAPTOR: Turning Claude Code Into a Security Research Agent With .claude.md Files

Hook

What if your IDE could autonomously discover vulnerabilities, generate exploits, and investigate deleted GitHub repositories—all by reading a configuration file? RAPTOR shows how Claude Code’s customization system transforms a general-purpose AI assistant into a domain-specific offensive security agent.

Context

Security researchers face a workflow problem: vulnerability discovery requires orchestrating dozens of specialized tools—Semgrep for static analysis, AFL for fuzzing, CodeQL for dataflow tracking—while simultaneously reasoning about exploitability, attack surfaces, and remediation strategies. Traditional approaches fall into two camps: fully manual analysis where researchers context-switch between tools and synthesize findings themselves, or rigid automation pipelines that generate high false-positive rates because they lack contextual understanding.

RAPTOR exploits a newer paradigm: configurable AI agents that layer domain expertise onto general-purpose LLMs. Built specifically for Anthropic’s Claude Code IDE, it uses the .claude/ directory structure—a customization system where markdown files define agent behaviors, skills, and tool integrations. By writing Claude.md configuration files that embed adversarial thinking patterns and orchestrate security tooling, RAPTOR creates what amounts to an autonomous penetration tester. It’s not just chatbot-assisted security research; it’s multi-agent orchestration where specialized sub-agents handle reconnaissance, exploitation, patching, and forensics workflows while coordinating traditional security tools through agentic reasoning.

Technical Insight

[System architecture diagram — auto-generated. The Claude Code IDE loads custom prompts from the .claude/ directory: commands/ defines available actions, skills/ provides automation modules, and rules/ embeds adversarial patterns. These feed the Security Agent Core, which queries a LiteLLM orchestrator with tool support. The orchestrator tries Claude 3.5 Sonnet as the primary provider, falling back first to GPT-4 Turbo and then to Gemini Pro, while a Budget Manager enforces budget checks and a Cost Logger records per-request cost and usage metrics.]

The core architectural insight of RAPTOR is that Claude Code’s customization system—originally designed for general coding assistance—can be hijacked for domain-specific agent behaviors through carefully crafted markdown prompts. The .claude/ directory becomes a prompt engineering framework: commands/ defines available actions, skills/ provides reusable automation modules, and rules/ embeds adversarial reasoning patterns. This transforms the IDE from a code completion tool into a security operations orchestrator.
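The layering idea can be sketched in a few lines. This loader is hypothetical — Claude Code's real prompt assembly is internal to the IDE, and the subdirectory ordering here is an assumption — but it illustrates how flat markdown files become a composed system prompt:

```python
from pathlib import Path

def build_system_prompt(claude_dir: str) -> str:
    """Concatenate .claude/ markdown files into one layered system prompt.

    Hypothetical loader for illustration only; rules/ (reasoning patterns)
    are placed first, then commands/ and skills/, so behavioral constraints
    precede action definitions.
    """
    sections = []
    for subdir in ("rules", "commands", "skills"):
        for md_file in sorted(Path(claude_dir, subdir).glob("*.md")):
            sections.append(f"## {subdir}/{md_file.name}\n{md_file.read_text()}")
    return "\n\n".join(sections)
```

The point is that the "agent" is nothing more than this composed prompt plus tool bindings — which is why swapping the markdown swaps the domain.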

The LiteLLM integration demonstrates production-grade LLM orchestration patterns. Rather than directly calling Claude’s API, RAPTOR uses LiteLLM as an abstraction layer with sophisticated fallback logic, cost tracking, and quota management:

import litellm
from litellm import completion
from litellm.integrations.custom_logger import CustomLogger

class BudgetExceededError(Exception):
    """Raised when a request would push spend past the configured cap."""

class CostTrackingLogger(CustomLogger):
    """Records per-request cost from LiteLLM's success callback."""
    def __init__(self):
        self.last_cost = 0.0

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        self.last_cost = kwargs.get("response_cost", 0.0)

    def get_last_request_cost(self):
        return self.last_cost

class SecurityAgentLLM:
    def __init__(self, budget_usd=100.0):
        self.remaining_budget = budget_usd
        self.logger = CostTrackingLogger()
        litellm.callbacks = [self.logger]

    def query_with_fallback(self, messages, tools=None):
        providers = [
            {"model": "claude-3-5-sonnet-20241022", "timeout": 30},
            {"model": "gpt-4-turbo", "timeout": 20},
            {"model": "gemini-pro", "timeout": 15},
        ]

        for config in providers:
            try:
                response = completion(
                    model=config["model"],
                    messages=messages,
                    tools=tools,
                    timeout=config["timeout"],
                )

                # Track costs against the remaining budget
                cost = self.logger.get_last_request_cost()
                if cost > self.remaining_budget:
                    raise BudgetExceededError(
                        f"Request ${cost:.4f} exceeds remaining budget"
                    )
                self.remaining_budget -= cost

                return response
            except BudgetExceededError:
                raise
            except Exception as e:
                if "rate_limit" in str(e).lower():
                    print(f"Rate limited on {config['model']}, trying next provider")
                    continue
                raise

        raise RuntimeError("All providers exhausted (rate limited)")

This pattern solves real operational problems: automatic provider fallback prevents single-point-of-failure on API quotas, cost tracking with budget enforcement prevents runaway expenses during autonomous operations, and unified interfaces mean security skills can be provider-agnostic.
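The budget gate itself reduces to a few lines of accounting. This simplified stand-in is not LiteLLM's actual BudgetManager API — it just isolates the enforcement logic that matters for autonomous runs:

```python
class SimpleBudget:
    """Minimal budget gate: accumulate per-request cost, refuse overruns."""

    def __init__(self, max_budget_usd: float):
        self.max_budget_usd = max_budget_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> float:
        """Record a request's cost, or raise if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.max_budget_usd:
            raise RuntimeError(
                f"${cost_usd:.4f} exceeds remaining ${self.remaining():.4f}"
            )
        self.spent_usd += cost_usd
        return self.remaining()

    def remaining(self) -> float:
        return self.max_budget_usd - self.spent_usd
```

Charging before each autonomous step, rather than auditing after a run, is what turns a cost report into an actual circuit breaker.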

The OSS Forensics architecture showcases multi-source evidence collection for investigating compromised or suspicious repositories. When analyzing a supply chain attack or investigating deleted malicious code, RAPTOR queries three complementary data sources:

class OSSForensicsAgent:
    def investigate_repository(self, repo_url, suspicious_timeframe):
        evidence = {}
        
        # 1. Immutable event history from GH Archive via BigQuery
        # Catches deleted commits, force-pushes, maintainer changes
        query = f"""
        SELECT type, created_at, payload
        FROM `githubarchive.day.*`
        WHERE repo.name = '{repo_url.split('github.com/')[1]}'
        AND created_at BETWEEN '{suspicious_timeframe['start']}' 
        AND '{suspicious_timeframe['end']}'
        ORDER BY created_at
        """
        evidence['immutable_events'] = self.bigquery_client.query(query).result()
        
        # 2. Current state via GitHub API
        # Compare against historical events to detect deletions
        evidence['current_state'] = self.github_client.get_repo_state(repo_url)
        
        # 3. Wayback Machine for deleted artifacts
        # Recover removed documentation, homepages, or code
        evidence['archived_snapshots'] = self.wayback_client.get_snapshots(
            repo_url,
            timeframe=suspicious_timeframe
        )
        
        # LLM analyzes discrepancies across sources
        analysis_prompt = self.build_forensics_prompt(evidence)
        return self.llm.query_with_fallback(messages=[analysis_prompt])

This multi-source approach addresses a critical forensics challenge: malicious actors delete evidence. GitHub’s API shows current state, but attackers can force-push to rewrite history or delete repositories entirely. GH Archive provides an immutable audit trail of all GitHub events (commits, PRs, issues), while Wayback Machine recovers deleted web artifacts. The LLM agent correlates discrepancies—if GH Archive shows commits that no longer exist in the API response, that’s evidence of history rewriting.
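The history-rewrite check at the heart of this correlation is a set difference. A sketch — the event and commit shapes here are simplified assumptions, not GH Archive's exact schema:

```python
def find_rewritten_commits(archive_events: list, api_commit_shas: set) -> list:
    """Return commit SHAs seen in the immutable event log but absent
    from the live GitHub API — evidence of force-pushes or deletion.

    archive_events: PushEvent-like dicts with a 'shas' list (assumed shape).
    api_commit_shas: SHAs currently reachable via the GitHub API.
    """
    archived = {sha for event in archive_events for sha in event.get("shas", [])}
    return sorted(archived - api_commit_shas)
```

An empty result means the live history is consistent with the archive; any survivor SHA is a concrete lead the LLM agent can investigate further.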

The SecOpsAgentKit demonstrates nested agent architecture for specialized penetration testing workflows. Rather than a monolithic security agent, RAPTOR decomposes operations into specialist sub-agents: a reconnaissance agent maps attack surfaces, an exploitation agent tests vulnerabilities, and a reporting agent synthesizes findings. Each sub-agent has its own Claude.md configuration defining role-specific reasoning patterns and tool access. This mirrors how human red teams organize—specialists collaborate rather than generalists doing everything.
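A toy illustration of that decomposition — the names and interfaces are invented for the sketch, since RAPTOR wires its sub-agents through Claude.md files rather than Python classes:

```python
from typing import Callable

class RedTeamOrchestrator:
    """Route pipeline phases to specialist sub-agents, in registration order."""

    def __init__(self):
        self.phases: list = []

    def register(self, name: str, agent: Callable) -> None:
        self.phases.append((name, agent))

    def run(self, target: str) -> dict:
        state = {"target": target}
        for name, agent in self.phases:
            # Each specialist sees all prior findings, mirroring how a
            # human exploitation lead works from the recon lead's map.
            state[name] = agent(state)
        return state
```

The shared `state` dict is the glue: reconnaissance output becomes exploitation input without any sub-agent knowing about the others.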

The skills system provides the extensibility mechanism. Security researchers contribute Python modules that wrap tools like Semgrep or AFL, exposing them to the agent through function calling:

# skills/semgrep_analysis.py
import json
import subprocess

def analyze_with_semgrep(target_path: str, rule_set: str = "auto") -> dict:
    """
    Run Semgrep static analysis on target code.
    
    Args:
        target_path: Path to code to analyze
        rule_set: Semgrep rule set (auto, security, owasp-top-10)
    
    Returns:
        dict with findings, severity, and confidence scores
    """
    result = subprocess.run(
        ["semgrep", "--config", rule_set, "--json", target_path],
        capture_output=True,
        text=True
    )
    
    findings = json.loads(result.stdout)
    
    # LLM validates findings by examining dataflow;
    # validate_exploitability is a skill-provided helper defined elsewhere
    validated = []
    for finding in findings['results']:
        if validate_exploitability(finding):
            validated.append(finding)
    
    return {"findings": validated, "raw_output": findings}

This creates a feedback loop: traditional tools provide initial findings, the LLM validates exploitability by reasoning about dataflow and business logic, then generates exploit code or patches. It augments rather than replaces proven static analysis.
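The loop can be expressed as a filter-then-rank pipeline. In this sketch, `validate_fn` stands in for the LLM's exploitability reasoning (the function name and scoring scheme are assumptions for illustration):

```python
def triage_findings(findings: list, validate_fn, min_confidence: float = 0.5) -> list:
    """Keep only findings the validator scores above a threshold,
    ranked by confidence.

    validate_fn: callable returning a 0..1 exploitability confidence;
    in RAPTOR this role is played by LLM reasoning over dataflow
    and business logic rather than a fixed heuristic.
    """
    triaged = []
    for finding in findings:
        score = validate_fn(finding)
        if score >= min_confidence:
            triaged.append({**finding, "confidence": score})
    return sorted(triaged, key=lambda f: f["confidence"], reverse=True)
```

The ranking step is what cuts the false-positive noise that rigid pipelines produce: low-confidence findings drop out before a human ever sees them.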

Gotcha

The authors describe RAPTOR as a 'quick hack held together by vibe coding and duct tape,' and they're not joking. This is a research prototype with sharp edges. Error handling is minimal—if a sub-agent fails mid-operation, you'll likely get cryptic exceptions rather than graceful degradation. The autonomous tool installation behavior is particularly aggressive: unless you're using the devcontainer, RAPTOR will directly install Semgrep, CodeQL, AFL, and their dependencies on your system without asking. For researchers on managed workstations or in environments with security policies, this can cause conflicts with existing tooling or violate compliance requirements.

The Claude Code dependency is the fundamental limitation. RAPTOR isn’t a standalone CLI tool or library you can import into existing workflows—it’s a configuration layer for Anthropic’s desktop IDE. You cannot run it in CI/CD pipelines, integrate it into existing security automation, or use it programmatically without the IDE running. If you’re building production security infrastructure or need programmatic access, RAPTOR’s architecture is incompatible. It’s specifically designed for interactive security research sessions within the Claude Code environment, making it a power user’s workbench rather than an automation platform.

Verdict

Use if: You're already using Claude Code for security work and want autonomous vulnerability analysis workflows that coordinate multiple tools while applying LLM reasoning for exploitability validation—especially valuable if you're investigating supply chain compromises or need GitHub forensics capabilities that recover deleted evidence. RAPTOR excels at exploratory research, CTF automation, and rapid prototyping of security analysis approaches where speed of experimentation matters more than production robustness.

Skip if: You need standalone tooling for CI/CD integration, production-grade reliability with comprehensive error handling, or fine-grained control over system modifications. The vendor lock-in to Claude Code makes it unsuitable for programmatic security automation or team environments where not everyone uses the same IDE. If you're building enterprise security infrastructure rather than doing interactive research, use Semgrep/CodeQL directly in your pipelines or evaluate general-purpose agent frameworks like LangGraph that provide more architectural flexibility.
