Building Explainable Offensive Security with Agentic AI: A Deep Dive into Multi-Agent Reconnaissance

Hook

What if your security scanner could explain exactly why it chose to test specific endpoints, maintain cryptographic proof of every action, and optimize its attack path using simulated annealing—all while respecting legal boundaries?

Context

Traditional API security scanners operate like blunt instruments: they crawl every endpoint, fire every payload, and generate mountains of results with minimal context about why certain tests were prioritized. Tools like Nuclei excel at speed and template coverage, while Burp Suite offers comprehensive manual testing capabilities, but neither explains its reasoning nor adapts its strategy based on what it learns during reconnaissance.

The Offensive_AI_CON_2025_Framework emerges from this gap, created by Kurtis Shelton for a talk at Offensive AI Con in San Diego. It represents an experimental approach to offensive security testing: what happens when you replace deterministic scanning logic with autonomous agents that plan, reason, and justify their actions? More critically, it tackles the elephant in the room for automated offensive tools—how do you build something powerful enough to find vulnerabilities but safe enough to deploy without triggering legal concerns or causing unintended damage? This framework attempts to answer that question through a multi-agent architecture with built-in safety rails, audit trails, and explainable decision-making.

Technical Insight

At its core, this framework implements a four-agent pipeline that mirrors human security testing workflows but adds probabilistic reasoning and optimization algorithms. The architecture follows a chain-of-responsibility pattern where each agent specializes in one phase: Discovery, Contract Inference, Planning, and Verification.

The Discovery agent builds what the framework calls a Probabilistic Endpoint Graph (PEG), which goes beyond simple URL crawling. Instead of just collecting paths, it fingerprints each endpoint using multi-modal features: response headers, latency patterns, content size distributions, and TLS characteristics. This creates a richer representation of the API surface. For example, two endpoints might have near-identical paths but drastically different latency profiles, suggesting one hits a database while the other serves cached content. The PEG captures these nuances, allowing downstream agents to make smarter decisions about where to focus testing effort.
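To make the idea concrete, here is a minimal sketch of what a PEG node might hold and how a latency fingerprint could hint at backend work. The names (`EndpointNode`, `existence_prob`, `likely_uncached`), the example paths, and the 3x latency ratio are illustrative assumptions, not taken from the framework's code.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class EndpointNode:
    """Hypothetical PEG node: one endpoint plus its observed fingerprints."""
    method: str
    path: str
    existence_prob: float = 0.5  # assumed belief that the endpoint is real
    latency_ms: list[float] = field(default_factory=list)
    body_sizes: list[int] = field(default_factory=list)
    headers_seen: set[str] = field(default_factory=set)

    def observe(self, latency_ms: float, body_size: int, headers: dict[str, str]) -> None:
        """Record one response's latency, size, and header names."""
        self.latency_ms.append(latency_ms)
        self.body_sizes.append(body_size)
        self.headers_seen.update(h.lower() for h in headers)

    def median_latency(self) -> float:
        return median(self.latency_ms) if self.latency_ms else 0.0


def likely_uncached(node: EndpointNode, baseline: EndpointNode, ratio: float = 3.0) -> bool:
    """Crude assumed heuristic: far slower than a baseline suggests backend work."""
    base = baseline.median_latency()
    return base > 0 and node.median_latency() >= ratio * base


cached = EndpointNode("GET", "/api/v1/status")
dynamic = EndpointNode("GET", "/api/v1/orders")
for ms in (11.0, 12.0, 10.0):
    cached.observe(ms, 120, {"Cache-Control": "max-age=60"})
for ms in (95.0, 140.0, 110.0):
    dynamic.observe(ms, 4800, {"Content-Type": "application/json"})

print(likely_uncached(dynamic, cached))
```

A real implementation would presumably track distributions rather than a single median, but even this reduced form shows how a downstream agent could rank endpoints by how much backend work they appear to trigger.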

Here's a simplified example of how the Contract Inference engine might work with the PEG output:

class ContractInferenceAgent:
    def infer_schema(self, endpoint_data, peg_features):
        # Start with prior assumptions from PEG fingerprints
        param_distributions = self._build_priors(peg_features)
        
        # Active sampling: test uncertain parameters first
        for param in self._prioritize_by_uncertainty(param_distributions):
            test_values = self._generate_probes(param)
            responses = self._execute_safe_probes(endpoint_data['url'], 
                                                   param, test_values)
            
            # Update beliefs using posterior-like estimation
            param_distributions[param].update(responses)
            
            # Early stopping once every parameter has reached the confidence threshold
            if all(d.confidence > 0.85 for d in param_distributions.values()):
                break
        
        return self._construct_schema(param_distributions)
    
    def _prioritize_by_uncertainty(self, distributions):
        # Information gain optimization: test what we know least about
        return sorted(distributions.keys(), 
                     key=lambda p: distributions[p].entropy(), 
                     reverse=True)

The key insight here is active sampling: instead of exhaustively testing every possible parameter value, the agent probes the parameters it is most uncertain about first. The idea is borrowed from Bayesian optimization and active learning (uncertainty sampling), and it can substantially reduce the number of requests needed to build a reliable API schema.
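The snippet above relies on per-parameter objects that expose `update`, `entropy()`, and `confidence`. One way such an object could look is sketched below, assuming a belief over a parameter's type kept as pseudo-counts. The class name `ParamBelief`, the candidate types, and the choice to feed `update` a list of observed type labels (rather than raw responses) are all assumptions made for illustration.

```python
import math
from collections import Counter


class ParamBelief:
    """Hypothetical belief over a parameter's type, stored as pseudo-counts."""

    def __init__(self, candidates=("int", "string", "uuid"), prior=1.0):
        # Uniform prior: every candidate type starts with the same pseudo-count
        self.counts = Counter({c: prior for c in candidates})

    def update(self, observed_types):
        """Add one count for each probe result consistent with a given type."""
        self.counts.update(observed_types)

    def probabilities(self):
        total = sum(self.counts.values())
        return {c: n / total for c, n in self.counts.items()}

    def entropy(self):
        """Shannon entropy in bits; higher means more uncertainty."""
        return -sum(p * math.log2(p) for p in self.probabilities().values() if p > 0)

    @property
    def confidence(self):
        """Probability mass on the most likely candidate."""
        return max(self.probabilities().values())


beliefs = {"user_id": ParamBelief(), "sort": ParamBelief()}
beliefs["sort"].update(["string"] * 12)  # "sort" is already well understood

# Probe the most uncertain parameter first
order = sorted(beliefs, key=lambda p: beliefs[p].entropy(), reverse=True)
print(order)
```

With a uniform prior over three types, `user_id` carries the maximum entropy of log2(3) bits, so it is probed before `sort`, whose belief is already concentrated on `string`.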

The Planner agent is where things get particularly interesting. It uses simulated annealing to optimize test case generation, treating the choice of verification steps as a discrete optimization problem. The planner weighs three factors: likelihood of finding vulnerabilities, cost of execution (time and requests), and safety constraints. Simulated annealing lets it escape local optima: while the temperature is high, the planner occasionally accepts a lower-scoring plan, so it might initially focus on SQL injection tests for database-backed endpoints and then 'jump' to authentication bypass attempts. As the temperature cools, such jumps become rarer and the plan settles.

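import math
import random

# Shown as a method of the planner class; `self` supplies max_iterations and cooling_rate
def plan_verification(self, inferred_contracts, temperature=1.0):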
    current_plan = self._generate_baseline_plan(inferred_contracts)
    current_score = self._evaluate_plan(current_plan)
    
    for iteration in range(self.max_iterations):
        # Generate neighbor by swapping/adding/removing test steps
        candidate_plan = self._mutate_plan(current_plan)
        candidate_score = self._evaluate_plan(candidate_plan)
        
        # Accept if better, or probabilistically if worse (exploration)
        delta = candidate_score - current_score
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current_plan = candidate_plan
            current_score = candidate_score
        
        temperature *= self.cooling_rate
    
    return self._annotate_with_rationales(current_plan)

Notice the _annotate_with_rationales call at the end—every test step in the final plan includes an explanation of why it was selected. This explainability is crucial for both debugging and compliance, letting security teams understand and justify the scanner's decisions.
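The planner code leaves `_evaluate_plan` and `_annotate_with_rationales` undefined. The sketch below shows one plausible shape for both, assuming a score equal to expected findings minus a weighted request cost, with any unsafe step disqualifying the plan. `PlanStep`, its fields, and the `cost_weight` value are hypothetical and not drawn from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    """Hypothetical test step with the three factors the planner weighs."""
    name: str
    hit_probability: float  # estimated chance the step finds an issue
    request_cost: int       # number of requests the step needs
    safe: bool              # whether the policy permits the step


def evaluate_plan(steps, cost_weight=0.01):
    """Assumed scoring: expected findings minus weighted cost; unsafe plans are rejected."""
    if any(not s.safe for s in steps):
        return float("-inf")
    expected_findings = sum(s.hit_probability for s in steps)
    total_cost = sum(s.request_cost for s in steps)
    return expected_findings - cost_weight * total_cost


def annotate_with_rationales(steps):
    """Attach a human-readable reason to each step in the final plan."""
    return [
        {
            "step": s.name,
            "rationale": (
                f"estimated hit probability {s.hit_probability:.2f} "
                f"for {s.request_cost} requests"
            ),
        }
        for s in steps
    ]


plan = [
    PlanStep("sqli-orders", 0.30, 10, True),
    PlanStep("auth-bypass-admin", 0.15, 5, True),
]
print(round(evaluate_plan(plan), 2))
for entry in annotate_with_rationales(plan):
    print(entry)
```

Scoring an unsafe plan as negative infinity means the annealing acceptance test can never select it, since `math.exp` of negative infinity is zero. That is one simple way to treat safety as a hard constraint rather than a soft penalty.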

The Verifier Ensemble executes the planned tests through MCP-style typed adapters. This abstraction layer is elegant: the framework doesn't care whether you're using raw HTTP requests, Nuclei templates, or Burp's REST API—it just needs adapters that conform to a contract. The verification layer adds differential execution (running the same test through multiple tools and comparing results) and counterfactual validation (changing one variable and verifying the outcome changes as expected).
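A minimal sketch of the adapter contract and differential execution might look like the following. The `VerifierAdapter` protocol, the `StubAdapter` class, the adapter names, and the `lab.example` target are assumptions for illustration; the stubs return canned results and send no traffic, standing in for real wrappers around HTTP clients or scanners.

```python
from typing import Protocol


class VerifierAdapter(Protocol):
    """Hypothetical contract that every tool adapter would satisfy."""
    name: str

    def run(self, target: str, check: str) -> bool:
        """Return True if the tool reports the check as a finding."""
        ...


class StubAdapter:
    """Stand-in for a real tool wrapper; returns canned results only."""

    def __init__(self, name: str, findings: set[tuple[str, str]]):
        self.name = name
        self._findings = findings

    def run(self, target: str, check: str) -> bool:
        return (target, check) in self._findings


def differential_run(adapters: list[VerifierAdapter], target: str, check: str) -> dict:
    """Run one check through every adapter and flag any disagreement."""
    results = {a.name: a.run(target, check) for a in adapters}
    return {"results": results, "agreed": len(set(results.values())) == 1}


hit = ("https://lab.example/api/orders", "sqli")
adapters = [
    StubAdapter("raw-http", {hit}),
    StubAdapter("template-scanner", {hit}),
    StubAdapter("proxy-scanner", set()),
]
report = differential_run(adapters, *hit)
print(report)
```

Here two adapters report the finding and one does not, so `agreed` is false. A disagreement like this is the signal that differential execution is meant to surface for closer review.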

What really sets this framework apart is its safety layer. The Policy DSL lets you define boundaries using capability tokens—essentially permission slips for specific actions. A policy might grant the 'http.get' capability but deny 'http.post', or allow testing only within certain IP ranges. Rate limiting prevents runaway testing, and kill switches provide emergency stops. Every action generates an entry in an immutable audit log with cryptographic provenance chains, creating a verifiable record of what the framework did and why.
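The two mechanisms described above, capability checks and a tamper-evident audit trail, can be sketched in a few lines. This is an assumed design, not the framework's implementation: `Policy`, `AuditLog`, the host allowlist, and the record fields are hypothetical, and the log uses a plain SHA-256 hash chain, whereas a real provenance chain might also sign entries.

```python
import hashlib
import json


class Policy:
    """Hypothetical policy: granted capability tokens plus allowed hosts."""

    def __init__(self, capabilities: set[str], allowed_hosts: set[str]):
        self.capabilities = capabilities
        self.allowed_hosts = allowed_hosts

    def permits(self, capability: str, host: str) -> bool:
        return capability in self.capabilities and host in self.allowed_hosts


class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries: list[dict] = []

    @staticmethod
    def _digest(prev_hash: str, record: dict) -> str:
        payload = prev_hash + json.dumps(record, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, record: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append(
            {"record": record, "prev": prev_hash, "hash": self._digest(prev_hash, record)}
        )

    def verify(self) -> bool:
        """Recompute every hash in order; any edited entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            expected = self._digest(prev_hash, entry["record"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


policy = Policy({"http.get"}, {"lab.example"})
log = AuditLog()
for capability in ("http.get", "http.post"):
    allowed = policy.permits(capability, "lab.example")
    log.append({"capability": capability, "host": "lab.example", "allowed": allowed})

print(log.verify())                          # chain is intact
log.entries[0]["record"]["allowed"] = False  # tamper with history
print(log.verify())                          # tampering is detected
```

Because each hash depends on the one before it, changing any past record invalidates that entry's stored hash when the chain is recomputed, which is what makes the log verifiable after the fact.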

Gotcha

The biggest limitation is maturity. With only two GitHub stars and no community traction, this is clearly a proof-of-concept built for a conference talk, not a battle-tested production tool. The codebase quality, edge case handling, and real-world performance remain unknown. You're essentially adopting experimental research code, so expect bugs, missing features, and potentially breaking changes.

The operational complexity is substantial. Setting up isolated lab environments, configuring Burp REST API access, writing Policy DSL files, and managing the dependencies for multiple verification tools creates significant friction. For small projects or quick assessments, this overhead dwarfs any benefit from 'intelligent' planning. The agentic architecture also adds latency—simulated annealing optimization and active sampling inference are computationally expensive compared to running a fixed template set through Nuclei. If you need to scan hundreds of APIs quickly, the planning overhead will become a bottleneck. Finally, the framework's reliance on external tools (Nuclei, Burp) means you inherit their limitations and licensing requirements. Burp Suite Professional isn't cheap, and if that's a dependency, you need to factor it into your evaluation.

Verdict

Use if: You're a security researcher or red team member exploring the intersection of AI agents and offensive security, you need explainable and auditable testing workflows for compliance-heavy environments, or you're willing to invest time customizing an experimental framework for specialized reconnaissance scenarios. The probabilistic endpoint mapping and information-gain optimization are genuinely novel contributions worth studying.

Skip if: You need production-ready tooling for immediate deployment, you can't justify the operational overhead of isolated labs and policy configuration, or you simply want fast API vulnerability scanning. In those cases, stick with established tools: Nuclei for speed, Burp Suite Professional for comprehensive testing, or OWASP ZAP for accessible open-source scanning. This framework is a research vehicle, not a Nuclei replacement, so match your tool choice to your goals.
