Inside an Agentic API Security Framework: Probabilistic Graphs, Simulated Annealing, and the Future of Offensive Testing

Hook

What if your security scanner could reason about uncertainty, plan cost-effective test sequences using thermodynamic optimization, and build cryptographic audit trails—all without hardcoded vulnerability signatures?

Context

Traditional API security scanners follow deterministic playbooks: crawl endpoints, fire template-based tests, parse responses for known signatures. This works for mature attack patterns but struggles with modern challenges—incomplete API documentation, sprawling microservice graphs, and the combinatorial explosion of parameter permutations. The Offensive_AI_CON_2025_Framework, a reference implementation accompanying Kurtis Shelton’s talk at Offensive AI Con 2025, explores a radically different approach: agentic AI that treats reconnaissance as probabilistic inference and test planning as constrained optimization.

This is a lab-safe reference implementation demonstrating how concepts from machine learning (Bayesian posterior inference, active sampling), operations research (simulated annealing), and distributed systems (cryptographic provenance chains) can reimagine offensive security workflows. The framework transforms a single URL into a reproducible, verifiable test case through five coordinated agents: a discovery crawler that builds a Probabilistic Endpoint Graph (PEG) with multi-modal fingerprinting, a contract inference engine that estimates API schemas under uncertainty, a meta-planner that optimizes verification steps using simulated annealing, execution adapters that interface with tools like Nuclei and Burp, and a verifier ensemble that validates findings through counterfactual analysis. It’s lab-safe by design, with a Policy DSL enforcing allowlists, rate limits, and kill switches.

Technical Insight

The architecture revolves around a multi-stage pipeline where each component contributes to a shared evidence store with cryptographic provenance. The discovery agent performs polite HEAD/GET crawls to map API surfaces, but instead of simple endpoint lists, it constructs a Probabilistic Endpoint Graph (PEG)—nodes represent endpoints, edges capture relationships (redirects, hypermedia links), and each node carries multi-modal fingerprints: response headers, latency distributions, payload sizes, and TLS handshake characteristics. This fingerprinting enables differential analysis later—if two endpoints share identical header signatures but differ in latency by 300ms, the framework flags potential backend architecture divergence.
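The differential-latency heuristic above can be sketched in a few lines. This is an illustrative model only; the node shape (`EndpointNode`), field names, and the 300 ms threshold are assumptions for the sketch, not the framework's actual PEG schema.

```python
from dataclasses import dataclass, field

@dataclass
class EndpointNode:
    # Illustrative PEG node; the real framework also tracks payload sizes
    # and TLS handshake characteristics.
    url: str
    headers: dict                       # response header fingerprint
    latencies_ms: list = field(default_factory=list)

    def median_latency(self) -> float:
        s = sorted(self.latencies_ms)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def divergence_flag(a: EndpointNode, b: EndpointNode,
                    latency_gap_ms: float = 300.0) -> bool:
    """Flag potential backend divergence: identical header signatures
    but a large gap in observed median latency."""
    same_headers = a.headers == b.headers
    gap = abs(a.median_latency() - b.median_latency())
    return same_headers and gap >= latency_gap_ms

a = EndpointNode("/v1/users", {"server": "gw"}, [40, 45, 50])
b = EndpointNode("/v1/orders", {"server": "gw"}, [380, 400, 420])
print(divergence_flag(a, b))  # True: same headers, ~355 ms apart
```

The point of the multi-modal fingerprint is that no single signal is conclusive; agreement on one axis combined with divergence on another is what carries information.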

The contract inference engine is where things get interesting. Rather than requiring OpenAPI specs, it derives schemas from observed traffic using what the README describes as ‘posterior-like inference.’ For each parameter, the engine estimates type (string, int, enum), requiredness (required, optional, conditional), and crucially, uncertainty. Fields with high uncertainty get tagged for active-sampling prioritization—the framework will generate targeted probes to reduce ambiguity. This is conceptually similar to Bayesian active learning: focus reconnaissance budget on uncertain regions of the parameter space. The output is artifacts/inferred_schema.json, a coarse contract that feeds the planner.
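One way to picture the uncertainty-driven prioritization is Shannon entropy over per-parameter type guesses: fields whose observed values disagree about their type get probed first. The type-guessing rules and field names below are hypothetical; the framework's actual inference is richer than this sketch.

```python
import math
from collections import Counter

def infer_type(value: str) -> str:
    # Crude per-observation type guess (illustrative only).
    if value.isdigit():
        return "int"
    if value.lower() in ("true", "false"):
        return "bool"
    return "string"

def field_uncertainty(observations: list) -> float:
    """Normalized Shannon entropy over inferred types:
    0 = fully consistent, 1 = maximally ambiguous."""
    counts = Counter(infer_type(v) for v in observations)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(3)  # three candidate types

def sampling_priority(fields: dict) -> list:
    # Highest-entropy fields first: spend the probe budget where
    # ambiguity is greatest.
    return sorted(fields, key=lambda f: field_uncertainty(fields[f]),
                  reverse=True)

obs = {"user_id": ["42", "7", "19"],        # consistently int: low uncertainty
       "flag": ["true", "1", "yes"]}         # mixed types: high uncertainty
print(sampling_priority(obs))  # ['flag', 'user_id']
```

This mirrors the Bayesian active-learning intuition in the text: the expected information gain of a probe is highest exactly where the current posterior is flattest.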

The meta-planner is the framework’s brain. Given the inferred schema and a policy file, it uses simulated annealing to optimize verification steps. The objective function balances three competing forces: information gain (will this test reveal new vulnerabilities?), cost (API rate limits, time budget), and risk (could this test trigger alerts or cause service disruption?). Simulated annealing—a thermodynamic optimization algorithm—allows the planner to explore suboptimal paths early (high temperature) before converging to locally optimal test sequences (low temperature). Each step comes with an explainable rationale stored in artifacts/plan.json. Here’s how the full pipeline runs from the CLI:

```shell
python -m agentic_api.cli discover --base-url http://127.0.0.1:5000 --policy ./configs/policy.dsl
python -m agentic_api.cli infer
python -m agentic_api.cli plan --policy ./configs/policy.dsl --verify-only
python -m agentic_api.cli run --base-url http://127.0.0.1:5000 --policy ./configs/policy.dsl
```

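A minimal sketch of the planner's annealing loop, assuming a toy objective. The step names, scores, weights, and cooling schedule are all invented for illustration; the framework derives its objective from the inferred schema and policy file.

```python
import math, random

# Hypothetical test-step candidates: (name, info_gain, cost, risk).
STEPS = [("probe_auth", 0.9, 0.3, 0.2), ("fuzz_id", 0.7, 0.5, 0.4),
         ("enum_verbs", 0.4, 0.1, 0.1), ("diff_exec", 0.8, 0.6, 0.3),
         ("header_scan", 0.3, 0.2, 0.1)]

def objective(plan, budget=1.0):
    """Reward information gain, penalize risk; plans over the cost
    budget are infeasible."""
    cost = sum(STEPS[i][2] for i in plan)
    if cost > budget:
        return float("-inf")
    gain = sum(STEPS[i][1] for i in plan)
    risk = sum(STEPS[i][3] for i in plan)
    return gain - 0.5 * risk

def anneal(iters=2000, t0=1.0, seed=0):
    rng = random.Random(seed)
    plan, best, best_score = [], [], objective([])
    for k in range(iters):
        t = t0 * (0.995 ** k)              # geometric cooling schedule
        cand = plan.copy()
        i = rng.randrange(len(STEPS))      # propose toggling one step in/out
        if i in cand:
            cand.remove(i)
        else:
            cand.append(i)
        delta = objective(cand) - objective(plan)
        # Accept improvements always; accept regressions with probability
        # exp(delta / T), which shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / max(t, 1e-9)):
            plan = cand
        if objective(plan) > best_score:
            best, best_score = plan.copy(), objective(plan)
    return sorted(best)

print(anneal())  # a feasible, high-scoring subset of STEPS
```

The high-temperature phase accepts cost-increasing swaps, which is what lets the planner escape locally attractive but globally mediocre test orderings before the schedule freezes.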
The run command chains all stages, but splitting them exposes intermediate artifacts—critical for auditing and iteration. The policy DSL enforces safety constraints: host/method allowlists prevent accidental external scans, rate limits throttle requests, and a kill switch enables emergency shutdowns. This is enterprise-paranoia design, acknowledging that agentic systems can behave unpredictably.

Execution happens through MCP-style tooling adapters with typed contracts. The HTTP adapter supports differential-execution mode—send identical requests to two endpoints and compare responses byte-by-byte to detect inconsistencies. Nuclei and Burp adapters wrap external tools, translating plan steps into their native formats. The verifier ensemble performs counterfactual validation: if the framework claims an endpoint is vulnerable to injection, it generates a benign mutation and confirms the vulnerability disappears. This multi-axis confidence taxonomy (confirmed, probable, speculative) reduces false positives.
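The byte-by-byte comparison and the counterfactual confidence taxonomy can be sketched as follows. Function names and the exact promotion rules are assumptions; only the three confidence labels (confirmed, probable, speculative) come from the source.

```python
def first_divergence(a: bytes, b: bytes):
    """Byte-by-byte diff of two responses: returns the first differing
    offset, or None if identical (a length mismatch counts as divergence
    at the end of the shorter response)."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return None if len(a) == len(b) else n

def classify(attack_diverges: bool, benign_mutation_clean: bool) -> str:
    """Counterfactual confidence: promote a finding to 'confirmed' only
    when the suspicious behavior appears with the attack input AND
    disappears under a benign mutation of the same request."""
    if attack_diverges and benign_mutation_clean:
        return "confirmed"
    if attack_diverges:
        return "probable"     # no counterfactual support yet
    return "speculative"

print(first_divergence(b"HTTP/1.1 200", b"HTTP/1.1 500"))  # 9
print(classify(True, True))                                 # confirmed
```

The counterfactual step is what separates this from naive signature matching: a divergence that persists under benign input is evidence of environmental noise, not a vulnerability.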

Evidence flows into an immutable audit log (artifacts/audit/audit.jsonl) with cryptographic provenance chains—each stage signs its output, creating a tamper-evident trail from raw traffic to final findings. Evidence is redacted before storage (no raw payloads in central logs), balancing forensic utility with data minimization. The framework even includes a drift detector that compares PEGs across time (python -m agentic_api.cli drift --old artifacts/peg_old.json --new artifacts/peg.json --threshold 0.9), flagging when API surfaces change unexpectedly—useful for detecting shadow deployments or configuration drift in microservice environments.
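A hash chain is the simplest way to make a JSONL audit log tamper-evident, and can serve as a mental model for the provenance chain described above. This sketch uses plain SHA-256 chaining; the framework's actual format and signing scheme are not documented in the README, so treat the record shape here as hypothetical.

```python
import hashlib, json

def append_event(log: list, stage: str, payload: dict) -> dict:
    """Append a tamper-evident entry: each record hashes its (already
    redacted) payload together with the previous record's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"stage": stage, "payload": payload, "prev": prev},
                      sort_keys=True)
    entry = {"stage": stage, "payload": payload, "prev": prev,
             "hash": hashlib.sha256(body.encode()).hexdigest()}
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash from the genesis value; any edit to an
    earlier record breaks all links after it."""
    prev = "0" * 64
    for e in log:
        body = json.dumps({"stage": e["stage"], "payload": e["payload"],
                           "prev": prev}, sort_keys=True)
        if e["prev"] != prev or \
           e["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = e["hash"]
    return True

log = []
append_event(log, "discover", {"endpoints": 12})
append_event(log, "infer", {"fields": 7})
print(verify_chain(log))               # True
log[0]["payload"]["endpoints"] = 99    # tamper with an early record
print(verify_chain(log))               # False
```

Signing each entry (as the framework does) adds authorship on top of this: a hash chain alone proves ordering and integrity, while signatures prove which agent produced each stage's output.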

Gotcha

This is a conference demo reference implementation, not production-ready tooling. With two GitHub stars and minimal community adoption, the framework is designed for lab environments and research purposes. The README emphasizes ‘lab-safe usage’ with explicit warnings to ‘use only in isolated labs or with explicit, written authorization,’ but provides limited guidance on scaling to large API surfaces or integration into CI/CD pipelines.

The Burp and Nuclei integrations are adapters that require separate installation and licensing of these external tools. The README references a Policy DSL file (./configs/policy.dsl) in command examples, though the actual policy syntax and configuration options would need to be explored in the repository. The framework generates extensive artifacts (PEG, schemas, plans, evidence cards) but the README doesn’t detail storage requirements or performance characteristics for large-scale deployments. As a research artifact tied to a specific conference talk, this framework is best approached as architectural inspiration rather than a drop-in security solution.

Verdict

Use this framework if you’re researching agentic approaches to security testing, experimenting with AI-driven reconnaissance in isolated lab environments, or building custom tooling and need architectural inspiration—the Probabilistic Endpoint Graph, active-sampling prioritization, and simulated annealing planner are genuinely novel ideas worth studying. It’s also valuable if you attended (or want to understand) Kurtis Shelton’s Offensive AI Con 2025 talk and need a working reference implementation to explore the concepts hands-on. Skip it if you need production-ready API security scanning—established tools like OWASP ZAP, Burp Suite Professional, or Nuclei offer better reliability, community support, and integration ecosystems. Also skip if you’re uncomfortable with early-stage research code or lack the Python expertise to debug and extend it yourself. This framework represents an experimental approach to offensive security workflows designed for controlled laboratory environments.
