OpenAnt: Why This Open-Source Security Tool Makes LLMs Prove Exploitability Before Crying Wolf
Hook
Most AI security scanners find thousands of 'vulnerabilities' in your codebase. OpenAnt's solution? Make the AI actually exploit them first—what survives is real.
Context
The AI-powered SAST market has a dirty secret: every vendor claims to find more vulnerabilities than its competitors, but nobody talks about false positive rates. Semgrep finds patterns. CodeQL traces reachability through dataflow analysis. Snyk Code uses ML to flag suspicious code. They all produce lists of potential issues that a security researcher must triage by hand to determine actual exploitability.
This creates a perverse incentive structure. Commercial tools optimize for recall (finding every possible issue) because missing a real vulnerability is catastrophic, and the resulting false positives burn security team time. OpenAnt, from security startup Knostic, takes a different approach borrowed from penetration testing: don't just detect potential vulnerabilities, attempt to exploit them. The analysis step uses LLM reasoning to identify suspicious code patterns. The verification step then spins up an agentic exploitation loop that writes actual attack payloads using tool-calling APIs. Only findings that survive automated exploitation attempts make it into the final report. It's SAST meets automated red-teaming, built on the thesis that LLMs are better at writing exploits than reasoning about abstract vulnerability conditions.
Technical Insight
OpenAnt's architecture splits orchestration and execution across language boundaries. A Go CLI acts as the state machine coordinator, managing project workspaces in ~/.openant/projects/<org>/<repo>/ with a project.json manifest and per-scan artifact directories. The actual vulnerability analysis happens in Python workers executing a six-stage pipeline: parse → enhance → analyze → verify → build-output → report. Each stage produces JSON artifacts consumed by the next, enabling resume-from-failure semantics if the LLM provider rate-limits you mid-scan.
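To make the resume-from-failure idea concrete, here is a minimal sketch of a staged pipeline that skips any stage whose JSON artifact already exists. Only the stage names come from the description above; the function `run_pipeline`, the `stage_fn` callback, and the `<stage>.json` file naming are assumptions made for illustration, not OpenAnt's actual layout.

```python
import json
from pathlib import Path
from typing import Callable

# Stage names come from the article; everything else here is hypothetical.
STAGES = ["parse", "enhance", "analyze", "verify", "build-output", "report"]


def run_pipeline(scan_dir: Path, stage_fn: Callable[[str, dict], dict]) -> list[str]:
    """Run stages in order, skipping any whose JSON artifact already exists.

    Returns the names of the stages that were actually executed.
    """
    scan_dir.mkdir(parents=True, exist_ok=True)
    executed: list[str] = []
    previous: dict = {}
    for stage in STAGES:
        artifact = scan_dir / f"{stage}.json"
        if artifact.exists():
            # Resume: reuse the artifact written by an earlier, interrupted run.
            previous = json.loads(artifact.read_text())
            continue
        previous = stage_fn(stage, previous)
        # Write to a temp file and rename so a crash never leaves a partial artifact.
        tmp = artifact.with_suffix(".json.tmp")
        tmp.write_text(json.dumps(previous))
        tmp.replace(artifact)
        executed.append(stage)
    return executed
```

If a provider rate limit raises an exception during the verify stage, calling `run_pipeline` again on the same directory re-runs only verify, build-output, and report.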
The parse stage extracts code structure using language-specific parsers (tree-sitter for most languages, custom handling for Go and Python). Enhancement adds control-flow graphs and data dependency annotations. Analysis uses the LLM to identify potential vulnerabilities by reasoning over enhanced code representations. Here's where it gets interesting: the verify stage doesn't just ask the model "is this exploitable?"—it gives the model a tool-calling interface and challenges it to demonstrate exploitation.
The verification phase implements what Knostic calls an "agentic exploitation loop." The LLM gets a Protocol-based interface to actual execution primitives: file system access, network sockets, and a sandboxed interpreter for the target language. It can write exploit payloads, execute them against the vulnerable code, and observe results. Here's a simplified version of the tool-calling contract from the Python adapters:
from __future__ import annotations  # defer evaluation of the annotations below

from typing import Protocol

# ExecutionResult and Response are assumed to be defined elsewhere;
# they are not shown in this simplified excerpt.


class ExploitationTools(Protocol):
    def read_file(self, path: str) -> str:
        """Read source file content"""
        ...

    def execute_code(self, language: str, code: str) -> ExecutionResult:
        """Execute exploit payload in sandboxed environment"""
        ...

    def make_request(self, url: str, method: str, payload: dict) -> Response:
        """Send HTTP request to running application"""
        ...

    def report_success(self, evidence: str) -> None:
        """Mark vulnerability as verified with exploitation proof"""
        ...
The LLM iterates through this tool-calling loop, trying different exploitation approaches until it either succeeds (vulnerability confirmed), exhausts its reasoning budget (filtered out as likely false positive), or determines the issue isn't exploitable. For example, if the analysis phase flags a potential SQL injection, the verify phase might generate multiple payloads testing different escape sequences, observe database error messages, and attempt data exfiltration before marking it as confirmed.
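The three possible outcomes of that loop can be sketched as a small driver function. This is an illustrative simplification, not OpenAnt's code: the `Verdict` enum, the `Step` shape, the `next_step` callback standing in for the model, and the `max_steps` budget are all hypothetical, and the final decision is modeled as a return value rather than a `report_success` tool call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    CONFIRMED = "confirmed"                # exploit demonstrated
    NOT_EXPLOITABLE = "not_exploitable"    # model concluded the finding is benign
    BUDGET_EXHAUSTED = "budget_exhausted"  # filtered out as a likely false positive


@dataclass
class Step:
    """One model turn: either a tool call or a final decision (hypothetical shape)."""
    tool: Optional[str] = None
    args: Optional[dict] = None
    final: Optional[Verdict] = None


def verify_finding(
    next_step: Callable[[list], Step],
    tools: dict[str, Callable[..., str]],
    max_steps: int = 20,
) -> Verdict:
    """Drive the model until it reports a verdict or the step budget runs out."""
    transcript: list = []
    for _ in range(max_steps):
        step = next_step(transcript)
        if step.final is not None:
            return step.final
        # Run the requested tool and feed the observation back to the model.
        observation = tools[step.tool](**(step.args or {}))
        transcript.append((step.tool, step.args, observation))
    return Verdict.BUDGET_EXHAUSTED
```

The budget is what turns "the model could not find an exploit" into a filtered finding, which is why the size of that budget directly affects how many real issues get dropped.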
Provider abstraction happens at two layers. The Go binary handles API key management and serializes model configurations, supporting Anthropic, OpenAI, Google, and OpenRouter. Python adapters implement the actual LLM client code with a shared Protocol for tool-calling. This split enables per-phase model selection—you can use Claude Opus for the expensive verify phase where reasoning quality matters, then switch to Gemini Flash for report generation where you're just formatting JSON into markdown.
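Per-phase model selection boils down to a lookup with a fallback. The sketch below assumes a hypothetical `ModelChoice` record and placeholder model labels; it does not reflect OpenAnt's real configuration schema or any provider's actual model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    provider: str
    model: str


# Hypothetical default and overrides; the model labels are placeholders.
DEFAULT = ModelChoice("anthropic", "default-model")
PHASE_OVERRIDES = {
    "verify": ModelChoice("anthropic", "high-reasoning-model"),
    "report": ModelChoice("google", "low-cost-model"),
}


def model_for(phase: str) -> ModelChoice:
    """Return the override for a phase, falling back to the default."""
    return PHASE_OVERRIDES.get(phase, DEFAULT)
```

Phases without an override, such as parse or analyze in this sketch, simply inherit the default.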
The project embeds a default configuration using Claude across all phases to minimize setup friction, but the reality is more nuanced. The README explicitly calls out "cross-provider tool-call quirks" as an ongoing issue. Anthropic's Claude supports parallel tool calls and strict schema enforcement. OpenAI's models sometimes hallucinate tool names. Google's Gemini has different token limits for tool descriptions. These aren't abstracted away cleanly—swapping providers in the verify phase can produce different detection rates because the exploitation success depends on the model's ability to reason about tool usage, not just answer questions.
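One common way to soften the hallucinated-tool-name problem is to validate each call before dispatching it and hand the error back to the model as an observation. The following is a generic sketch of that guard, not something the README describes; the `dispatch` function and the `handlers` mapping are hypothetical, and only the four tool names come from the contract shown earlier.

```python
from typing import Callable

# Tool names taken from the contract shown in the article.
KNOWN_TOOLS = {"read_file", "execute_code", "make_request", "report_success"}


def dispatch(name: str, args: dict, handlers: dict[str, Callable[..., str]]) -> str:
    """Run a tool call, or return an error string the model can react to."""
    if name not in KNOWN_TOOLS or name not in handlers:
        # Feed the mistake back rather than crashing the loop.
        return f"error: unknown tool '{name}'; available: {sorted(KNOWN_TOOLS)}"
    return handlers[name](**args)
```

A guard like this keeps the loop alive, but each wasted turn still counts against the exploitation budget, so it cannot fully level detection rates across providers.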
Token economics dominate the architecture. A medium-sized repository can consume millions of tokens per scan because each code file gets processed through multiple LLM calls: initial analysis, enhancement reasoning, vulnerability detection, and potentially dozens of tool-calling iterations during exploitation attempts. The per-phase provider selection exists specifically to manage costs—you might spend $50 on Opus calls for verification but only $2 on Flash calls for reports.
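The arithmetic behind that split is simple enough to sketch. The token counts and per-million prices below are placeholders chosen to reproduce the $50 and $2 figures above, not measured usage or real provider pricing, and the function ignores the usual difference between input and output token prices.

```python
def estimate_cost(
    tokens_by_phase: dict[str, int],
    usd_per_million: dict[str, float],
) -> dict[str, float]:
    """Return the estimated USD cost per phase, plus a 'total' entry."""
    costs = {
        phase: tokens / 1_000_000 * usd_per_million[phase]
        for phase, tokens in tokens_by_phase.items()
    }
    costs["total"] = sum(costs.values())
    return costs


# Placeholder numbers for illustration only, not real usage or pricing data.
tokens = {"verify": 2_000_000, "report": 4_000_000}
prices = {"verify": 25.0, "report": 0.5}
print(estimate_cost(tokens, prices))
# {'verify': 50.0, 'report': 2.0, 'total': 52.0}
```

In this example the report phase processes twice as many tokens as verify yet accounts for under 4% of the bill, which is the whole point of routing cheap work to a cheap model.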
Gotcha
OpenAnt's exploitation-based validation has a fundamental limitation: nothing bounds the false-negative rate. If the LLM can't figure out how to exploit a vulnerability, the finding is filtered out as a false positive, even when the flaw is real. This works fine for straightforward issues like SQL injection or path traversal, but complex vulnerabilities requiring multi-step state manipulation often slip through. Race conditions are particularly problematic: the model needs to reason about concurrent execution and timing, then write exploit code that reliably triggers the race window. Cryptographic implementation flaws are another blind spot, since they require mathematical reasoning beyond pattern matching.
The filesystem-based state management creates operational friction in real-world CI/CD pipelines. Each scan is stored as an immutable artifact directory, but there are no locking semantics for concurrent access, so you can't run multiple scans of different branches simultaneously without risking state corruption in project.json. The serial pipeline execution means scanning a monorepo takes minutes to hours depending on size: fine for security researcher workflows, problematic for pull request checks. Token costs scale catastrophically with codebase size since there's no incremental scanning beyond commit-level granularity. Refactor a large file? The entire thing is rescanned and you pay for those tokens again.
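If you have to run scans from CI anyway, one workaround is to serialize them yourself with an exclusive lock file in a wrapper script. This is a sketch of that general technique, not an OpenAnt feature; the `project_lock` helper and the `.scan.lock` file name are hypothetical.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def project_lock(project_dir: Path):
    """Hold an exclusive lock file for the duration of a scan.

    Raises FileExistsError if another process already holds the lock.
    """
    lock_path = project_dir / ".scan.lock"  # hypothetical file name
    # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        lock_path.unlink()
```

A job that hits `FileExistsError` can retry later or fail fast. The caveat is that a scan killed mid-run leaves a stale lock behind, so a wrapper would also need a cleanup or timeout policy.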
Verdict
Use if: You're a security researcher or open-source maintainer who needs to audit third-party dependencies or legacy codebases for logic flaws that traditional SAST misses, you have the budget for LLM API costs (expect $20-100 per scan for medium repositories), and you value exploit validation over scan speed. This genuinely reduces false-positive triage time compared to CodeQL or Semgrep when you're hunting for real vulnerabilities rather than doing compliance checking.

Skip if: You need CI/CD integration with sub-minute feedback loops, you're scanning massive monorepos where token costs would be prohibitive, you need guarantees around false negatives (the LLM-based exploitation has unbounded failure modes on complex vulnerabilities), or you're looking for memory corruption bugs that require fuzzing rather than logic-based exploits. This is a research tool for deep security audits, not a developer-facing linter.