Vulnhuntr: How LLMs Discovered Real Zero-Days That Static Analysis Missed

Hook

A Python static analysis tool just did what Semgrep, CodeQL, and traditional SAST couldn't: autonomously discover 8 confirmed zero-day vulnerabilities in projects with over 67,000 GitHub stars combined. No human guidance, no pre-written rules—just an LLM tracing execution paths across multiple files.

Context

Traditional static analysis tools operate on pattern matching. You write a rule that says "SQL queries constructed from user input are bad," and Semgrep dutifully flags every instance. But what about vulnerabilities that require tracing data flow through five function calls across three files, where the unsanitized input passes through helper functions that look innocuous in isolation?

This is the Achilles' heel of conventional SAST: multi-step vulnerabilities where user input enters through one endpoint, passes through utility functions that perform no validation, and eventually reaches a dangerous sink in a completely different module. Human security researchers find these by manually following execution paths and building mental maps of how data flows through the system. Vulnhuntr automates this reasoning process with large language models, combining the structured code understanding of AST parsing with the contextual reasoning of models such as Claude and GPT-4. The result is a tool that can request additional context iteratively ("show me this function definition, now show me what calls it") until it builds a complete vulnerability chain from entry point to exploit.

Technical Insight

Vulnhuntr's architecture revolves around a two-phase analysis workflow. First, it uses Jedi (a static analysis library for Python) to parse your codebase and identify potential entry points: HTTP route handlers, API endpoints, and functions that process user input. Then it feeds these entry points to an LLM with a specific prompt: find vulnerability chains within seven classes, namely local file inclusion (LFI), arbitrary file overwrite (AFO), remote code execution (RCE), cross-site scripting (XSS), SQL injection, server-side request forgery (SSRF), and insecure direct object reference (IDOR).
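To make the first phase concrete, here is a minimal sketch of entry-point discovery. It uses the standard library's ast module as a stand-in for Jedi, and both the sample source and the find_route_handlers helper are assumptions made for illustration, not Vulnhuntr's own code.

```python
import ast

# Hypothetical sample module; not taken from any real project.
SAMPLE_SOURCE = '''
from flask import Flask

app = Flask(__name__)

@app.route("/download/<path:filename>")
def download_file(filename):
    return filename

def helper(value):
    return value.strip()
'''


def find_route_handlers(source: str) -> list[str]:
    """Return names of functions decorated with a call like `<obj>.route(...)`."""
    handlers = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            # Match decorators of the form `<obj>.route(...)`.
            if (
                isinstance(decorator, ast.Call)
                and isinstance(decorator.func, ast.Attribute)
                and decorator.func.attr == "route"
            ):
                handlers.append(node.name)
                break
    return handlers


print(find_route_handlers(SAMPLE_SOURCE))  # ['download_file']
```

A real scanner would need to recognize many more entry-point shapes (class-based views, other frameworks' decorators, CLI handlers), which is part of why a dedicated analysis library is useful here.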

What makes this interesting is the iterative context retrieval mechanism. The LLM doesn't receive your entire codebase upfront. Instead, it gets a file and can request additional context by asking for specific functions, class definitions, or imported modules. This happens through a tool-calling interface where the LLM says "I need to see the implementation of validate_user_input() from auth.py" and Vulnhuntr's orchestration layer fetches and injects that code into the conversation.
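The request-and-inject loop can be sketched roughly as follows. Everything here is hypothetical: the CODEBASE dictionary, the context_requests and finding keys, and the scripted fake_model stand in for a real symbol index and a real LLM call, and they do not reflect Vulnhuntr's actual message format.

```python
from typing import Callable

# Hypothetical stand-in for a parsed codebase: symbol name -> source text.
CODEBASE = {
    "file_service.get_file": (
        "def get_file(filename):\n    return _read_from_disk(filename)"
    ),
    "disk_utils._read_from_disk": (
        "def _read_from_disk(path):\n    return open(f'/var/data/{path}').read()"
    ),
}


def analyze(entry_source: str, ask_model: Callable[[str], dict], max_rounds: int = 5) -> dict:
    """Send code to the model, resolving its context requests until it stops asking."""
    context = [entry_source]
    reply: dict = {}
    for _ in range(max_rounds):
        reply = ask_model("\n\n".join(context))
        requested = [n for n in reply.get("context_requests", []) if n in CODEBASE]
        if not requested:
            break  # The model has what it needs, or asked for unknown symbols.
        context.extend(CODEBASE[name] for name in requested)
    return reply


def fake_model(prompt: str) -> dict:
    """Scripted stub that follows the call chain one hop per round."""
    if "_read_from_disk(path)" in prompt:
        return {"finding": "LFI", "context_requests": []}
    if "def get_file" in prompt:
        return {"finding": None, "context_requests": ["disk_utils._read_from_disk"]}
    return {"finding": None, "context_requests": ["file_service.get_file"]}


entry = "def download_file(filename):\n    return file_service.get_file(filename)"
result = analyze(entry, fake_model)
print(result["finding"])  # LFI
```

The max_rounds cap is a design choice in this sketch: it bounds cost when a model keeps asking for more context without converging.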

Here's a simplified example of how it traces a path traversal vulnerability:

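# routes.py - Entry point
from flask import Flask
import file_service

app = Flask(__name__)

@app.route('/download/<path:filename>')
def download_file(filename):
    # filename is attacker-controlled: it comes straight from the URL
    return file_service.get_file(filename)

# file_service.py - No validation
from disk_utils import _read_from_disk

def get_file(filename):
    return _read_from_disk(filename)

# disk_utils.py - Dangerous sink
def _read_from_disk(path):
    # Deliberately vulnerable: path is interpolated with no containment check
    with open(f"/var/data/{path}", 'r') as f:
        return f.read()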

A pattern-based rule might flag the open() call as suspicious, but it can't determine on its own whether path originated from user input two calls earlier in a different file. Vulnhuntr's LLM sees the route decorator in routes.py, recognizes that filename comes from the URL path parameter, requests the definition of file_service.get_file(), follows it to _read_from_disk(), and identifies the complete chain: user-controlled input → no validation → file system operation.
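Why the sink is dangerous becomes clearer when the path arithmetic is spelled out. The sketch below assumes a traversal string such as ../../etc/passwd reaches the sink unchanged; resolve_unsafely and is_inside_base are illustrative helper names, and the check is purely lexical (a production fix would also resolve symlinks, for example with os.path.realpath).

```python
import posixpath

BASE_DIR = "/var/data"


def resolve_unsafely(user_path: str) -> str:
    """Mirror the f-string in _read_from_disk, then normalize to show where it lands."""
    return posixpath.normpath(f"{BASE_DIR}/{user_path}")


def is_inside_base(user_path: str) -> bool:
    """Return True only if the normalized path stays under BASE_DIR."""
    resolved = resolve_unsafely(user_path)
    return posixpath.commonpath([BASE_DIR, resolved]) == BASE_DIR


print(resolve_unsafely("../../etc/passwd"))  # /etc/passwd
print(is_inside_base("../../etc/passwd"))    # False
print(is_inside_base("reports/q1.txt"))      # True
```

A containment check like is_inside_base, placed anywhere along the chain, is the kind of validation layer whose absence the LLM is looking for.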

The confidence scoring system adds practical value. Each finding gets a score from 1 to 10 based on how certain the LLM is about exploitability. Scores of 8 or higher indicate high-confidence vulnerabilities where the LLM traced a clear path with no sanitization. A score of 7 marks a finding that deserves investigation. Scores below 7 are speculative: the code may look suspicious, but there may be a validation layer the LLM couldn't fully verify. This helps prioritize remediation efforts and reduces false-positive fatigue.
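A triage step over these scores might look like the sketch below. The bucket names and the sample findings are made up for illustration, and treating a score of exactly 7 as "investigate" is an assumption about how the middle of the scale should be handled.

```python
def triage(score: int) -> str:
    """Map a 1-10 confidence score to a review bucket."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score >= 8:
        return "high-confidence"
    if score == 7:
        return "investigate"  # Assumed middle band between the two thresholds.
    return "speculative"


# Hypothetical findings as (description, score) pairs.
findings = [
    ("LFI in download_file", 9),
    ("possible SSRF in fetch_url", 7),
    ("possible XSS in render_note", 4),
]

# Review the most certain findings first.
for description, score in sorted(findings, key=lambda item: item[1], reverse=True):
    print(f"{score:>2}  {triage(score):<15}  {description}")
```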

Vulnhuntr supports multiple LLM backends through a provider abstraction layer. Claude (specifically claude-3-5-sonnet-20241022) is the recommended choice because it's proven most effective at following complex call chains and producing structured output. GPT-4 works but with slightly lower accuracy. The experimental Ollama support for local models exists but performs poorly—open-source models struggle with the structured output format and multi-step reasoning required for vulnerability chain analysis.

The tool's real-world validation is compelling: it discovered remote code execution vulnerabilities in gpt_academic, arbitrary file overwrites in ComfyUI, path traversal issues in Ragflow, and multiple other CVEs in production AI/ML infrastructure. These weren't simple bugs that Bandit or Semgrep would catch—they were multi-file execution chains that required understanding how user input propagates through decorator patterns, async workflows, and framework abstractions common in modern Python web applications.

Gotcha

Vulnhuntr is locked to Python 3.10 specifically due to incompatibilities with the Jedi parsing library. Not 3.11, not 3.9—exactly 3.10. This is a significant operational constraint if your codebase targets newer Python versions, and it means you might need to maintain a separate environment just to run security scans. The maintainers acknowledge this as a known issue without a clear timeline for resolution.

The cost model can spiral quickly. Because Vulnhuntr attempts to maximize context in each LLM request (filling the context window with as much relevant code as possible), and because the analysis is iterative, a single scan of a medium-sized codebase can run to millions of API tokens rather than thousands. With Claude's pricing, analyzing a complex application could cost $50-200 depending on codebase size and vulnerability density. The docs explicitly warn you to set spending limits on your LLM provider accounts. This makes Vulnhuntr impractical for continuous integration pipelines where you'd run it on every commit; it's better suited to periodic deep-dive security audits.
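A back-of-the-envelope estimate helps when setting a spending limit. In the sketch below, the token counts and the per-million-token prices are placeholder assumptions, not figures quoted from any provider's price list; substitute the current rates for the model you use.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate API spend from token counts and per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_mtok
    output_cost = (output_tokens / 1_000_000) * output_price_per_mtok
    return input_cost + output_cost


# Assumed scan: 10M input tokens, 0.5M output tokens,
# priced at a placeholder $3 (input) and $15 (output) per million tokens.
cost = estimate_cost_usd(10_000_000, 500_000, 3.0, 15.0)
print(f"${cost:.2f}")  # $37.50
```

Input tokens dominate in this scenario because each round resends the accumulated context, so trimming what gets sent matters more than trimming what comes back.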

The seven-vulnerability limitation is more restrictive than it first appears. While those classes cover common web vulnerabilities, Vulnhuntr won't find authentication bypasses, insecure deserialization, cryptographic failures, or business logic flaws. It's specialized for input-to-sink data flow vulnerabilities, not the full spectrum of application security issues. You still need complementary tools and manual review for comprehensive coverage.

Verdict

Use if: You're securing Python web applications or AI/ML infrastructure where complex, multi-file data flows create blind spots for traditional SAST tools. You have budget for commercial LLM APIs and can run scans on Python 3.10 codebases. You're hunting for novel vulnerability chains that connect user input to dangerous operations across multiple abstraction layers, especially in frameworks with heavy use of decorators, async patterns, or dynamic behavior. The tool excels in security audits of open-source dependencies and internal applications where the potential cost of a zero-day far exceeds the LLM API spend.

Skip if: You need broad language support beyond Python, can't use Python 3.10, or require comprehensive vulnerability detection including authentication, authorization, and cryptographic issues. Your security budget doesn't accommodate potentially expensive LLM API costs for exploratory scanning. You need CI/CD-integrated continuous scanning—the cost and runtime make this impractical for every commit. Traditional SAST tools like Semgrep or Bandit already cover your use case, or you're analyzing simple codebases without complex call chains where cheaper rule-based tools suffice.
