BurpGPT: Teaching Your Security Scanner to Think With LLMs

Hook

Traditional vulnerability scanners reliably flag well-known SQL injection patterns, yet they routinely miss the business logic flaw that lets users delete other people's accounts by changing a single URL parameter. This is the gap BurpGPT was designed to fill.

Context

Web application security testing has historically relied on pattern matching and signature-based detection. Tools like Burp Suite excel at finding common vulnerabilities—SQL injection, XSS, CSRF—by recognizing known attack patterns in HTTP traffic. But modern applications fail in more subtle ways: authorization bypasses through parameter tampering, business logic flaws in multi-step workflows, API misconfigurations that expose sensitive data through unintended endpoints.

These bespoke vulnerabilities require contextual understanding of what an application is trying to do versus what it actually does. A scanner can detect that a JWT is present, but it takes reasoning to understand that the application trusts client-supplied user IDs without server-side validation. Enter BurpGPT: a Burp Suite extension that pipes HTTP traffic through OpenAI's GPT models, asking an AI to reason about security implications rather than just match patterns. It's not replacing traditional scanners—it's adding a layer of contextual analysis that complements them.

Technical Insight

BurpGPT operates as a Burp Suite extension built on the Montoya API (the modern replacement for the legacy Extender API), requiring Burp version 2023.3.2 or later. The architecture is straightforward: it hooks into Burp's passive scanner, receives HTTP request/response pairs as Burp processes them, formats them according to user-defined prompts, sends them to OpenAI's API, and reports the AI's analysis back into Burp as informational issues.

The heart of the system is its placeholder-based prompt engine. Users define prompts with placeholders like {{url}}, {{method}}, {{request}}, and {{response}} that get replaced with actual traffic data (the exact placeholder names and brace syntax depend on the version you run, so check them against the extension's documentation). Here's a practical example of how you might configure a prompt to hunt for IDOR (Insecure Direct Object Reference) vulnerabilities:

Analyze this HTTP transaction for authorization issues:

URL: {{url}}
Method: {{method}}

Request:
{{request}}

Response:
{{response}}

Focus on:
1. Parameters that appear to be object identifiers (user IDs, account numbers, resource IDs)
2. Whether the response contains data that might belong to other users
3. Whether authorization checks seem to be missing based on predictable parameters
4. Any signs of horizontal or vertical privilege escalation opportunities

Provide specific exploitation steps if a vulnerability is found.

This prompt gets sent to GPT-4 or GPT-3.5-turbo for every HTTP transaction Burp processes (within your configured scope). The AI responds with natural language analysis, which BurpGPT parses and inserts as a Burp Scanner issue. The genius is in the flexibility—you can completely reshape the prompt to focus on different vulnerability classes depending on your testing phase.
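The substitution step can be sketched in a few lines of Java. This is a minimal illustration, not BurpGPT's actual implementation: the `PromptTemplate` class, its `render` method, and the `{{name}}` syntax are assumptions made for this example. It fills all placeholders in a single regex pass, so placeholder-like text inside captured traffic is never expanded a second time.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: fills {{name}} placeholders in a prompt template. */
final class PromptTemplate {
    // Matches {{name}} where name consists of letters, digits, or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    private PromptTemplate() {}

    /**
     * Replaces each known placeholder with its value in a single pass.
     * Unknown placeholders are left untouched so typos stay visible.
     */
    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        return matcher.replaceAll(match -> {
            String value = values.get(match.group(1));
            // quoteReplacement stops '$' and '\' in traffic data from being
            // interpreted as regex group references.
            return Matcher.quoteReplacement(value != null ? value : match.group());
        });
    }

    public static void main(String[] args) {
        String template = "URL: {{url}}\nMethod: {{method}}\n\nRequest:\n{{request}}";
        System.out.println(render(template, Map.of(
            "url", "https://example.test/api/users/42",
            "method", "GET",
            "request", "GET /api/users/42 HTTP/1.1")));
    }
}
```

The single pass matters for a security tool: with chained `String.replace` calls, a request body that happens to contain the text `{{response}}` would have that text substituted by a later call, letting captured traffic rewrite the prompt.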

Token management is critical when you're processing hundreds or thousands of requests. BurpGPT implements a configurable token budget system. You set a maximum token limit per request, bounded by the model's context window (4,096 tokens for the original gpt-3.5-turbo, 8,192 for base GPT-4, and 32,768 for gpt-4-32k), and the extension truncates request/response bodies to fit within those bounds. Because the context window covers both the prompt and the model's reply, the prompt budget has to leave room for the analysis. This is where you face real trade-offs: truncate too aggressively and you lose critical context; set limits too high and your API costs explode.
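A budget-based truncation step might look like the following sketch. It is not BurpGPT's actual logic: the `TokenBudget` class is hypothetical, and the four-characters-per-token ratio is a rough rule of thumb for English-like text rather than a real tokenizer, so treat the estimates as approximate.

```java
/** Hypothetical sketch of budget-based truncation; not BurpGPT's actual logic. */
final class TokenBudget {
    // Assumption: roughly 4 characters per token for English-like text.
    static final int CHARS_PER_TOKEN = 4;
    // Appended so the model (and the analyst) can tell the input was cut.
    static final String MARKER = "\n[...truncated...]";

    private TokenBudget() {}

    /** Rough token estimate: character count divided by CHARS_PER_TOKEN, rounded up. */
    static int estimateTokens(String text) {
        return (text.length() + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN;
    }

    /** Returns text unchanged if it fits the budget, otherwise a shortened copy. */
    static String truncate(String text, int maxTokens) {
        if (estimateTokens(text) <= maxTokens) {
            return text;
        }
        int maxChars = Math.max(0, maxTokens * CHARS_PER_TOKEN);
        if (maxChars <= MARKER.length()) {
            // Budget too small to fit the marker; keep as much raw text as allowed.
            return text.substring(0, maxChars);
        }
        return text.substring(0, maxChars - MARKER.length()) + MARKER;
    }
}
```

Cutting from the end is the simplest policy, but it favors headers over bodies. Depending on the vulnerability class you are hunting, keeping the start and the end of a response, or spending more of the budget on the request, may preserve more useful context.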

The extension integrates with Burp's native passive scanner framework, meaning findings appear in the Target -> Site map and Dashboard -> Issue activity views just like any other scanner result. From a workflow perspective, this is crucial—security testers don't need to learn a new interface or check multiple tools. Here's the basic flow from an implementation perspective:

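// Simplified conceptual flow (not actual BurpGPT code).
// Type names follow the Montoya API; sendToOpenAI, containsVulnerabilityIndicators,
// createAuditIssue, getUserConfiguredPrompt, and truncateToTokenBudget are placeholders.
import burp.api.montoya.http.message.HttpRequestResponse;
import burp.api.montoya.scanner.AuditResult;
import burp.api.montoya.scanner.ConsolidationAction;
import burp.api.montoya.scanner.ScanCheck;
import burp.api.montoya.scanner.audit.insertionpoint.AuditInsertionPoint;
import burp.api.montoya.scanner.audit.issues.AuditIssue;
import java.util.List;

public class BurpGPTPassiveScanner implements ScanCheck {

    @Override
    public AuditResult passiveAudit(HttpRequestResponse baseRequestResponse) {
        String prompt = buildPromptFromTemplate(baseRequestResponse);
        String gptAnalysis = sendToOpenAI(prompt);

        if (containsVulnerabilityIndicators(gptAnalysis)) {
            return AuditResult.auditResult(createAuditIssue(
                "GPT-Identified Potential Vulnerability",
                gptAnalysis,
                baseRequestResponse
            ));
        }
        return AuditResult.auditResult(List.of());
    }

    @Override
    public AuditResult activeAudit(HttpRequestResponse baseRequestResponse,
                                   AuditInsertionPoint insertionPoint) {
        return AuditResult.auditResult(List.of()); // passive-only check
    }

    @Override
    public ConsolidationAction consolidateIssues(AuditIssue newIssue, AuditIssue existingIssue) {
        return ConsolidationAction.KEEP_BOTH;
    }

    private String buildPromptFromTemplate(HttpRequestResponse reqResp) {
        String template = getUserConfiguredPrompt();
        return template
            .replace("{{url}}", reqResp.request().url())
            .replace("{{method}}", reqResp.request().method())
            // Truncate both sides: large request bodies can exhaust the budget too.
            .replace("{{request}}", truncateToTokenBudget(reqResp.request().toString()))
            .replace("{{response}}", truncateToTokenBudget(reqResp.response().toString()));
    }
}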
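The sendToOpenAI step in the flow above is left abstract. One way it could be filled in, assuming the Chat Completions endpoint and Java's built-in `java.net.http` client, is sketched below. The `OpenAiRequestBuilder` class and its method names are hypothetical, and the code only builds the request; sending it (for example with `HttpClient.send`) and parsing the reply are omitted.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical sketch of the sendToOpenAI step: builds, but does not send, the API call. */
final class OpenAiRequestBuilder {
    static final String ENDPOINT = "https://api.openai.com/v1/chat/completions";

    private OpenAiRequestBuilder() {}

    /** Escapes a string for embedding inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be \\u-escaped in JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds a single-message chat request body with the prompt as the user message. */
    static String buildBody(String model, String prompt) {
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(prompt) + "\"}]}";
    }

    /** Assembles the POST request with bearer authentication and a timeout. */
    static HttpRequest build(String apiKey, String model, String prompt) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
            .timeout(Duration.ofSeconds(60))
            .header("Authorization", "Bearer " + apiKey)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(buildBody(model, prompt)))
            .build();
    }
}
```

Hand-rolled escaping keeps the sketch dependency-free; a real extension would more likely use a JSON library. Either way, raw HTTP traffic is full of quotes, backslashes, and control characters, so the prompt must never be concatenated into the body unescaped.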

The extension's Java/Gradle foundation makes it maintainable and distributable through standard Burp extension mechanisms. You can load it via the BApp Store (for the community version) or manually via the Extensions tab in Burp. Configuration happens through a dedicated UI panel where you set your OpenAI API key, select the model, configure prompts, and set token limits.

One architectural decision worth noting: BurpGPT processes requests synchronously, meaning each HTTP transaction waits for GPT's response before moving on. With GPT API latencies ranging from 1-10 seconds depending on prompt complexity and load, this can slow down your testing workflow significantly if you're analyzing high-traffic applications. The extension doesn't implement request queuing or batch processing, which would be the obvious enhancement for production use.
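To make the queuing idea concrete, here is a minimal sketch of a bounded work queue built on `java.util.concurrent`. It is an illustration of the enhancement described above, not part of BurpGPT: the `AnalysisQueue` class is hypothetical, and the `analyzer` function stands in for whatever code actually calls the model.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical bounded work queue for LLM analysis calls. */
final class AnalysisQueue implements AutoCloseable {
    private final ExecutorService pool;
    private final Function<String, String> analyzer;

    /**
     * @param workers       number of concurrent API calls allowed
     * @param queueCapacity pending prompts held before back-pressure applies
     * @param analyzer      stand-in for the code that calls the model
     */
    AnalysisQueue(int workers, int queueCapacity, Function<String, String> analyzer) {
        this.analyzer = analyzer;
        this.pool = new ThreadPoolExecutor(
            workers, workers, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            // When the queue is full, the submitting thread runs the task itself,
            // which slows producers down instead of dropping work.
            new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Queues a prompt and returns a future that completes with the analysis. */
    CompletableFuture<String> submit(String prompt) {
        return CompletableFuture.supplyAsync(() -> analyzer.apply(prompt), pool);
    }

    /** Stops accepting new work; already-queued prompts still run. */
    @Override
    public void close() {
        pool.shutdown();
    }
}
```

With a design like this, the scan callback could return immediately and attach findings when each future completes, and the worker count doubles as a crude cap on concurrent API spend.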

Gotcha

The elephant in the room is data exfiltration. Every HTTP request and response you analyze gets sent to OpenAI's servers. If you're testing applications with PII, healthcare data, financial information, or anything subject to regulatory compliance (GDPR, HIPAA, PCI-DSS), you're potentially violating those regulations by transmitting that data to a third party. OpenAI's terms state that API data is not used for training by default, but that doesn't address data residency requirements or the simple fact that sensitive data is leaving your control.

The second major limitation is that the community edition—the open-source version in this repository—is explicitly discontinued. The README contains warnings that it's no longer maintained and may not function with current Burp Suite versions. The developers pivoted to a commercial 'Pro' version with additional features and ongoing support. While the source code remains available for learning and modification, you're essentially on your own if you want to use it in production. You'd need to fork it and maintain it yourself, which defeats the purpose of using a pre-built tool.

Finally, the quality of findings is entirely dependent on your prompt engineering skills and the inherent limitations of LLMs. GPT models hallucinate, produce inconsistent results, and sometimes misinterpret context. You'll get false positives that require manual triage, which adds analyst time rather than saving it. The tool works best as a hypothesis generator—it surfaces interesting possibilities that you then verify manually. If you're expecting it to definitively identify vulnerabilities with high confidence, you'll be disappointed.

Verdict

Use if: You're testing non-sensitive applications where data transmission to OpenAI is acceptable, you have budget for API costs ($0.01-0.03 per request adds up fast: a 5,000-request crawl comes to roughly $50-150), and you're hunting for business logic flaws or bespoke vulnerabilities that pattern-based scanners consistently miss. It shines in scenarios where you need contextual reasoning, such as unusual parameter combinations, multi-step workflow issues, or subtle authorization problems. Also use it if you're willing to invest time in prompt engineering and treat it as an augmentation tool rather than a replacement for manual analysis.

Skip if: You're working with any regulated or sensitive data, you need privacy guarantees, you're testing high-traffic applications where API latency would bottleneck your workflow, or you don't have budget for both OpenAI API costs and potentially the Pro license. Also skip if you're looking for a set-it-and-forget-it solution: this tool requires active management, prompt tuning, and manual verification of findings. For most enterprise environments, the data exfiltration risk makes this a non-starter unless you're willing to fork it and integrate with self-hosted LLM alternatives.
