Burpference: Teaching Burp Suite to Think with Local and Cloud LLMs
Hook
What if your web proxy could spot authentication bypasses and injection flaws without a single pre-written signature—just by reasoning about HTTP traffic like a security researcher would?
Context
Traditional web application security testing relies on pattern matching: regex-based scanners looking for known strings, signature databases hunting familiar exploit patterns, and manual testers following checklists. This works until it doesn’t—novel vulnerabilities, business logic flaws, and context-dependent security issues slip through because they don’t match known signatures. Meanwhile, large language models have proven capable of understanding code semantics, API behaviors, and security principles at an abstract level. But these two worlds rarely intersect during live testing.
Burpference bridges this gap by embedding LLM-powered analysis directly into Burp Suite’s proxy workflow. Instead of exporting HTTP archives for offline analysis or copy-pasting requests into ChatGPT, security testers get real-time AI feedback as they browse target applications. The extension transforms Burp from a passive traffic observer into an active reasoning engine, forwarding in-scope requests to configurable LLM endpoints and surfacing findings with the same severity-based color coding analysts already use. It’s a research project from Dreadnode that asks: what happens when we give offensive security tools the ability to reason rather than just match patterns?
Technical Insight
At its core, Burpference is a Jython-based Burp Suite extension that intercepts HTTP traffic from Burp’s proxy history. When Burp processes a response, the extension filters for in-scope items, excludes certain MIME types that cannot be meaningfully analyzed, and packages the request-response pair into JSON. This payload gets forwarded to a configured LLM endpoint—OpenAI, Anthropic, or a local Ollama instance—along with a security-focused system prompt that instructs the model to act as a penetration tester.
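The interception flow is easiest to see as code. The sketch below shows, in plain Python, roughly how a request/response pair might be serialized into the JSON payload the extension forwards; the field names and the `package_exchange` helper are illustrative assumptions, not burpference's actual schema.

```python
import json

def package_exchange(url, request_bytes, response_bytes):
    """Serialize one proxied HTTP exchange into a JSON payload for the
    LLM endpoint (field names are assumed, not burpference's schema)."""
    return json.dumps({
        "url": url,
        "request": request_bytes.decode("utf-8", errors="replace"),
        "response": response_bytes.decode("utf-8", errors="replace"),
    })

payload = package_exchange(
    "https://target.example/api/login",
    b"POST /api/login HTTP/1.1\r\nHost: target.example\r\n\r\n{}",
    b'HTTP/1.1 200 OK\r\n\r\n{"token": "..."}',
)
decoded = json.loads(payload)
```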
The architecture separates concerns cleanly: configuration files in the /configs directory define the provider name, base URL, API key, and model selection, while prompt templates in /prompts encode different analysis strategies. Want to focus on authentication bypass? Load the auth-focused prompt. Hunting for injection vectors? Switch to the injection template. The extension reads these at runtime without requiring code changes.
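A config file for a local Ollama endpoint might look something like the following; the exact key names vary by provider, so treat this as an assumed shape and check the samples shipped in /configs:

```json
{
  "provider": "ollama",
  "host": "http://localhost:11434/api/chat",
  "model": "llama3",
  "api_key": ""
}
```

Cloud providers would fill in `api_key` and point `host` at the vendor's API instead.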
When a response arrives, Burpference constructs an API request in a format compatible with major LLM providers, making it provider-agnostic. The extension sends HTTP traffic context to the LLM, which analyzes this data against its security knowledge and returns findings structured with severity levels.
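As a rough illustration, the forwarded request likely resembles the chat-completions shape that most providers and Ollama accept; the system prompt text and the `build_llm_request` helper below are placeholders, not the extension's actual prompt or code.

```python
import json

# Placeholder system prompt; burpference ships its real prompts in /prompts.
SYSTEM_PROMPT = "You are a penetration tester. Review this HTTP exchange for vulnerabilities."

def build_llm_request(model, http_payload):
    """Assemble a chat-style request body (a sketch of the
    provider-agnostic format, not the extension's exact code)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": http_payload},
        ],
    }

req = build_llm_request("gpt-4o-mini", '{"url": "https://target.example/api/login", ...}')
body = json.dumps(req)
```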
The Scanner tab extends this further by offering targeted analysis. You can paste a URL or OpenAPI specification, and the extension will extract endpoints, security headers (including X-Frame-Options, Content-Security-Policy, Strict-Transport-Security, and others), and server information before sending everything to the LLM. This is particularly powerful for API-first applications where traditional scanners struggle with complex authentication flows or business logic.
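The header triage can be sketched as a simple presence check; the header list and the `audit_headers` helper are assumptions for illustration, not the Scanner tab's actual implementation.

```python
# Common security headers the Scanner tab inspects (list is illustrative).
SECURITY_HEADERS = [
    "X-Frame-Options",
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
]

def audit_headers(response_headers):
    """Split the watched headers into present (with values) and missing."""
    lowered = {k.lower(): v for k, v in response_headers.items()}
    found = {h: lowered[h.lower()] for h in SECURITY_HEADERS if h.lower() in lowered}
    missing = [h for h in SECURITY_HEADERS if h.lower() not in lowered]
    return found, missing

found, missing = audit_headers({
    "Content-Security-Policy": "default-src 'self'",
    "Server": "nginx/1.25",
})
```

The missing-header list becomes useful prompt context, telling the model which defenses the target never sent.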
Findings appear in two places: a color-coded Inference Logger tab showing real-time results (with critical/high/medium/low/informational color coding), and native Burp Scanner issues. The extension parses the LLM’s response for severity keywords and creates corresponding issue entries that appear alongside traditional scanner findings. This means AI-discovered vulnerabilities live in the same workflow as standard security detections.
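Mapping the model's free-text output onto severity levels could be as simple as a keyword scan; this is a minimal sketch of the idea, not burpference's actual parser.

```python
import re

# Ordered most to least severe, matching the tab's color-coded levels.
SEVERITIES = ["critical", "high", "medium", "low", "informational"]

def extract_severity(llm_response):
    """Return the most severe keyword present, defaulting to informational."""
    text = llm_response.lower()
    for level in SEVERITIES:
        if re.search(r"\b" + level + r"\b", text):
            return level
    return "informational"

sev = extract_severity("Severity: HIGH - reflected XSS in the search parameter")
```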
The persistent storage layer writes all findings to /logs/findings.json, creating an audit trail. Each finding includes a timestamp, severity, details, affected URL, and host information. Findings persist across Burp Suite sessions, maintaining continuity across days-long engagements. The extension also tracks API call history in the Inference Logger tab and in timestamped log files, letting you review exactly what was sent to the LLM and what came back.
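An append-style JSON log with those fields might look like this; the schema mirrors the fields listed above, but the exact file layout and the `persist_finding` helper are assumptions.

```python
import json
import os
import tempfile
import time

def persist_finding(log_dir, severity, url, host, details):
    """Append one finding to findings.json in log_dir (assumed layout)."""
    finding = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "severity": severity,
        "url": url,
        "host": host,
        "details": details,
    }
    path = os.path.join(log_dir, "findings.json")
    findings = []
    if os.path.exists(path):
        with open(path) as f:
            findings = json.load(f)
    findings.append(finding)
    with open(path, "w") as f:
        json.dump(findings, f, indent=2)
    return path

# Demo run against a throwaway directory
log_dir = tempfile.mkdtemp()
persist_finding(log_dir, "high", "https://target.example/api/users/2",
                "target.example", "IDOR: other users' records readable")
path = persist_finding(log_dir, "low", "https://target.example/",
                       "target.example", "Missing X-Content-Type-Options header")
with open(path) as f:
    stored = json.load(f)
```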
One key optimization: Burpference only processes responses for URLs within Burp’s scope definition. If you’re testing a specific domain, requests to third-party hosts (analytics, CDNs, ad networks) never reach the LLM API. This prevents token waste on irrelevant traffic and keeps costs predictable. MIME type filtering further excludes content that LLMs cannot meaningfully analyze, though the specific excluded types are documented in the codebase.
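This two-stage filter could be expressed as a single predicate. In the real extension the scope check goes through Burp's API rather than a hardcoded host list, so everything below (host set, MIME exclusions, function name) is an assumed stand-in.

```python
from urllib.parse import urlparse

IN_SCOPE_HOSTS = {"target.example", "api.target.example"}  # assumed scope
SKIP_MIME = {"image", "video", "font", "css"}              # assumed exclusions

def should_analyze(url, mime_type):
    """Forward only traffic that is in scope and textually analyzable."""
    if urlparse(url).hostname not in IN_SCOPE_HOSTS:
        return False  # third-party traffic never reaches the LLM API
    return mime_type.lower() not in SKIP_MIME

third_party = should_analyze("https://cdn.thirdparty.net/lib.js", "script")
in_scope = should_analyze("https://api.target.example/v1/users", "JSON")
```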
Gotcha
The biggest friction point is Jython itself. Burp Suite runs on Java, so Python extensions require the Jython standalone JAR as an interpreter bridge. This means downloading Jython and configuring Burp to use it, though you don’t need a separate Python 2.x runtime installed in your environment. The setup process—downloading Jython, pointing Burp to the JAR via the Extender options, then loading the extension—adds deployment complexity compared to pure Java extensions. If you’ve never configured Jython in Burp before, expect to spend time ensuring Burp has the necessary filesystem permissions.
API key management is awkward by design: Burp Suite extensions cannot reliably read OS environment variables, so your OpenAI or Anthropic keys must be stored in plaintext JSON configuration files. The README explicitly warns about gitignoring these configs and includes a pre-commit hook as a safety net, but it’s an uncomfortable security posture for a tool built for security testing. If you’re using local Ollama models, this isn’t an issue, but cloud API users need careful credential hygiene—one accidental git commit exposes your keys.
Resource consumption can be significant. The README notes that burpference may require higher system resources to run optimally, especially when using local models. If you’re testing a JavaScript-heavy single-page application that makes hundreds of API calls per page load, Burpference will attempt to analyze every in-scope response. Cloud API usage can accumulate costs quickly during busy testing sessions, though scope filtering and MIME type exclusions help mitigate this. Local Ollama deployments avoid API costs but require adequate system resources for usable inference speeds. Rate limiting from cloud providers can also throttle testing throughput if you exceed quotas.
Finally, this is explicitly labeled a research project, not production software. The README describes it as “a research idea of offensive agent capabilities” and “a fun take on Burp Suite.” While it includes comprehensive logging and error tracking, you should treat it as an experimental tool for exploring AI-assisted security testing rather than a replacement for established scanning methodologies.
Verdict
Use Burpference if you’re conducting manual penetration tests on complex web applications where business logic vulnerabilities and subtle authentication flaws hide in plain sight. It excels when you want a second set of eyes reviewing HTTP traffic in real-time, especially for APIs with intricate authorization schemes that signature-based scanners miss entirely. Security researchers exploring offensive AI capabilities will find it a valuable testbed for prompt engineering and model comparison—swap between GPT-4, Claude, and local Llama models to see which spots different vulnerability classes. If you’re already comfortable with Burp Suite customization and have access to adequate system resources or a budget for API calls, the integration offers real-time AI analysis during testing.

Skip it if you’re testing simple CRUD applications where Burp’s native scanner already covers most findings, if your engagement prohibits sending HTTP traffic to third-party APIs (even for in-scope items), or if you need battle-tested stability for time-sensitive assessments. Also skip if you’re new to Burp Suite entirely—learn the core tool first before adding AI complexity. For teams with strict compliance requirements around data handling, the plaintext API key storage and potential for cloud data transmission make this unsuitable unless you commit to fully local Ollama deployments.

Remember this is a research project, so approach it as an experimental enhancement to your testing methodology rather than a primary security tool.