Brainstorm: Teaching LLMs to Fuzz Web Applications Smarter Than Your Wordlist
Hook
What if your web fuzzer could look at /api/users/profile and intelligently guess that /api/orders/profile might exist—without a 50MB wordlist? That’s the promise of combining LLMs with traditional fuzzing.
Context
Web application security testing has relied on the same fundamental technique for decades: throw massive wordlists at a target and see what sticks. Tools like ffuf, gobuster, and dirb excel at this brute-force approach, drawing on wordlist collections like SecLists that contain millions of paths curated from real-world discoveries. But this method has a glaring inefficiency—most paths in these generic wordlists are irrelevant to any given application. You end up fuzzing /wp-admin/ against a Node.js API, or /api/v1/ against a PHP monolith.
The fundamental problem is that wordlists are static and context-blind. They don’t understand that a Rails application uses /users/:id/edit patterns, or that your target’s API seems to namespace everything under /internal/. Penetration testers have always done this mental pattern-matching manually—spotting /api/v2/customers and then trying /api/v2/orders, /api/v2/products, etc. Brainstorm from Invicti Security attempts to automate this intuition by using local LLM models to generate context-aware path suggestions based on what’s already been discovered, creating a feedback loop where successful finds inform the next round of guesses.
Technical Insight
Brainstorm’s architecture is deceptively simple: it’s essentially a wrapper that orchestrates three components in a loop. First, it extracts initial links from the target (either from previous fuzzing results or initial discovery). Second, it feeds these links to a local Ollama LLM instance with a carefully crafted prompt asking it to generate new path suggestions based on observed patterns. Third, it writes these suggestions to a temporary wordlist file and invokes ffuf to test them. Any new discoveries get added to the pool of known links, and the cycle repeats.
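The loop described above can be sketched in a few lines of Python. This is a minimal model, not Brainstorm's actual internals: the function names, stop conditions, and the idea of passing the LLM and ffuf steps in as callables are all illustrative.

```python
# Illustrative sketch of the suggest -> fuzz -> feed-back loop.
# suggest_paths and run_ffuf stand in for the LLM and ffuf steps.

def run_cycles(known_links, suggest_paths, run_ffuf, cycles=3):
    """Repeat: ask the LLM for candidates, fuzz them, keep the hits."""
    known = set(known_links)
    for _ in range(cycles):
        candidates = suggest_paths(sorted(known))          # LLM step
        fresh = [p for p in candidates if p not in known]
        if not fresh:
            break                                          # nothing new to try
        hits = run_ffuf(fresh)                             # ffuf step
        if not hits:
            break                                          # no discoveries this cycle
        known.update(hits)                                 # feed results back in
    return known
```

The key property is visible even in this toy version: the set of known links only grows, and each round's prompt is built from everything found so far.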
The core workflow lives in fuzzer.py and looks something like this:
```shell
python fuzzer.py "ffuf -w ./fuzz.txt -u http://example.com/FUZZ" --cycles 100 --model qwen2.5-coder:latest
```
Under the hood, Brainstorm maintains state in all_links.txt, accumulating every successful path discovery. Each fuzzing cycle, it reads this file, constructs a prompt that includes these known paths, and sends it to Ollama’s API running locally on port 11434. The default model is qwen2.5-coder:latest, chosen specifically because code-focused models understand URL patterns and API structures better than general-purpose conversational models.
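A client for that local endpoint is straightforward to sketch. The endpoint and payload shape below follow Ollama's standard /api/generate API; the prompt wording and the response-parsing heuristic are illustrative, not Brainstorm's actual template.

```python
# Sketch of querying a local Ollama instance for path suggestions.
# Assumes Ollama is running on its default port, 11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_prompt(known_paths):
    """Embed the discovered paths in an instruction to extend the pattern."""
    listing = "\n".join(known_paths)
    return (
        "These URL paths exist on a web application:\n"
        f"{listing}\n"
        "Suggest 20 more paths likely to exist, one per line."
    )

def suggest_paths(known_paths, model="qwen2.5-coder:latest"):
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(known_paths),
        "stream": False,   # one JSON object back instead of a token stream
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        text = json.load(resp)["response"]
    # Keep only lines that look like paths; LLMs often add chatter.
    return [ln.strip() for ln in text.splitlines() if ln.strip().startswith("/")]
```

The `stream: False` flag matters for this style of integration: the caller wants one complete suggestion list to write to a wordlist, not incremental tokens.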
The prompt engineering is critical here. The tool uses templates from prompts/files.txt that guide the LLM to generate paths in the same structural style as what’s been found. If early discoveries reveal /api/v1/users and /api/v1/sessions, the LLM learns to suggest things like /api/v1/tokens, /api/v1/permissions, /api/v2/users—paths that respect both the URL structure and semantic naming conventions of the application.
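The pattern recognition being delegated to the LLM can be seen in miniature: discovered paths share structural prefixes that new guesses should reuse. A toy illustration of that idea (not code from the tool):

```python
# Toy version of the structural inference: find the namespace that
# discovered paths have in common, so new guesses can extend it.
from os.path import commonprefix

def shared_namespace(paths):
    """Longest common directory prefix of the discovered paths."""
    prefix = commonprefix(paths)
    return prefix[: prefix.rfind("/") + 1]   # cut back to a directory boundary

# shared_namespace(["/api/v1/users", "/api/v1/sessions"])  -> "/api/v1/"
# shared_namespace(["/api/v1/users", "/api/v2/users"])     -> "/api/"
```

An LLM goes well beyond this, of course: it also picks up semantic conventions (users/sessions/tokens belong together), which is exactly what a static heuristic like the one above cannot do.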
The integration with ffuf is elegantly minimal. Brainstorm doesn’t reimplement fuzzing logic; it just generates the wordlist and shells out to ffuf with your exact command-line parameters. This means you get all of ffuf’s power while Brainstorm focuses solely on wordlist intelligence. The --status-codes parameter (defaulting to 200,301,302,303,307,308,403,401,500) determines what counts as a “discovery” worth feeding back into the LLM.
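That hand-off amounts to little more than writing a temp file and building a command line. In this sketch, ffuf's `-w`, `-u`, and `-mc` flags are real, but the helper functions are illustrative; Brainstorm itself passes through the user's own ffuf command rather than constructing one.

```python
# Sketch of the "write wordlist, shell out to ffuf" step.
import subprocess
import tempfile

DEFAULT_CODES = "200,301,302,303,307,308,403,401,500"

def build_ffuf_command(wordlist, url, match_codes=DEFAULT_CODES):
    """-w: wordlist, -u: target URL with FUZZ marker, -mc: status codes to match."""
    return ["ffuf", "-w", wordlist, "-u", url, "-mc", match_codes]

def fuzz_candidates(candidates, url="http://example.com/FUZZ"):
    """Write the LLM's suggestions to a temp wordlist and hand off to ffuf."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(candidates))
        wordlist = f.name
    # Requires ffuf on PATH.
    subprocess.run(build_ffuf_command(wordlist, url), check=True)
```

Keeping the fuzzer external like this is a sound design choice: rate limiting, filtering, output formats, and proxying all remain ffuf's problem.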
There’s also a specialized variant, fuzzer_shortname.py, that targets a very specific pentesting niche: discovering Windows 8.3 short filenames. These legacy filenames (like BENCHM~1.PY for benchmark.py) are still exposed on many IIS servers and can leak information about file structure even when directory listing is disabled. The short filename fuzzer uses a different prompting strategy focused on generating valid 8.3 format variations.
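The 8.3 convention itself is easy to model: Windows uppercases the name, strips characters invalid in short names, truncates the base to six characters plus a `~1` suffix, and keeps a three-character extension. The sketch below is a simplified model of that scheme (real NTFS generation also handles collisions, and fuzzer_shortname.py's own strategy may differ):

```python
# Simplified model of Windows 8.3 short-name generation.
import re

def short_name(filename, index=1):
    base, _, ext = filename.rpartition(".")
    if not base:                      # no extension present
        base, ext = filename, ""
    clean = re.sub(r"[^A-Z0-9]", "", base.upper())
    short = f"{clean[:6]}~{index}"    # six chars of the base, then ~1, ~2, ...
    return f"{short}.{ext.upper()[:3]}" if ext else short

# short_name("benchmark.py")  -> "BENCHM~1.PY"
# short_name("index.html")    -> "INDEX~1.HTM"
```

On IIS, tilde-enumeration bugs let a tester confirm these truncated names one character at a time, which is why generating plausible 8.3 candidates is worth a dedicated fuzzer.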
The feedback loop is where things get interesting. Unlike traditional fuzzers that exhaustively test a static list, Brainstorm’s effectiveness compounds over time within a single target. Early cycles might generate generic paths, but as it discovers application-specific patterns—custom API namespaces, unusual file extensions, specific parameter naming—the LLM’s suggestions become increasingly tailored. This is particularly powerful against custom-built applications where generic wordlists perform poorly.
One clever detail: the tool includes a benchmarking framework in benchmark.py that compares different Ollama models’ effectiveness at generating valid paths. The repository includes actual benchmark results showing that code-focused models significantly outperform general conversational models, validating the architectural choice of defaulting to qwen2.5-coder.
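A benchmark along these lines reduces to a hit-rate comparison: of the paths each model suggests, what fraction actually exists on a known test target? The scoring function below is an illustrative reduction, not benchmark.py's actual methodology.

```python
# Illustrative model-comparison metric: fraction of unique suggested
# paths that exist on the benchmark target.
def hit_rate(suggested, actual_paths):
    actual = set(actual_paths)
    unique = set(suggested)
    if not unique:
        return 0.0
    return sum(1 for p in unique if p in actual) / len(unique)

# hit_rate(["/api/users", "/api/none"], ["/api/users"])  -> 0.5
```

Deduplicating before scoring matters: a model that repeats one lucky guess twenty times shouldn't outscore one that produces twenty distinct plausible paths.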
Gotcha
The elephant in the room is resource consumption. Ollama models aren’t lightweight—qwen2.5-coder:latest is a multi-gigabyte download, and running inference requires significant RAM and CPU. If you’re pentesting from a laptop or a resource-constrained VPS, you’ll feel it. The fuzzing speed becomes bottlenecked by LLM inference time rather than network I/O, which inverts the usual performance profile of web fuzzing.
More critically, there’s no guarantee that LLM-generated suggestions provide better coverage than a good static wordlist. If you’re testing a WordPress site, SecLists’ WordPress-specific wordlists represent thousands of researcher-hours of real-world discovery. An LLM, no matter how clever, might miss common plugins or standard admin paths that don’t follow obvious patterns. Brainstorm works best as a complement to traditional fuzzing, not a replacement—run your SecLists scan first, then let Brainstorm explore the patterns you found.
The effectiveness is also heavily prompt-dependent and model-dependent, which introduces non-determinism into your security testing. Run the same fuzzing campaign twice with different LLM models and you’ll get different results. This unpredictability is uncomfortable in a field where reproducibility matters. The benchmark results help, but they’re based on specific test cases that might not generalize to your target.
Finally, there’s a practical limitation around the initial corpus. Brainstorm needs discovered paths to learn from—the README describes it as extracting initial links from the target. If your target’s root path returns nothing but a 404, and your initial wordlist is weak, the LLM has no patterns to work with. This appears to be a genuine cold-start consideration, though specific performance characteristics aren’t detailed in the documentation.
Verdict
Use Brainstorm if you’re pentesting custom-built applications with unique URL structures where traditional wordlists repeatedly fall flat, or if you’re in a time-constrained engagement and want to intelligently explore discovered patterns without running 10-million-line wordlists. It shines when targeting bespoke APIs, internal tools, or legacy systems with unusual naming conventions. Also consider it if you’ve already done traditional fuzzing and want a second pass that thinks differently about the attack surface.

Skip it if you’re doing initial reconnaissance against well-known frameworks (WordPress, Joomla, Django admin, etc.) where curated wordlists provide unbeatable coverage, if your testing environment lacks the resources to run local LLMs comfortably, or if you need deterministic, reproducible results for compliance reporting. Also skip it if you’re fuzzing at massive scale across hundreds of targets—the per-target learning approach doesn’t parallelize well, and you’ll be faster with traditional tools. Brainstorm is a scalpel, not a shotgun; use it when precision matters more than exhaustive coverage.