Brainstorm: Teaching LLMs to Predict Where Developers Hide Files

[ View on GitHub ]

Hook

Traditional web fuzzers throw dictionaries at servers like a brute-force attack. Brainstorm asks an AI: ‘Given what we’ve found so far, where would a developer logically put the admin panel?’

Context

Web application security testing has relied on wordlist-based fuzzing for decades. Tools like ffuf, gobuster, and dirb work by iterating through massive dictionaries of common paths—/admin, /backup, /config—hoping to stumble upon hidden endpoints. This approach has two fundamental problems: it’s noisy (generating thousands of 404s) and it’s context-blind (a React app and a legacy Java servlet get the same generic wordlist).

brainstorm takes a different approach by combining traditional fuzzing with local LLM inference through Ollama. Instead of blindly cycling through pre-built wordlists, it analyzes discovered paths, asks an AI model to predict what other files might logically exist, fuzzes those predictions with ffuf, then feeds successful discoveries back to the LLM for increasingly refined guesses. It’s fuzzing with memory and reasoning—a hybrid that promises to find developer-specific patterns that generic wordlists miss while generating far fewer requests than exhaustive enumeration.

Technical Insight

Brainstorm’s architecture is deceptively simple: it’s an orchestration layer sitting between Ollama (for LLM inference) and ffuf (for actual HTTP fuzzing). The tool doesn’t reinvent fuzzing mechanics—it delegates that to ffuf, one of the fastest directory brute-forcers available. What brainstorm contributes is the iterative learning loop that traditional fuzzers lack.

The workflow operates in cycles. First, it extracts initial links from the target website to establish baseline context. Then it constructs a prompt containing discovered paths and feeds it to the local LLM (defaulting to qwen2.5-coder:latest, a code-focused model). The LLM returns predicted paths based on patterns it recognizes—if it sees /api/users and /api/posts, it might suggest /api/comments or /api/admin. These predictions populate a temporary wordlist that ffuf then validates against the target. Successful discoveries (200s, 403s, 401s, redirects, even 500s—any response indicating the path exists) get appended to all_links.txt and fed back into the next LLM query. Rinse and repeat for up to 50 cycles by default.
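
The loop described above can be sketched in a few lines. This is a minimal illustration of the feedback cycle, not brainstorm's actual implementation: `query_llm` and `run_ffuf` are hypothetical stand-ins for the Ollama and ffuf integrations, and `known` plays the role of all_links.txt.

```python
# Sketch of brainstorm's predict -> fuzz -> feed-back cycle (illustrative only).

def new_predictions(predicted, known):
    """Keep only paths we haven't already confirmed."""
    seen = set(known)
    return [p for p in predicted if p not in seen]

def fuzz_loop(initial_paths, query_llm, run_ffuf, cycles=50):
    known = list(initial_paths)        # stands in for all_links.txt
    for _ in range(cycles):
        predicted = query_llm(known)   # the LLM sees the full discovery history
        candidates = new_predictions(predicted, known)
        if not candidates:
            continue                   # nothing new to try this cycle
        hits = run_ffuf(candidates)    # paths that returned 200/403/etc.
        known.extend(hits)             # discoveries refine the next prompt
    return known
```

The key design point is that every hit enlarges the context for the next query, so predictions sharpen as evidence accumulates.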

Here’s how you’d run a basic fuzzing session:

# Install and pull the default model
ollama pull qwen2.5-coder:latest

# Run fuzzer against a target
python fuzzer.py "ffuf -w ./fuzz.txt -u http://target.com/FUZZ" --cycles 100

# For legacy systems with short filenames (8.3 format)
python fuzzer_shortname.py "ffuf -w ./fuzz.txt -u http://legacy.com/FUZZ" "REPORT~1.DOC"

The status code filtering is particularly thoughtful. By default, brainstorm considers responses with codes 200,301,302,303,307,308,403,401,500 as ‘successful’ discoveries. This isn’t just about finding accessible files—a 403 Forbidden tells you the path exists but requires authorization (valuable intel for privilege escalation testing), while a 500 error might indicate a real endpoint with buggy handling. You can override this with --status-codes if your target has unusual response patterns.
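
The filter itself reduces to set membership. Here is a guess at the logic, assuming --status-codes takes a comma-separated list like the default shown above (the real parsing may differ):

```python
# Assumed status-code filter: a response counts as a discovery
# if its code is in the allow-list; --status-codes overrides the default.

DEFAULT_CODES = {200, 301, 302, 303, 307, 308, 401, 403, 500}

def parse_status_codes(flag_value=None):
    """Parse a --status-codes value like "200,403" into a set of ints."""
    if not flag_value:
        return DEFAULT_CODES
    return {int(code) for code in flag_value.split(",")}

def is_discovery(status, allowed):
    """True if this response should be recorded and fed back to the LLM."""
    return status in allowed
```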

The tool includes two distinct fuzzers because different attack surfaces require different strategies. The main fuzzer.py focuses on general path discovery—finding /admin/debug.php or /api/v2/internal/metrics. The specialized fuzzer_shortname.py targets Windows short filename (8.3 format) discovery, useful for legacy IIS servers or systems with backward compatibility requirements. If you know a file like BENCHMARK.PY exists, the short filename fuzzer will generate variations like BENCHM~1.PY, BENCHM~2.PY, etc., and use the LLM to predict related short filenames.
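
The 8.3 expansion the short filename fuzzer performs can be approximated like this. This is a simplified sketch based on how Windows short names work (six-character stem, ~N suffix, three-character extension), not fuzzer_shortname.py's actual code:

```python
# Sketch of Windows 8.3 short-name expansion (assumed behavior):
# truncate the stem to six characters, append ~1, ~2, ..., keep
# up to three characters of the extension.

def short_name_variants(filename, count=3):
    """Generate BENCHM~1.PY-style candidates from a known long name."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:                      # no extension: whole name is the stem
        stem, ext = filename, ""
    prefix = stem[:6].upper()
    suffix = f".{ext[:3].upper()}" if ext else ""
    return [f"{prefix}~{n}{suffix}" for n in range(1, count + 1)]
```

The LLM's job is then to predict sibling names (REPORT, BACKUP, CONFIG) worth expanding the same way.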

Prompt engineering is critical here. The default prompt lives in prompts/files.txt and you can customize it with --prompt-file. The prompt needs to balance creativity (generate novel paths) with relevance (don’t hallucinate nonsense). The tool includes a benchmarking script (benchmark.py) that tests different Ollama models against the same target, generating an HTML report of discovery rates. According to published benchmark results available in the repository, models vary in effectiveness—code-focused models can outperform general-purpose chat models at predicting developer file structures.
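
A custom prompt passed via --prompt-file might look something like the following. This is purely illustrative; the actual template shipped in prompts/files.txt will differ:

```text
You are assisting a security assessment of a web application.
Below is a list of paths confirmed to exist on the target.
Suggest 20 new paths a developer would plausibly have created,
one per line, with no commentary. Favor the naming conventions
visible in the confirmed paths.
```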

The state management is straightforward but effective. Discovered paths accumulate in all_links.txt (or all_filenames.txt for short name fuzzing), and each LLM query includes the full history of successful discoveries. This means early-cycle generic predictions (based on limited context) evolve into late-cycle targeted guesses (based on the actual site structure). The LLM learns that this particular target uses /api/v1/ conventions or stores backups in /old_site/ because it’s seen evidence of those patterns.
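
Folding the history into each query is the simplest possible state machine. A minimal sketch, assuming the template and history are just concatenated (the real prompt assembly may be more elaborate):

```python
# Assumed prompt assembly: append every confirmed path to the template
# so late cycles query the LLM with full knowledge of the site structure.

def build_prompt(template, discovered):
    """Combine the base prompt with the accumulated discovery history."""
    history = "\n".join(discovered)
    return f"{template}\n\nKnown paths:\n{history}\n\nPredict more paths:"
```

Because `discovered` only ever grows, early generic guesses and late targeted ones come from the same code path; the difference is entirely in the evidence the prompt carries.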

Gotcha

The elephant in the room: this tool requires running an LLM locally. The default qwen2.5-coder:latest model must be pulled through Ollama, which means a sizable download and enough RAM to run inference at reasonable speed. If you’re on a system with limited resources, expect slow cycles. There’s also the Ollama dependency itself—it must be running locally on port 11434, which adds setup friction compared to just running ffuf standalone.

More fundamentally, LLM predictions are probabilistic, not deterministic. The same prompt can yield different results across runs, and there’s no guarantee the AI will predict paths better than a well-curated wordlist like SecLists. If your target uses completely random or obfuscated naming schemes (/x7f3k9/upload.php), the LLM has nothing to pattern-match against and you’d be better off with comprehensive wordlists or manual reconnaissance. The tool shines when there’s logical structure to discover—RESTful APIs, framework conventions, developer naming patterns—but falls flat against chaos. And because it’s iterative, a bad early cycle (the LLM generates nonsense, nothing is discovered) means subsequent cycles start from a weak foundation. You could burn through 50 cycles and discover less than a single pass with a good static wordlist would find.

Verdict

Use brainstorm if you’re pentesting custom web applications where intelligent guessing could outperform brute-force enumeration—think internal corporate apps, bespoke SaaS platforms, or anywhere developers followed logical but non-standard naming conventions. It’s particularly valuable when you want to minimize request volume (LLM-guided guesses are more targeted than 100k-line wordlists) or when you’ve already exhausted standard wordlists and need a different approach. You’ll need local compute resources to run Ollama and comfort with the setup process (Python 3.6+, ffuf, Ollama, and the required Python packages).

Skip it if you’re testing commodity software with well-known structures (WordPress, Joomla—just use standard wordlists), if you lack the hardware to run local LLMs smoothly, or if you’re optimizing for raw speed over intelligence (pure ffuf with a good wordlist is faster than LLM inference). Also skip if your targets use completely random naming schemes where pattern recognition offers no advantage.

Brainstorm is a specialist tool: brilliant in the right context, unnecessary overhead in the wrong one.
