Brainstorm: Teaching LLMs to Predict Hidden Web Endpoints

Hook

Most web fuzzers blindly guess paths from static dictionaries. Brainstorm watches what it finds, feeds that context to an LLM, and asks: "What else would probably exist here?"

Context

Traditional web fuzzing follows a simple formula: throw thousands of common paths at a target and see what sticks. Tools like ffuf and gobuster excel at this brute-force approach, cycling through wordlists containing /admin, /backup, /api/v1, and thousands of other likely endpoints. This works beautifully for standard frameworks and common patterns—until it doesn't.

The problem emerges with custom applications. A bespoke internal tool might use /employee-records/export-csv instead of /admin/users/export. A microservice could expose /v2/analytics/customer-retention that no wordlist would ever guess. Traditional fuzzers sit at a fundamental disadvantage: they can't reason about what they're seeing. They can't notice that discovering /reports/2023/january.pdf probably means /reports/2024/january.pdf exists too. Brainstorm emerged from this gap—the realization that language models, trained on vast corpora of code and documentation, might excel at predicting application-specific paths when given context about what's already been found.

Technical Insight

Brainstorm operates as a feedback loop between reconnaissance and LLM-powered prediction. The architecture centers on three core components: a link extraction phase using standard HTTP requests, an Ollama-based LLM inference engine, and ffuf as the fuzzing backend.

The iterative cycle begins with crawling. Brainstorm fetches the target URL and extracts every path it can find: links from HTML, script sources, API endpoints referenced in JavaScript, anything that reveals application structure. These discovered paths are not training data. The model is never retrained; the paths are supplied as context in the prompt.
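The extraction step can be sketched with the standard library alone. This is a minimal illustration, not Brainstorm's actual crawler: the `PathExtractor` class, the `extract_paths` helper, and the `target.example` host are all hypothetical names chosen for the example, and it only looks at `href`, `src`, and `action` attributes in HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PathExtractor(HTMLParser):
    """Collects same-host paths from href, src, and action attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                absolute = urlparse(urljoin(self.base_url, value))
                same_host = absolute.netloc == self.host
                # Keep only http(s) links that point at the target host.
                if absolute.scheme in ("http", "https") and same_host and absolute.path:
                    self.paths.add(absolute.path)


def extract_paths(base_url, html):
    """Return the sorted, deduplicated paths referenced by an HTML document."""
    parser = PathExtractor(base_url)
    parser.feed(html)
    return sorted(parser.paths)


sample = (
    '<a href="/api/users/profile">Profile</a>'
    '<script src="js/app.js"></script>'
    '<a href="https://example.org/elsewhere">External</a>'
)
print(extract_paths("https://target.example/", sample))
# ['/api/users/profile', '/js/app.js']
```

Relative links are resolved against the base URL and off-site links are dropped, so only paths on the target host end up in the prompt.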

Here's where it gets interesting. Instead of maintaining a static wordlist, Brainstorm constructs a dynamic prompt that feeds discovered paths to a local Ollama model:

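# Simplified example of the LLM prompting logic.
# Assumes the official `ollama` Python package and a locally running Ollama server.
import ollama

discovered_paths = [
    '/api/users/profile',
    '/api/users/settings',
    '/dashboard/analytics',
]

path_list = '\n'.join(discovered_paths)

prompt = f"""Given these discovered web paths:
{path_list}

Predict 50 additional paths that likely exist on this web application.
Consider REST conventions, common patterns, and logical variations.
Respond with only the paths, one per line."""

response = ollama.generate(
    model='llama2',
    prompt=prompt,
    options={'temperature': 0.7},  # sampling settings are passed via `options`
)

# The generated text lives in the 'response' field; drop blank lines and stray whitespace.
predicted_paths = [
    line.strip() for line in response['response'].splitlines() if line.strip()
]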

The LLM's predictions become a custom wordlist for ffuf. If the model sees /api/users/profile and /api/users/settings, it might predict /api/users/preferences, /api/users/notifications, or /api/admin/users—paths that combine observed patterns with learned knowledge about REST API conventions.
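Model output is rarely a clean list, so some filtering is usually needed before it can serve as a wordlist. The sketch below is one way to do that, under stated assumptions: `clean_predictions` and `build_ffuf_command` are hypothetical helpers, the accepted character set is a deliberately conservative choice, and the command is only built, not executed. The `-w`, `-u`, `-o`, and `-of` flags are standard ffuf options.

```python
import re
import tempfile
from pathlib import Path

# Characters accepted in a candidate path (a conservative, illustrative choice).
VALID_PATH = re.compile(r"^/[A-Za-z0-9._~/%-]*$")


def clean_predictions(raw_text):
    """Turn raw model output into a deduplicated, ordered list of paths."""
    cleaned = []
    for line in raw_text.splitlines():
        # Strip list markers such as "1. " or "- " that models often add.
        candidate = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if candidate and not candidate.startswith("/"):
            candidate = "/" + candidate
        if VALID_PATH.match(candidate) and candidate not in cleaned:
            cleaned.append(candidate)
    return cleaned


def build_ffuf_command(target, wordlist_path, output_path):
    """Build an ffuf invocation that writes its results as JSON."""
    return [
        "ffuf",
        "-w", str(wordlist_path),
        "-u", target.rstrip("/") + "/FUZZ",
        "-o", str(output_path),
        "-of", "json",
    ]


raw = (
    "Here are some paths:\n"
    "1. /api/users/preferences\n"
    "- api/admin/users\n"
    "/api/users/preferences\n"
)
paths = clean_predictions(raw)
print(paths)  # ['/api/users/preferences', '/api/admin/users']

workdir = Path(tempfile.mkdtemp())
wordlist = workdir / "predicted.txt"
# ffuf substitutes FUZZ verbatim, so drop the leading slash to avoid "//".
wordlist.write_text("\n".join(p.lstrip("/") for p in paths) + "\n")
command = build_ffuf_command("https://target.example/", wordlist, workdir / "hits.json")
print(" ".join(command))
```

Chatty preamble lines fail the character check and are discarded, which matters because smaller models often ignore the "respond with only the paths" instruction.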

After ffuf completes its fuzzing run with these predicted paths, any newly discovered endpoints feed back into the next iteration. The cycle repeats, with each round benefiting from accumulated context. This creates an adaptive system that narrows its focus based on what actually exists on the target.
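The cycle described above can be sketched as a small loop. To keep it runnable without a model or a target, the LLM call and the ffuf run are passed in as plain functions; `run_feedback_loop`, `fake_predict`, and `fake_fuzz` are hypothetical names, and the stopping rule (quit when a round confirms nothing new) is an assumption rather than documented behavior.

```python
def run_feedback_loop(seed_paths, predict, fuzz, max_rounds=5):
    """Alternate prediction and fuzzing until a round finds nothing new.

    predict(known_paths) -> iterable of candidate paths (stands in for the LLM)
    fuzz(candidates)     -> iterable of candidates that exist (stands in for ffuf)
    """
    known = set(seed_paths)
    tried = set(seed_paths)
    for round_number in range(1, max_rounds + 1):
        suggestions = dict.fromkeys(predict(sorted(known)))  # ordered dedup
        candidates = [p for p in suggestions if p not in tried]
        tried.update(candidates)
        hits = set(fuzz(candidates)) - known
        print(f"round {round_number}: {len(candidates)} candidates, {len(hits)} new hits")
        if not hits:
            break  # The context did not grow, so another round would repeat itself.
        known |= hits
    return sorted(known)


# Stand-ins so the loop can run offline.
existing = {
    "/api/users/profile",
    "/api/users/settings",
    "/api/users/preferences",
    "/api/admin/users",
}


def fake_predict(known):
    suggestions = []
    for path in known:
        parent = path.rsplit("/", 1)[0]
        suggestions += [parent + "/preferences", parent + "/notifications"]
    if "/api/users/preferences" in known:
        suggestions.append("/api/admin/users")
    return suggestions


def fake_fuzz(candidates):
    return [p for p in candidates if p in existing]


result = run_feedback_loop(
    ["/api/users/profile", "/api/users/settings"], fake_predict, fake_fuzz
)
print(result)
```

With these stand-ins the loop confirms one new path in each of the first two rounds and stops in the third, when none of the candidates exist.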

Brainstorm includes a particularly clever variant for discovering Windows 8.3 short filenames. For compatibility with older software, Windows file systems can generate truncated 8.3 aliases for long names (PROGRA~1 for Program Files), and these aliases can expose sensitive files. The tool uses LLM predictions to guess likely long filenames, converts them to short name format, then fuzzes for those specific patterns, a task that would require enormous wordlists with traditional approaches.
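To make the conversion step concrete, here is a simplified approximation of how a long name maps to an 8.3 alias, and how guessed long names can be checked against an observed alias. It is a sketch, not Brainstorm's code: `to_short_name` and `same_alias` are hypothetical helpers, and the function ignores the hash-based naming Windows switches to after several collisions, as well as names that already fit the 8.3 format.

```python
import re


def _clean(part):
    """Drop spaces and periods, replace characters illegal in 8.3 names, uppercase."""
    part = re.sub(r"[ .]", "", part)
    return re.sub(r"[+,;=\[\]]", "_", part).upper()


def to_short_name(long_name, index=1):
    """Approximate the 8.3 alias for long_name (simplified, see lead-in)."""
    base, dot, ext = long_name.rpartition(".")
    if not dot:  # no extension at all
        base, ext = long_name, ""
    short = f"{_clean(base)[:6]}~{index}"
    ext = _clean(ext)[:3]
    return f"{short}.{ext}" if ext else short


def same_alias(candidate, observed_short):
    """True if candidate would shorten to observed_short, ignoring the ~N index."""
    def strip_index(name):
        return re.sub(r"~\d+", "~", name.upper())
    return strip_index(to_short_name(candidate)) == strip_index(observed_short)


print(to_short_name("Program Files"))        # PROGRA~1
print(to_short_name("database_backup.sql"))  # DATABA~1.SQL

# Hypothetical LLM guesses, filtered against an alias seen on the target.
guesses = ["database_backup.sql", "data.sql", "database_dump.sql", "databank.sqlite"]
plausible = [g for g in guesses if same_alias(g, "DATABA~1.SQL")]
print(plausible)
```

Several long names collapse onto the same alias, which is why a model that proposes semantically likely names is useful here: the alias narrows the search, and the predictions fill in the missing characters.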

The benchmarking framework deserves attention too. Brainstorm can evaluate different Ollama models (Llama2, Mistral, CodeLlama) against known targets to measure prediction accuracy. This lets security teams identify which models perform best for their specific use cases:

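# Benchmark different models.
# generate_predictions() and fuzz_with_ffuf() are placeholder helpers, not a published API.
for model in ['llama2', 'mistral', 'codellama']:
    predictions = generate_predictions(model, discovered_paths)
    actual_hits = fuzz_with_ffuf(predictions)

    # Guard against a model that returns no usable paths.
    hit_rate = len(actual_hits) / len(predictions) if predictions else 0.0
    print(f"{model}: {hit_rate:.2%} hit rate")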

The architectural decision to use local Ollama models rather than cloud APIs is significant. It eliminates API costs, prevents sensitive target information from leaving the tester's infrastructure, and allows unlimited iterations without rate limiting. The trade-off is setup complexity—you need several gigabytes of disk space for models and sufficient RAM for inference—but for penetration testing workflows, this privacy-first approach makes sense.

One subtle but powerful feature: Brainstorm maintains state across iterations, avoiding duplicate predictions and focusing the LLM on unexplored areas. This prevents the model from repeatedly suggesting the same variations and ensures each cycle explores new hypothesis space.
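One way to keep that kind of state is a small object that remembers every path already tried and can be saved between runs. This is an illustrative sketch only: the `FuzzState` class, its normalization rule (trim whitespace and surrounding slashes), and the JSON file format are assumptions, not a description of how Brainstorm stores its state.

```python
import json
import tempfile
from pathlib import Path


class FuzzState:
    """Remembers every path already tried so later rounds skip repeats."""

    def __init__(self, tried=None):
        self.tried = set(tried or [])

    @staticmethod
    def normalize(path):
        # Treat "/api/users/" and "api/users" as the same candidate.
        return "/" + path.strip().strip("/")

    def filter_new(self, predictions):
        """Return only unseen paths, recording them as tried."""
        fresh = []
        for path in predictions:
            key = self.normalize(path)
            if key not in self.tried:
                self.tried.add(key)
                fresh.append(key)
        return fresh

    def save(self, file_path):
        Path(file_path).write_text(json.dumps(sorted(self.tried)))

    @classmethod
    def load(cls, file_path):
        return cls(json.loads(Path(file_path).read_text()))


state = FuzzState()
first = state.filter_new(["/api/users/", "api/users", "/api/orders"])
print(first)  # ['/api/users', '/api/orders']

state_file = Path(tempfile.mkdtemp()) / "state.json"
state.save(state_file)

restored = FuzzState.load(state_file)
second = restored.filter_new(["/api/orders", "/api/invoices"])
print(second)  # ['/api/invoices']
```

The list of tried paths could also be included in the next prompt so the model is told what to avoid, at the cost of a longer context.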

Gotcha

The performance characteristics can surprise you. LLM inference isn't free—even local models take seconds to generate predictions, and if you're running on CPU rather than GPU, each iteration might take 30-60 seconds. For quick reconnaissance scans, this overhead makes Brainstorm slower than just running ffuf with SecLists. The tool shines during deep assessment of specific targets, not mass scanning.

Model quality varies dramatically. Smaller models might generate nonsensical paths or miss obvious patterns, while larger models require more RAM and run inference more slowly. You'll need to experiment with different Ollama models to find the sweet spot for your hardware and use case. CodeLlama often performs well for API endpoint prediction due to its code-focused training, but it's not a universal solution. Effectiveness also depends heavily on having enough discovered paths to establish patterns: if the initial crawl finds only two or three endpoints, the LLM has too little context to make useful predictions.

Dependency management adds friction. You need Ollama running as a service, ffuf in your PATH, and enough disk space for multi-gigabyte model files. This isn't a single-binary tool you can drop onto a system and run. For teams already using LLMs locally, integration is straightforward. For others, the setup investment might not justify the returns on smaller engagements.
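A quick preflight check can catch missing pieces before a run starts. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`, and uses its `/api/tags` endpoint, which lists locally installed models. The `check_dependencies` function itself is a hypothetical helper, not part of Brainstorm.

```python
import shutil
import urllib.error
import urllib.request


def check_dependencies(ollama_url="http://localhost:11434", ffuf_binary="ffuf", timeout=2.0):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if shutil.which(ffuf_binary) is None:
        problems.append(f"{ffuf_binary} not found in PATH")
    try:
        # A 200 reply from /api/tags means the Ollama service is up.
        with urllib.request.urlopen(f"{ollama_url}/api/tags", timeout=timeout) as reply:
            if reply.status != 200:
                problems.append(f"Ollama answered with HTTP {reply.status}")
    except (urllib.error.URLError, OSError) as error:
        problems.append(f"Ollama not reachable at {ollama_url}: {error}")
    return problems


issues = check_dependencies()
if issues:
    for issue in issues:
        print("missing:", issue)
else:
    print("ffuf and Ollama both look available")
```

The check does not confirm that a particular model has been pulled; parsing the JSON body of the `/api/tags` reply would be the natural next step for that.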

Verdict

Use if: You're conducting thorough penetration tests against custom web applications where traditional wordlists consistently miss application-specific endpoints, you have local compute resources to run LLMs, and you value intelligent path prediction over raw fuzzing speed. Brainstorm excels when you need depth over breadth: mapping a single complex application in detail rather than quickly scanning hundreds of targets. It's particularly valuable for bug bounty hunters working on unique platforms and red teams who need to discover obscure endpoints that developers assumed were hidden.

Skip if: You're doing rapid reconnaissance across many targets, working with standard frameworks where comprehensive wordlists already exist, or operating in resource-constrained environments without GPU acceleration. For most routine web enumeration, ffuf with a quality wordlist like SecLists remains faster and simpler. The LLM overhead only pays off when traditional approaches hit diminishing returns.
