Mining Robots.txt for Security Intelligence: Inside RobotsDisallowed

Hook

The Robots Exclusion Protocol was designed to help search engines be polite. Instead, it became the internet's largest crowdsourced map of directories that website owners desperately want to hide.

Context

Every web security professional knows the first rule of reconnaissance: look at robots.txt. It's a public file meant to guide search engine crawlers away from duplicate content, staging environments, and resource-intensive pages. But in practice, it often reads like a treasure map with an X marking '/admin', '/backup', and '/private'. The security community has long recognized this paradox—by telling robots what not to index, you're essentially advertising your most sensitive directories to anyone who knows where to look.

Before RobotsDisallowed, security researchers had two options: manually browse individual robots.txt files during assessments, or use generic directory wordlists that weren't informed by real-world data. The RAFT project pioneered automated collection of these files, but it went unmaintained. Daniel Miessler's RobotsDisallowed filled this gap by systematically harvesting robots.txt disallow entries from the top 100,000 websites (originally Alexa, later Majestic Million), parsing them into usable wordlists, and—crucially—curating them. Rather than drowning users in noise, the project identifies the patterns that actually matter: those 500 paths that repeatedly appear across thousands of sites and contain high-signal keywords like 'admin', 'backup', 'login', and 'user'.

Technical Insight

[Figure: System architecture (auto-generated). Components: Majestic Million CSV, Shell Script Orchestrator, robots.txt Fetcher, Parser & Filter, Extraction Pipeline, Normalization & Dedup, Raw Wordlists, Curated Wordlists, Text File Outputs. Labeled steps: extract domains, request with Chromium UA, raw robots.txt files, grep Disallow directives, remove wildcards/queries, sort -u, manual refinement.]

RobotsDisallowed's architecture is deceptively simple, but its simplicity is what makes it effective. The core is a shell script that downloads the Majestic Million CSV, extracts domains, and systematically requests robots.txt from each using a Chromium user-agent rather than identifying as a crawler. The user-agent choice matters: some sites serve a different or sanitized robots.txt to clients that identify as bots.
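The collection step described above might look roughly like the following. This is a sketch, not the project's actual script: the function names, the output layout, the user-agent string, and the assumption that the ranking CSV has a header row with the domain in its third column are all illustrative.

```shell
#!/bin/sh
# Sketch of the harvest step. All names below are hypothetical.

# Assumed browser-like user-agent string (illustrative value only).
UA='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'

# Print the first N domains from a ranking CSV read on stdin.
# Assumes a header row and the domain in column 3.
extract_domains() {
  awk -F, -v n="$1" 'NR > 1 && NR <= n + 1 { print $3 }'
}

# Fetch one site's robots.txt into OUT_DIR/<domain>.txt.
# --fail treats HTTP errors as failures, --location follows redirects,
# --max-time gives up on slow hosts.
fetch_robots() {
  domain=$1
  out_dir=$2
  curl --silent --fail --location --max-time 10 \
       --user-agent "$UA" \
       --output "$out_dir/$domain.txt" \
       "https://$domain/robots.txt"
}

# Example wiring (not executed here because it hits the network):
#   mkdir -p raw
#   extract_domains 100000 < majestic_million.csv |
#     while IFS= read -r d; do fetch_robots "$d" raw || true; done
```

Keeping the fetch in its own function makes it easy to add retries or a delay between requests without touching the domain extraction.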

The parsing logic extracts 'Disallow:' directives and applies intelligent cleanup. It strips query parameters, normalizes paths, removes duplicates, and filters out overly specific entries that won't generalize across targets. Here's the pattern for extracting disallow entries:

# Extract disallow entries from robots.txt
# (the trailing `i` flag on sed's s command is a GNU extension)
grep -i '^Disallow:' robots.txt | \
  sed 's/^Disallow://i' | \
  sed 's/^[[:space:]]*//' | \
  sed 's/[[:space:]]*$//' | \
  grep -v '[*?]' | \
  sort -u

This pipeline does several important things. It matches 'Disallow' directives case-insensitively (RFC 9309 treats directive names as case-insensitive, although the paths themselves are case-sensitive), strips the directive itself, trims whitespace, and removes entries with wildcards or query parameters that won't work in most directory brute-forcing tools. The final 'sort -u' deduplicates the output; run over the concatenated files from every site, it deduplicates across the entire corpus.
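Real robots.txt files tend to be messier than this pipeline assumes: directives may be indented, carry inline comments, use CRLF line endings, or have an empty value. A slightly more tolerant variant could look like the sketch below. The function name `extract_disallow` is hypothetical and this is not the project's own parser.

```shell
#!/bin/sh
# Hypothetical, more tolerant extractor. Reads robots.txt content on stdin
# and prints cleaned, deduplicated Disallow paths.
extract_disallow() {
  # 1. keep Disallow lines, allowing leading whitespace and "Disallow :"
  # 2. drop everything up to and including the first colon
  # 3. drop trailing whitespace (including CR) and any inline "# comment"
  # 4. drop wildcard and query-string entries
  # 5. drop empty values, then sort and deduplicate
  grep -iE '^[[:space:]]*disallow[[:space:]]*:' \
    | sed -E 's/^[^:]*:[[:space:]]*//' \
    | sed -E 's/[[:space:]]*(#.*)?$//' \
    | grep -v '[*?]' \
    | grep -v '^$' \
    | LC_ALL=C sort -u
}

# Example run on an inline sample:
extract_disallow <<'EOF'
User-agent: *
Disallow: /admin/
disallow: /backup/   # old dumps
  Disallow: /admin/
Disallow: /search?q=
Disallow: /*.pdf$
Disallow:
Allow: /public/
EOF
# prints:
# /admin/
# /backup/
```

Setting `LC_ALL=C` on the sort keeps the ordering byte-wise and stable across machines, which matters if the output is later compared with `comm` or `diff`.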

The real intelligence is in the curation phase. The project generates multiple wordlist sizes, but the crown jewel is 'curated.txt'—a manually refined list of approximately 500 entries. This curation applies keyword filtering for terms that indicate authentication boundaries, sensitive data, or administrative functions:

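# Sketch of keyword-based pre-filtering (bash; file names are illustrative).
# curated.txt itself is manually refined, so this only approximates a first pass.
while IFS= read -r path; do
  if [[ $path =~ (admin|user|login|password|backup|private|secret|config|api|internal) ]]; then
    printf '%s\n' "$path"
  fi
done < all_disallowed_paths.txt > curated_candidates.txt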
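The article does not spell out how the different wordlist sizes are produced. One plausible approach, given that the valuable paths are the ones that recur across many sites, is to rank each path by how many sites list it and keep the top N. The helper below is a sketch under that assumption, and `top_paths` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: keep the N paths listed by the most sites.
# stdin: one path per line, repeated once for every site that lists it.
# $1:    number of entries to keep.
top_paths() {
  LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2 \
    | awk -v n="$1" 'NR <= n { sub(/^ *[0-9]+ /, ""); print }'
}

# Example (file names are illustrative):
#   cat per-site/*.paths | top_paths 1000 > top-1000-candidates.txt
```

Each per-site file should already be deduplicated, otherwise a single site that repeats a path inflates its count.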

The output structure is pragmatic: plain text files, one path per line, sorted and deduplicated. No JSON, no XML, no complex formats—just the data in the exact format that directory enumeration tools expect. You can directly feed these into Gobuster, ffuf, or dirsearch:

# Using RobotsDisallowed with modern enumeration tools
gobuster dir -u https://target.com -w curated.txt -t 50

# Or with ffuf for more control
ffuf -u https://target.com/FUZZ -w top-10000.txt -mc 200,301,302,403
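One practical detail when pairing such lists with ffuf: it substitutes each word for FUZZ verbatim, and robots.txt paths begin with a slash, so a URL template ending in /FUZZ can produce a doubled slash (for example https://target.com//admin/). Whether a given list file keeps that leading slash is worth checking before a run. If it does, a small normalization step such as this hypothetical helper avoids the problem:

```shell
#!/bin/sh
# Hypothetical helper: drop one leading "/" from each entry and skip
# entries that become empty (for example a bare "/").
strip_leading_slash() {
  sed 's|^/||' | grep -v '^$'
}

# Example (output file name is illustrative):
#   strip_leading_slash < curated.txt > curated-noslash.txt
#   ffuf -u https://target.com/FUZZ -w curated-noslash.txt -mc 200,301,302,403
```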

The project also maintains an archive structure, storing historical snapshots. This creates an underutilized resource—you can diff these snapshots to see how web application patterns evolve over time. Which directories are being protected more? Which are being exposed? This temporal dimension offers insight into industry-wide security posture changes.
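Comparing two snapshots needs nothing more than `sort` and `comm`. The sketch below assumes each snapshot is a plain one-path-per-line file. The function name and the file names in the usage comment are hypothetical, not the repository's actual archive layout.

```shell
#!/bin/sh
# Hypothetical helper: compare two wordlist snapshots.
# Usage: snapshot_diff added|removed OLD_FILE NEW_FILE
snapshot_diff() {
  mode=$1
  old_sorted=$(mktemp) || return 1
  new_sorted=$(mktemp) || return 1
  # comm requires both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$2" > "$old_sorted"
  LC_ALL=C sort -u "$3" > "$new_sorted"
  case $mode in
    added)   LC_ALL=C comm -13 "$old_sorted" "$new_sorted" ;;  # only in NEW
    removed) LC_ALL=C comm -23 "$old_sorted" "$new_sorted" ;;  # only in OLD
    *)       echo "usage: snapshot_diff added|removed OLD NEW" >&2 ;;
  esac
  rm -f "$old_sorted" "$new_sorted"
}

# Example (file names are illustrative):
#   snapshot_diff added   archive/2017-top1000.txt archive/2019-top1000.txt
#   snapshot_diff removed archive/2017-top1000.txt archive/2019-top1000.txt
```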

One subtle architectural choice: the project stores raw collected data separately from curated output. This allows users to re-process the data with their own filtering criteria without re-harvesting from 100K sites. Want to extract only WordPress-specific paths? The raw data makes that trivial. Need paths related to a specific CMS? Build your own filter against the comprehensive dataset.
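As an illustration of that kind of re-processing, a WordPress-focused list can be carved out of the raw paths with a single `grep`. The pattern below is a rough heuristic chosen for this example, not an exhaustive fingerprint, and the function and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: keep paths that look WordPress-related.
# Matches wp-admin, wp-content, wp-includes and wp-json as whole path
# segments, plus the wp-login.php and xmlrpc.php entry points.
wordpress_paths() {
  grep -E '(^|/)(wp-(admin|content|includes|json)(/|$)|wp-login\.php|xmlrpc\.php)'
}

# Example (file names are illustrative):
#   wordpress_paths < raw-paths.txt > wordpress-paths.txt
```

Swapping the pattern for another CMS's conventions gives a targeted list for that platform without re-harvesting anything.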

Gotcha

The elephant in the repository: the last commit was in March 2019. For a project whose value derives entirely from reflecting current web practices, being five years out of date is a death sentence for accuracy. Modern web applications—especially those built with React, Vue, or other JavaScript frameworks—often don't use robots.txt for directory protection at all. They use proper authentication, route guards, and API gateways. The prevalence of single-page applications, serverless architectures, and API-first designs means the patterns captured in 2019 increasingly miss the mark.

There's also a fundamental selection bias. The high-signal entries in this dataset come largely from sites that lean on robots.txt to hide sensitive paths, a practice that was already questionable in 2019. Sites with mature security programs use actual access controls, not security through obscurity, so the patterns you learn skew toward sites that are doing it wrong. It's like learning to build houses by studying condemned buildings.

Additionally, the project captures only one variant of any dynamic robots.txt file that changes based on the requesting IP, user-agent, or other factors. Some sophisticated sites also plant honeypot entries in robots.txt to detect reconnaissance activity, so using these wordlists blindly could trigger security alerts before you find anything useful.

Verdict

Use if: You're conducting web application penetration testing or bug bounty hunting and need battle-tested wordlists for initial directory enumeration, especially against older or legacy applications. The curated.txt file provides an efficient starting point that covers common patterns without the noise of massive generic wordlists. It's also valuable if you're researching historical security practices or need training data for understanding how organizations think about protecting sensitive paths.

Skip if: You're testing modern JavaScript-heavy applications, need current data that reflects 2024 web architecture patterns, or want an actively maintained solution with community updates. Also skip if you're working in highly regulated environments where outdated reconnaissance data could cause you to miss critical paths that emerged in the last five years. Instead, combine this with actively maintained projects like SecLists, or build fresh datasets using similar techniques against current website rankings.

Think of RobotsDisallowed as a historical artifact with residual utility, not a primary tool.
