Mining Common Crawl: How Web Discovery Wordlists Beat Manual Curation

Hook

Manual wordlists for web enumeration are largely guesswork. What if you could analyze a crawl of billions of public URLs and know which paths account for 99.9% of what actually shows up?

Context

Web application content discovery has traditionally relied on manually curated wordlists—collections of common filenames, directories, and URL paths that security professionals use during reconnaissance and penetration testing. Tools like DirBuster, Gobuster, and ffuf need these lists to systematically probe web servers for hidden content, backup files, admin panels, and forgotten endpoints. The problem? Manual curation is inherently biased, incomplete, and reflects what researchers think should exist rather than what actually does exist in production environments.

The lavalamp-/content-discovery-hit-lists repository takes a fundamentally different approach: instead of guessing which paths matter, it derives them from URLs actually observed in Common Crawl datasets. Common Crawl is a nonprofit that regularly crawls billions of web pages and publishes the results as petabyte-scale datasets of the public internet. The lists in this repository come from processing those archives with Hadoop-based analysis tools to extract the most frequently occurring URL paths, segmented by server type. The result is a set of frequency-ranked wordlists that reflect what deployed web applications actually serve rather than theoretical assumptions.

Technical Insight

[Figure: System architecture (auto-generated). Two groups, "Processing Pipeline" and "Distribution". Stages in order: Common Crawl Data (S3 Storage), Hadoop MapReduce Jobs (LavaHadoopCrawlAnalysis), Server Type Classification (Apache/Nginx/IIS), Path Frequency Analysis, Statistical Aggregation, Coverage Calculation, Hierarchical Repository (dataset/server/coverage), Content Discovery Tools (ffuf/gobuster/dirsearch), Target Web Applications. Edge labels: Raw WARC files, Parse HTTP headers, Extract URL paths, Generate wordlists, 99.9% coverage lists, Enumerate paths.]

The repository's architecture is deceptively simple—it's a collection of text files organized hierarchically by Common Crawl dataset version, server type, and coverage percentage. But the intelligence lies in how these lists were generated. The wordlists derive from processing frameworks like LavaHadoopCrawlAnalysis, which runs distributed MapReduce jobs across Common Crawl data stored in Amazon S3. These jobs parse HTTP response headers to identify server types (Apache, Nginx, IIS), extract URL path components, and compute frequency distributions across millions of domains.

The coverage-based naming convention is particularly clever. A file labeled '99.9%' doesn't mean it contains 99.9% of all possible paths—that would be impossibly large. Instead, it means these paths collectively account for 99.9% of observed occurrences in the dataset. This statistical approach mirrors how content delivery networks optimize cache hit ratios: you don't need to cache every possible resource, just the ones that handle the vast majority of requests. For penetration testers, this means you can achieve high enumeration coverage with relatively compact wordlists.
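One way to state the coverage idea precisely is sketched here. The notation is introduced for illustration and does not come from the repository: assume the N distinct paths for a server type are ranked by descending occurrence count, c_1 >= c_2 >= ... >= c_N, and let p be the coverage level (0.999 for a '99.9%' file). The list then holds the first k(p) paths:

```latex
k(p) \;=\; \min\Bigl\{\, k \;:\; \sum_{i=1}^{k} c_i \;\ge\; p \sum_{i=1}^{N} c_i \Bigr\}
```

With made-up counts of 700, 200, 90, and 10 (total 1000), p = 0.85 gives k = 2, because 700 + 200 = 900 >= 850. With p = 0.999, all four paths are needed, because the first three sum to only 990 < 999. The long tail is why the last fraction of a percent of coverage can add many more entries than the first 85%.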

Here's how you'd typically use these lists with a modern content discovery tool like ffuf:

# Using a high-coverage Apache-specific list
ffuf -w apache_generic/cc-2014-23.99.9.txt \
     -u https://target.example.com/FUZZ \
     -mc 200,204,301,302,307,401,403 \
     -t 50 \
     -H "User-Agent: Mozilla/5.0"

# Filtering out false positives by size
ffuf -w apache_generic/cc-2014-23.99.9.txt \
     -u https://target.example.com/FUZZ \
     -mc all \
     -fs 1234 \
     -t 50

The server-type segmentation provides targeting precision you don't get with generic wordlists. Apache servers historically handle URL rewriting differently than Nginx, expose different default files, and have distinct module-based directory structures. When fingerprinting reveals an Apache server, using the apache_generic list improves hit rates while reducing noise from paths that would never exist on that platform.
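As a rough sketch of the selection step, the mapping from a fingerprinted Server header to a list directory might look like the following. Only apache_generic appears in the examples above. The names nginx_generic, iis_generic, and generic, along with the class and method names, are hypothetical placeholders, so check them against the repository's actual layout.

```java
import java.util.Locale;

public class ServerListPicker {

    /**
     * Maps a raw Server header value to a wordlist directory name.
     * Directory names other than "apache_generic" are assumed for illustration.
     */
    public static String pickListDirectory(String serverHeader) {
        if (serverHeader == null) {
            return "generic";
        }
        // Take the product token before the version or comment,
        // e.g. "Apache/2.4.41 (Ubuntu)" -> "apache".
        String product = serverHeader.trim()
                .toLowerCase(Locale.ROOT)
                .split("[/\\s]", 2)[0];
        switch (product) {
            case "apache":
                return "apache_generic";
            case "nginx":
                return "nginx_generic";
            case "microsoft-iis":
                return "iis_generic";
            default:
                // CDN or proxy banners reveal nothing about the origin server.
                return "generic";
        }
    }

    public static void main(String[] args) {
        System.out.println(pickListDirectory("Apache/2.4.41 (Ubuntu)")); // apache_generic
        System.out.println(pickListDirectory("Apache-Coyote/1.1"));      // generic
        System.out.println(pickListDirectory("cloudflare"));             // generic
    }
}
```

Matching on the exact product token rather than a substring keeps a banner like Apache-Coyote (Tomcat's connector) from being treated as Apache httpd.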

The statistical foundation also means these lists naturally prioritize paths by real-world importance. Manually curated wordlists often place equal weight on /admin and /supersecretbackdoor, but frequency analysis reveals that certain conventional paths appear orders of magnitude more often. The Hadoop processing pipeline essentially computes a popularity ranking across the crawled web: if thousands of Apache servers expose a particular path, it's probably worth checking first.

One underappreciated aspect is how this methodology captures framework-specific patterns without explicit framework detection. Ruby on Rails applications, for instance, have characteristic asset paths like /assets/application-[hash].js and route conventions. When Common Crawl encounters thousands of Rails apps across Apache servers, those patterns naturally bubble up in the frequency distribution. The same applies to WordPress installations, Laravel applications, and any other framework with sufficient deployment scale. You get framework fingerprinting as an emergent property of the statistical analysis rather than through manual enumeration of framework-specific paths.

The processing pipeline that generated these lists would look something like this in pseudo-Hadoop code:

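// Sketch only: WARCRecord, extractServerHeader, and extractPath are placeholders.
// Each public class would live in its own source file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (serverType + path, 1) for each Common Crawl WARC record
public class PathExtractionMapper extends Mapper<Text, WARCRecord, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(Text key, WARCRecord record, Context context)
            throws IOException, InterruptedException {
        String serverType = extractServerHeader(record.getHttpHeaders());
        String path = extractPath(record.getTargetURI());

        if (serverType != null && path != null) {
            // Emit composite key: serverType + tab + path
            context.write(new Text(serverType + "\t" + path), ONE);
        }
    }
}

// Reducer: sum occurrences per (serverType, path); coverage is computed later
public class PathFrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}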

The real work happens in post-processing, where results are sorted by frequency, grouped by server type, and truncated at various coverage thresholds to create the final wordlist files. This separation of data collection from analysis allows for regenerating lists with different statistical parameters without re-processing the entire Common Crawl dataset.
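The truncation step described above can be sketched in plain Java for a single server type. This is a minimal illustration under assumptions, not the repository's actual post-processing: the class name, method name, and sample counts are invented, and the input is assumed to be a map from path to total count, as a reducer like the one above would produce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class CoverageTruncation {

    /**
     * Returns the most frequent paths whose counts together reach the
     * requested share of all observed occurrences (0.999 for "99.9%").
     */
    public static List<String> truncate(Map<String, Long> pathCounts, double coverage) {
        // Total occurrences across every path for this server type.
        long total = 0;
        for (long count : pathCounts.values()) {
            total += count;
        }

        // Sort by count, highest first; break ties by path so output is stable.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(pathCounts.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));

        // Take paths until the running sum reaches the coverage threshold.
        List<String> wordlist = new ArrayList<>();
        long running = 0;
        for (Map.Entry<String, Long> entry : sorted) {
            if (total == 0 || running >= coverage * total) {
                break;
            }
            wordlist.add(entry.getKey());
            running += entry.getValue();
        }
        return wordlist;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one server type.
        Map<String, Long> counts = Map.of(
                "/index.php", 700L,
                "/robots.txt", 200L,
                "/admin", 90L,
                "/old-backup.zip", 10L);

        System.out.println(truncate(counts, 0.85));
        // [/index.php, /robots.txt]
        System.out.println(truncate(counts, 0.999));
        // [/index.php, /robots.txt, /admin, /old-backup.zip]
    }
}
```

Because the counts are kept separately from the cutoff, producing a 95% list instead of a 99.9% list is just another call with a different threshold.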

Gotcha

The elephant in the room is dataset age. The repository's examples reference Common Crawl dataset 'cc-2014-23', whose name points to a 2014 crawl. Web development has changed dramatically since then: single-page applications with client-side routing became mainstream, GraphQL emerged as an API paradigm, JAMstack architectures proliferated, and containerized microservices changed deployment patterns. Paths that were common in 2014 might be irrelevant now, while paths introduced by newer application patterns won't appear in these lists at all. If you're testing contemporary web applications built with Next.js, SvelteKit, or modern Python frameworks like FastAPI, expect these lists to miss many of the paths those applications expose.

The lack of documentation is also problematic for practical use. There's no specification for how paths are normalized—are query parameters stripped? How are URL-encoded characters handled? Are fragments included? Without this information, you might get unexpected misses during enumeration. The repository also doesn't provide guidance on which coverage percentage to use for different scenarios. Is 99.9% overkill for a quick assessment? Is 95% sufficient for comprehensive testing? These decisions significantly impact scan time and resource consumption, but you're left to develop intuition through trial and error.
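To see why the normalization questions matter, here is a small sketch using the standard java.net.URI class. It shows two plausible ways a pipeline could turn the same crawled URL into a wordlist entry. Which one the repository's pipeline used, if either, is not documented; the class and method names are illustrative only.

```java
import java.net.URI;

public class PathNormalization {

    /** Path as it appeared in the URL (percent-escapes kept), without query or fragment. */
    public static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    /** Same path with percent-escapes decoded. */
    public static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "https://example.com/docs/a%20b.php?id=7#top";
        System.out.println(rawPath(url));     // /docs/a%20b.php
        System.out.println(decodedPath(url)); // /docs/a b.php
    }
}
```

The two results are different wordlist entries for the same resource. A list built from decoded paths would need re-encoding before being sent in a request, so it is worth spot-checking a few entries that contain spaces or % characters before relying on a list.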

Another limitation is the server-type granularity. Modern deployments often use reverse proxies (Nginx in front of application servers), CDNs that mask origin infrastructure, and containerization that abstracts server identity. The Server header that these lists rely on for categorization is increasingly unreliable or deliberately obscured. When your target returns a generic 'Server: cloudflare' header, which list do you use? The server-specific optimization becomes moot, and you're back to using generic lists anyway.

Verdict

Use if: You're conducting security assessments on legacy systems or established organizations where infrastructure likely predates 2015, you want to supplement existing wordlists with empirically derived paths that reflect actual deployment patterns, or you're researching web application archaeology and need frequency data about historical URL conventions. The data-driven methodology is more principled than manual curation, and for systems that haven't undergone major modernization, these lists still offer value.

Skip if: You're testing modern web applications built with contemporary frameworks, you need actively maintained wordlists that incorporate recent vulnerability disclosures and attack patterns, or you want comprehensive documentation and usage examples. The dated source data is a dealbreaker for current penetration testing work. Instead, look at assetnote/commonspeak2-wordlists for a similar data-driven approach with more recent data (its lists are generated from public Google BigQuery datasets rather than Common Crawl), or use SecLists for actively maintained, community-driven wordlists with broader coverage. This repository is best viewed as a historical artifact: it demonstrates an excellent methodology that deserves to be rerun on modern datasets, but it is not a production-ready resource for contemporary security work.