Building Server-Specific Content Discovery Wordlists from Hadoop Crawl Data

Hook

The most effective content discovery wordlists aren't hand-curated—they're extracted from billions of real URLs, statistically filtered, and grouped by the web servers they actually work against.

Context

Security researchers and penetration testers have long relied on generic wordlists for content discovery—collections of common paths like /admin, /backup, and /api that might exist on target web servers. Tools like dirb, gobuster, and ffuf hammer these lists against servers, hoping to find hidden endpoints. But here's the problem: Apache servers have different common paths than nginx servers. WordPress installations expose different endpoints than Django applications. A one-size-fits-all wordlist wastes time testing paths that statistically don't exist on your target.

The lava-hadoop-processing tool solves this by creating server-specific hit lists from actual crawl data. Instead of guessing what paths might exist, it analyzes results from large-scale Hadoop-based web crawls (generated by the companion LavaHadoopCrawlAnalysis project), identifies which URL paths appear most frequently on which server types, and generates targeted wordlists at various percentile thresholds. This approach transforms raw distributed processing output into actionable security intelligence, bridging the gap between big data infrastructure and practical reconnaissance workflows.

Technical Insight


The architecture operates in two distinct phases that separate aggregation concerns from analysis logic. The first phase consumes the part-* files that Hadoop MapReduce jobs write to S3—those familiar numbered output files like part-00000, part-00001 that contain reducer results. The tool expects these files to have already been synced locally, typically via aws s3 sync commands, because processing happens entirely in-memory on a single machine rather than in the distributed environment.
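Before loading anything, it can help to confirm what the sync actually brought down. The following is a minimal sketch, not part of the tool: the helper names `list_part_files` and `total_size_bytes` are hypothetical, and it assumes the part files sit directly in one local directory.

```python
from pathlib import Path


def list_part_files(part_files_dir):
    """Return the part-* files in a directory, sorted by name (reducer order)."""
    directory = Path(part_files_dir)
    return sorted(p for p in directory.glob("part-*") if p.is_file())


def total_size_bytes(part_files):
    """Sum the on-disk sizes as a rough indicator of how much data will be read."""
    return sum(p.stat().st_size for p in part_files)
```

Comparing the total against available RAM gives an early warning before the in-memory aggregation starts; marker files such as `_SUCCESS` are ignored because they do not match the `part-*` pattern.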

The aggregation phase reads through these part files, parsing lines that contain URL path segments paired with occurrence counts and server type identifiers. Here's what the core filtering logic looks like:

import glob
from collections import defaultdict

def aggregate_paths(part_files_dir, min_occurrence_threshold):
    # server_type -> path -> summed occurrence count
    server_paths = defaultdict(lambda: defaultdict(int))

    for part_file in glob.glob(f"{part_files_dir}/part-*"):
        with open(part_file, 'r') as f:
            for line in f:
                # Format: /path/segment TAB count TAB server_type
                path, count, server = line.strip().split('\t')
                count = int(count)

                # Drop long-tail rows before they reach the in-memory maps
                if count >= min_occurrence_threshold:
                    server_paths[server][path] += count

    return server_paths

This threshold filtering is crucial: if a path appeared only 3 times across billions of URLs, it is likely noise, a typo, or something specific to a single site rather than a pattern that generalizes to a server type. By setting a minimum occurrence threshold (often 100+ for large crawls), the tool eliminates long-tail garbage that would bloat wordlists without improving discovery rates.
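One way to pick a threshold is to count how many paths survive at several candidate values. This is a small sketch rather than anything the tool ships: `surviving_paths_by_threshold` is a hypothetical helper, and the sample counts are invented purely for illustration.

```python
def surviving_paths_by_threshold(path_counts, thresholds):
    """Map each threshold to the number of paths whose count meets it."""
    return {
        t: sum(1 for count in path_counts.values() if count >= t)
        for t in thresholds
    }


# Invented toy counts: two common paths, one borderline, two long-tail entries.
sample_counts = {
    "/wp-login.php": 18000,
    "/admin": 5200,
    "/backup.zip": 140,
    "/adnim": 3,
    "/x9f2": 1,
}

print(surviving_paths_by_threshold(sample_counts, [1, 100, 1000]))
# {1: 5, 100: 3, 1000: 2}
```

If the surviving count barely changes between two thresholds, the lower one is mostly admitting noise and the higher one is the safer choice.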

The second phase generates percentile-based hit lists for each server type. This is where the statistical approach becomes powerful. Rather than shipping every path that met the threshold, it ranks paths by occurrence count and writes several lists of increasing coverage: the 50th percentile list holds only the most common half of the ranked paths, and the 75th, 90th, 95th, and 99th percentile lists each reach further into the tail. The reasoning: against a high-value target, the comprehensive 99th percentile list may justify its longer scan time, while for broad reconnaissance across many targets a shorter list such as the 75th percentile cut trades some coverage for speed.

def generate_percentile_lists(server_paths, percentiles=(50, 75, 90, 95, 99)):
    hit_lists = {}

    for server_type, path_counts in server_paths.items():
        # Rank paths from most to least frequently seen
        sorted_paths = sorted(path_counts.items(),
                              key=lambda x: x[1],
                              reverse=True)

        for pct in percentiles:
            # Keep the top pct% of ranked paths (a cut by rank, not by cumulative count)
            cutoff_idx = int(len(sorted_paths) * (pct / 100.0))
            pct_paths = [path for path, _ in sorted_paths[:cutoff_idx]]

            output_file = f"{server_type}_p{pct}.txt"
            hit_lists[output_file] = pct_paths

    return hit_lists
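The function above cuts by rank: the p50 list is the top half of the distinct paths, however skewed the counts are. An alternative reading of "percentile" is cumulative coverage, where the list stops once it accounts for a given share of all occurrences. The sketch below contrasts the two on toy data; `top_by_cumulative_coverage` is a hypothetical variant and not something the source attributes to the tool.

```python
def top_fraction_by_rank(path_counts, pct):
    """Keep the top pct% of distinct paths, ranked by count (the cut shown above)."""
    ranked = sorted(path_counts.items(), key=lambda kv: kv[1], reverse=True)
    cutoff_idx = int(len(ranked) * (pct / 100.0))
    return [path for path, _ in ranked[:cutoff_idx]]


def top_by_cumulative_coverage(path_counts, pct):
    """Keep the fewest top-ranked paths whose counts reach pct% of all occurrences."""
    ranked = sorted(path_counts.items(), key=lambda kv: kv[1], reverse=True)
    target = sum(path_counts.values()) * (pct / 100.0)
    selected, running = [], 0
    for path, count in ranked:
        if running >= target:
            break
        selected.append(path)
        running += count
    return selected


# Invented, heavily skewed toy counts (they sum to 100 for easy arithmetic).
toy = {"/a": 70, "/b": 20, "/c": 6, "/d": 4}

print(top_fraction_by_rank(toy, 50))        # ['/a', '/b']
print(top_by_cumulative_coverage(toy, 50))  # ['/a']
```

On skewed data the coverage-based list is much shorter at the same percentage, which matters when scan time is the constraint.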

The CLI uses argparse to expose these parameters, allowing researchers to tune thresholds and percentiles based on their crawl size and reconnaissance needs. The tool writes each server-type/percentile combination to a separate text file, ready for direct consumption by tools like ffuf or gobuster.
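A command-line wrapper along these lines might look like the sketch below. The flag names (`--input-dir`, `--min-count`, `--percentiles`, `--output-dir`) and both helper functions are assumptions made for illustration; the tool's real options may differ.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build an argparse parser with hypothetical flags for the two phases."""
    parser = argparse.ArgumentParser(
        description="Turn Hadoop part files into per-server hit lists (sketch)."
    )
    parser.add_argument("--input-dir", required=True,
                        help="local directory holding the synced part-* files")
    parser.add_argument("--min-count", type=int, default=100,
                        help="minimum occurrence count for a path to be kept")
    parser.add_argument("--percentiles", type=int, nargs="+",
                        default=[50, 75, 90, 95, 99],
                        help="percentile cuts to generate")
    parser.add_argument("--output-dir", default="hit_lists",
                        help="directory that receives one text file per list")
    return parser


def write_hit_lists(hit_lists, output_dir):
    """Write each {filename: [paths]} entry as one path per line."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, paths in hit_lists.items():
        target = out / filename
        target.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
        written.append(target)
    return written
```

One path per line is the format that ffuf and gobuster accept for their wordlist option (`-w` in both), so the output files need no further conversion.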

What makes this architecture interesting is its deliberate choice to be the "glue" between distributed and local processing. The Hadoop jobs handle the computational heavy lifting—parsing terabytes of HTTP responses, extracting paths, counting occurrences across billions of URLs—while this tool handles the analytical refinement that doesn't benefit from distribution. Processing 500MB of Hadoop output files locally is trivial on modern hardware, and keeping this phase local simplifies the toolchain considerably. You don't need to maintain Hadoop cluster access just to experiment with different threshold values or generate additional percentile cuts.

Gotcha

The elephant in the room: this tool is essentially useless without its companion LavaHadoopCrawlAnalysis project, and that project requires a fully operational AWS EMR cluster running Hadoop MapReduce jobs against crawl data. If you're a security researcher hoping to generate custom wordlists, you'll first need to set up distributed crawling infrastructure, obtain or generate large-scale web crawl data (think Common Crawl scale), and run MapReduce jobs that can cost hundreds of dollars in AWS compute time. Only then can you use lava-hadoop-processing. For most practitioners, this barrier to entry is prohibitively high.

The repository shows signs of abandonment—only 2 stars, no recent commits, and dependencies on older AWS EMR workflows that may not align with current Hadoop ecosystem conventions. The in-memory processing approach, while pragmatic for moderate datasets, becomes a limitation when crawl results exceed available RAM. If your Hadoop analysis produced 50GB of part files, you're going to have a bad time loading everything into Python dictionaries. A streaming approach or database-backed intermediate storage would handle larger datasets gracefully, but the current implementation prioritizes simplicity over scalability. There's also no error handling for malformed input lines or validation that the expected TSV format is actually present, so corrupted Hadoop output will cause silent failures or cryptic exceptions.
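Both limitations can be addressed without much machinery. The sketch below is one possible approach, not the tool's implementation: it validates each line before use and keeps running totals in SQLite instead of Python dictionaries. The function names, the table layout, and the choice of SQLite are all assumptions.

```python
import sqlite3


def parse_line(line):
    """Return (path, count, server) for a well-formed TSV line, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    path, raw_count, server = fields
    if not path or not server:
        return None
    try:
        count = int(raw_count)
    except ValueError:
        return None
    return (path, count, server) if count >= 0 else None


def aggregate_to_sqlite(lines, db_path, min_count=100):
    """Stream lines into a SQLite table; return (connection, skipped_line_count)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS path_counts ("
        "server TEXT NOT NULL, path TEXT NOT NULL, total INTEGER NOT NULL, "
        "PRIMARY KEY (server, path))"
    )
    skipped = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            skipped += 1  # count malformed lines instead of failing on them
            continue
        path, count, server = parsed
        if count < min_count:
            continue
        # Create the row if missing, then add this line's count to its total.
        conn.execute("INSERT OR IGNORE INTO path_counts VALUES (?, ?, 0)",
                     (server, path))
        conn.execute(
            "UPDATE path_counts SET total = total + ? WHERE server = ? AND path = ?",
            (count, server, path),
        )
    conn.commit()
    return conn, skipped
```

Because `lines` can be any iterable, an open file handle works directly, so only one line is held in memory at a time. Reporting the skipped count makes corrupted Hadoop output visible rather than silent.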

Verdict

Use if: You're conducting large-scale security research with existing Hadoop crawl infrastructure and need to convert distributed analysis results into practical reconnaissance wordlists. The server-type grouping and percentile-based filtering provide genuinely useful refinement over raw crawl data, and if you already have the prerequisite pipeline running, this tool adds valuable post-processing capabilities.

Skip if: You don't have Hadoop crawling infrastructure (true of most security practitioners), need maintained and supported tooling, or want ready-to-use wordlists. In those cases, just download the curated lists from SecLists or the related content-discovery-hit-lists repository. The tight coupling to a specific, complex workflow makes this strictly a niche tool for researchers already invested in distributed web analysis ecosystems.
