Mining CommonCrawl's Petabyte Archive for Forgotten URLs with cc.py

Hook

CommonCrawl has indexed over 250 billion web pages since 2008, creating a permanent record of URLs that developers thought they'd deleted. A single Python script can expose all of them for any domain in minutes.

Context

Security researchers and penetration testers face a fundamental reconnaissance challenge: discovering the complete attack surface of a target domain. Modern web applications constantly evolve—endpoints get deprecated, admin panels move, staging environments leak into production, and sensitive paths get removed from sitemaps but remain accessible. Traditional crawlers only see what's currently linked, and manual enumeration is impossibly time-consuming for mature applications with years of technical debt.

CommonCrawl.org addresses this by maintaining a free, public archive of web crawl data dating back to 2008. The corpus now spans petabytes, and a new crawl is added roughly every month. But querying this massive distributed dataset efficiently requires understanding its index structure and API patterns. cc.py emerged as a focused command-line tool that abstracts away CommonCrawl's complexity, allowing security professionals to extract every historical URL for a target domain with a single command. Unlike full-featured web reconnaissance suites, it does one thing well: translating domain names into comprehensive URL inventories by mining this archival goldmine.

Technical Insight

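cc.py's architecture centers on efficiently querying CommonCrawl's URL index through the CDX API served at index.commoncrawl.org. That index maps each captured URL to its location inside a WARC (Web ARChive) file. (CommonCrawl also publishes a separate columnar index, but the request pattern shown here targets the CDX endpoint.) When you execute the tool, it follows a three-phase workflow: index discovery, URL extraction, and output aggregation.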

The tool first hits CommonCrawl's index server at index.commoncrawl.org to fetch available index collections. Each collection represents a monthly crawl (e.g., CC-MAIN-2023-14 for the March/April 2023 crawl). Here's how the core request pattern works:

# Simplified version of cc.py's index query logic
import json
import requests

def fetch_urls(domain, index_name):
    index_url = f"https://index.commoncrawl.org/{index_name}-index"

    # Query the CDX server API with domain filter
    params = {
        'url': f'*.{domain}',  # Wildcard to catch all subdomains
        'output': 'json',
        'filter': '=status:200',  # Only successful responses (the index field is "status")
    }

    # requests URL-encodes the query parameters, so no manual quoting is needed
    response = requests.get(index_url, params=params, stream=True, timeout=60)

    # Process streaming JSON records, one JSON object per line
    for line in response.iter_lines():
        if line:
            record = json.loads(line)
            yield record['url']

The critical architectural decision here is streaming: CommonCrawl indexes can contain millions of records for popular domains. Rather than loading everything into memory, cc.py processes the NDJSON (newline-delimited JSON) response line-by-line, yielding URLs as it goes. This prevents memory exhaustion when querying domains like github.com that might return 50+ million historical URLs.

The version 2.0 rewrite introduced multithreading to parallelize index queries across different CommonCrawl collections. When you don't specify a particular year or index, the tool spawns worker threads to simultaneously query multiple indexes:

from concurrent.futures import ThreadPoolExecutor, as_completed
import tempfile

def fetch_to_tempfile(domain, index_name):
    # fetch_urls is a generator, so it has to be consumed inside the worker
    # thread; otherwise the network I/O would run later in the main thread
    with tempfile.NamedTemporaryFile(mode='w', delete=False) as temp_file:
        # Write results to temp file to avoid memory bloat
        for url in fetch_urls(domain, index_name):
            temp_file.write(f"{url}\n")
        return temp_file.name

def parallel_fetch(domain, indexes, max_workers=10):
    temp_files = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit each index query as a separate thread
        futures = [
            executor.submit(fetch_to_tempfile, domain, idx)
            for idx in indexes
        ]

        # Collect the temp file path for each index as it finishes
        for future in as_completed(futures):
            temp_files.append(future.result())

    return temp_files

This threading model achieves the advertised 65% performance improvement by overlapping network I/O across multiple indexes. The temporary file strategy is particularly clever: rather than holding millions of URLs in RAM, each thread writes its results to disk, then the main process concatenates and deduplicates these files during final output.
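
The final merge step could look roughly like the sketch below. It is not taken from cc.py: `merge_temp_files` is a hypothetical name, and the approach assumes the set of unique URLs fits in memory, which is a simplification (an external sort would be safer for very large result sets).

```python
import os

def merge_temp_files(temp_paths, output_path):
    """Concatenate per-index temp files into one file, skipping repeated URLs.

    Hypothetical helper for illustration; assumes unique URLs fit in memory.
    Returns the number of unique URLs written.
    """
    seen = set()
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in temp_paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    url = line.strip()
                    # Skip blank lines and URLs already emitted by another index
                    if url and url not in seen:
                        seen.add(url)
                        out.write(url + "\n")
                        written += 1
            # The temp files were created with delete=False, so clean up here
            os.remove(path)
    return written
```

Calling `merge_temp_files(parallel_fetch(domain, indexes), "urls.txt")` would then leave a single deduplicated file, preserving first-seen order.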

Filtering granularity is another key feature. You can constrain queries by year (-y 2020) to focus on specific timeframes, or target exact indexes (-i CC-MAIN-2020-10) for surgical precision. This matters when investigating specific security incidents—if a vulnerability was introduced in Q2 2019, you can limit your URL enumeration to that window rather than processing a decade of crawl data.
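
To make the two filtering modes concrete, here is a minimal sketch of how a list of collection IDs could be narrowed by year or by exact ID. The function name `select_indexes` and its parameters are assumptions for illustration, not cc.py's internals, and the sketch assumes IDs follow the `CC-MAIN-YYYY-WW` pattern seen in the examples above.

```python
import re

# Assumed ID shape, e.g. "CC-MAIN-2020-10": a four-digit year plus a two-digit suffix
INDEX_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")

def select_indexes(index_ids, year=None, exact=None):
    """Return the subset of index IDs matching an exact ID or a crawl year.

    Hypothetical helper: `exact` mirrors the -i option, `year` mirrors -y.
    With neither set, every index is returned.
    """
    if exact is not None:
        return [index_id for index_id in index_ids if index_id == exact]
    if year is None:
        return list(index_ids)
    selected = []
    for index_id in index_ids:
        match = INDEX_ID.match(index_id)
        # IDs that do not fit the assumed pattern are skipped, not guessed at
        if match and int(match.group(1)) == year:
            selected.append(index_id)
    return selected
```

Narrowing the list before any network request is made keeps the number of index queries proportional to the window you care about.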

The tool's simplicity is both strength and limitation. It doesn't attempt sophisticated deduplication beyond basic set operations, nor does it filter by file extension or path patterns (though piping to grep handles this). The 'direct-grep' feature mentioned in the roadmap would add regex filtering during the fetch phase, reducing bandwidth by filtering server-side, but remains unimplemented. For now, the Unix philosophy prevails: cc.py extracts URLs, and you compose it with other tools for transformation.
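
If you would rather stay in Python than pipe to grep, the same post-filtering can be sketched in a few lines. `filter_urls` and its parameters are hypothetical names for illustration; the filter runs client-side on URLs that have already been downloaded, so it does not save any bandwidth.

```python
import re
from urllib.parse import urlsplit

def filter_urls(urls, path_pattern=None, extensions=None):
    """Yield only URLs whose path matches a regex and/or ends with an extension.

    Hypothetical helper for illustration; both filters are optional and are
    applied to the path component only, so query strings are ignored.
    """
    regex = re.compile(path_pattern) if path_pattern else None
    exts = tuple(ext.lower() for ext in extensions) if extensions else None
    for url in urls:
        path = urlsplit(url).path
        if regex and not regex.search(path):
            continue
        if exts and not path.lower().endswith(exts):
            continue
        yield url
```

Because it is a generator, it composes with a streaming source of URLs without holding the full list in memory.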

Gotcha

The primary limitation is dependency on CommonCrawl's coverage and freshness. CommonCrawl crawls monthly, but doesn't index the entire web—it focuses on public, linkable content. If your target domain has poor link popularity, blocks crawlers, or launched recently, cc.py will return sparse or empty results. I've seen it return nothing for legitimate six-month-old startups while yielding 2 million URLs for enterprise domains. There's no feedback mechanism to indicate whether zero results means 'not crawled' versus 'genuinely no URLs'.

Performance is highly variable and network-dependent. Querying large domains can take 30+ minutes even with multithreading, as you're downloading potentially gigabytes of index data from CommonCrawl's S3-backed CDN. The tool provides no progress indicators beyond URL output, so long-running queries feel unresponsive. Additionally, the same URL often appears in many different crawls, so pipe the output through sort -u if you need a guaranteed-unique list. The lack of built-in HTTP client retry logic means transient network failures can cause incomplete results with no error indication.
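
One way to paper over the missing retry logic is to wrap each index query yourself. The sketch below is a generic exponential-backoff wrapper, not part of cc.py; `with_retries` and all of its parameters are assumptions for illustration. Note that the wrapped callable must do the whole fetch before returning (for example by consuming a generator into a list or a file), otherwise failures would surface outside the retry loop.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper for illustration. Waits base_delay, then 2x, 4x, ...
    between attempts. In real use, narrow retry_on to network errors only.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            # No point sleeping after the final failed attempt
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A call might look like `with_retries(lambda: list(fetch_urls(domain, index_name)))`, at the cost of buffering one index's results in memory. Raising on final failure at least turns a silently incomplete run into a visible error.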

Verdict

Use if: You're performing reconnaissance on established domains (3+ years old) for bug bounty or penetration testing engagements where discovering deprecated endpoints, forgotten admin interfaces, or historical API versions provides significant value. It's particularly powerful when combined with tools like ffuf or httpx to validate which historical URLs remain accessible. Also ideal for OSINT investigations requiring temporal analysis of a domain's web presence evolution.

Skip if: You're targeting recently launched applications, need real-time URL discovery, require sophisticated filtering beyond basic domain matching, or want a batteries-included solution with deduplication and result ranking. For those cases, consider gau (multi-source aggregation) or waybackurls (better maintained with Wayback Machine support). Also skip if you're working in bandwidth-constrained environments, since the tool can download gigabytes of index data for popular domains.
