Mining CommonCrawl's Petabyte Archives for Subdomain Intelligence

Hook

Your target's forgotten staging server from 2015 is still running with default credentials. Modern subdomain enumeration tools won't find it, but if it was ever publicly linked, CommonCrawl probably remembers it.

Context

Traditional subdomain enumeration has always focused on the present. Tools scrape certificate transparency logs, brute-force DNS records, or parse current search engine results. This approach works well for active infrastructure but creates a dangerous blind spot: historical subdomains that still resolve but are no longer actively linked or advertised. A company might have marked a staging environment as decommissioned in its documentation and removed the links from its main site, yet left the actual server running with outdated software and weak security.

CommonCrawl maintains one of the internet's largest historical archives: over 250 billion web pages captured since 2008, with new snapshots added monthly. This dataset adds a temporal dimension that most reconnaissance workflows ignore. CCrawlDNS bridges this gap by treating CommonCrawl as a passive subdomain enumeration source, essentially letting you ask "what subdomains existed on this domain at any point in the last 15 years?" For red teamers, penetration testers, and security researchers, this historical perspective often reveals forgotten infrastructure, which tends to be among the softest targets in an attack surface.

Technical Insight

[Figure: System architecture (auto-generated). Components shown: User Input (Target Domain), Query Builder (*.domain.com/*), CommonCrawl Index API (CDX Records, streamed), CDX Parser (Extract URLs), Deduplication Engine (Check/Store), Local SQLite Database (Unique Subdomains), Web Fingerprinting (Paths & Languages), Results Output, and Auto Throttle (Rate Limiting: 429 Response, Retry Delay).]

CCrawlDNS operates through a straightforward but clever architecture. It queries CommonCrawl's Index Server API, which provides search functionality across their petabyte-scale CDX (Capture Index) files. These index files contain metadata about every URL CommonCrawl has ever crawled—domain, path, timestamp, and MIME type—without requiring you to download the actual WARC files containing the page content.

The tool constructs queries in the format *.target.com/* to retrieve all subdomains, then parses the returned CDX records to extract unique subdomain names. Here's a simplified example of how the core query mechanism works:

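import json
import re  # used by the fingerprinting example further down
from urllib.parse import urlparse

import requests

def query_commoncrawl(domain, dataset_index):
    # CommonCrawl indexes are named CC-MAIN-YYYY-WW
    index_url = f"https://index.commoncrawl.org/{dataset_index}-index"
    params = {"url": f"*.{domain}/*", "output": "json"}

    subdomains = set()

    # Stream the response: a large domain can return a very long list of CDX records
    response = requests.get(index_url, params=params, stream=True, timeout=60)
    if response.status_code == 404:
        # Assumption: the index server reports "no captures" as a 404
        return subdomains
    response.raise_for_status()  # surface 429/5xx instead of failing in json.loads

    for line in response.iter_lines():
        if line:
            record = json.loads(line)  # one JSON object per line
            # hostname is lowercased and excludes any port, unlike netloc
            hostname = urlparse(record.get('url', '')).hostname or ''

            # Keep only hosts that sit under the target domain
            if hostname.endswith(f".{domain}"):
                subdomains.add(hostname)

    return subdomains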

The real intelligence in CCrawlDNS comes from its database persistence layer. Rather than re-querying CommonCrawl's API repeatedly (which is slow and subject to rate limiting), the tool maintains a local SQLite database tracking which domains have been queried and which datasets have been searched. This means you can run incremental scans—check the last three months of new CommonCrawl data without re-processing years of historical records you've already searched.
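The bookkeeping described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not CCrawlDNS's actual schema: the table names (`searched`, `subdomains`) and the helper functions are hypothetical, chosen only to show how tracking (domain, dataset) pairs makes incremental scans possible.

```python
import sqlite3

def open_db(path=":memory:"):
    # Hypothetical schema: one row per (domain, dataset) already searched,
    # and one row per unique subdomain found for a domain.
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS searched (
            domain  TEXT NOT NULL,
            dataset TEXT NOT NULL,
            PRIMARY KEY (domain, dataset)
        );
        CREATE TABLE IF NOT EXISTS subdomains (
            domain        TEXT NOT NULL,
            subdomain     TEXT NOT NULL,
            first_dataset TEXT NOT NULL,
            PRIMARY KEY (domain, subdomain)
        );
    """)
    return conn

def pending_datasets(conn, domain, available):
    # Return the datasets in `available` not yet searched for `domain`,
    # preserving the caller's order.
    rows = conn.execute(
        "SELECT dataset FROM searched WHERE domain = ?", (domain,)
    )
    done = {row[0] for row in rows}
    return [d for d in available if d not in done]

def record_results(conn, domain, dataset, found):
    # INSERT OR IGNORE keeps the first sighting of a subdomain and
    # silently drops repeats from later datasets.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO subdomains VALUES (?, ?, ?)",
            [(domain, sub, dataset) for sub in sorted(found)],
        )
        conn.execute(
            "INSERT OR IGNORE INTO searched VALUES (?, ?)", (domain, dataset)
        )
```

With this layout, an incremental scan reduces to calling `pending_datasets` with the current dataset list and querying only what it returns.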

The tool also implements automatic throttling to respect CommonCrawl's infrastructure. Making thousands of API requests in rapid succession will get you rate-limited, so CCrawlDNS spaces out its requests and waits before retrying when the API answers with HTTP 429. For large reconnaissance campaigns targeting multiple domains or comprehensive historical searches, this throttling means scans can take hours or even days, but the tradeoff is access to a data source that competitors likely aren't using.
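One common way to implement this kind of throttling is exponential backoff on HTTP 429. The sketch below is an assumption about the general technique, not CCrawlDNS's actual retry logic: `fetch_with_backoff`, its parameters, and the doubling schedule are all illustrative. The request itself is passed in as a callable so the retry policy stays independent of the HTTP library.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops answering 429, doubling the wait each time.

    fetch is any zero-argument callable returning (status_code, body).
    sleep is injectable so the policy can be exercised without real delays.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(delay)   # back off before the next attempt
            delay *= 2     # 1s, 2s, 4s, ... with the default base_delay
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

In practice `fetch` would wrap a single index request, for example a lambda that performs the GET and returns its status code and content.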

Beyond subdomain extraction, CCrawlDNS performs path and language fingerprinting from the CDX records. Since each record includes the full URL path and MIME type, the tool can identify patterns like /admin/, /staging/, /dev/, or detect servers serving content in specific languages. This metadata provides operational context about each subdomain without requiring you to actively probe the targets:

def fingerprint_paths(cdx_records):
    # Loose substring patterns: '/dev' also matches '/developer', so treat
    # hits as leads to review rather than confirmed findings
    interesting_paths = {
        'admin': re.compile(r'/admin|/administrator|/wp-admin'),
        'dev': re.compile(r'/dev|/development|/test|/staging'),
        'api': re.compile(r'/api/|/rest/|/graphql'),
        'backup': re.compile(r'\.bak$|\.backup$|\.old$|/backup/')
    }

    findings = {}
    for record in cdx_records:
        parsed = urlparse(record['url'])  # parse once, reuse for host and path
        subdomain = parsed.netloc

        for category, pattern in interesting_paths.items():
            if pattern.search(parsed.path):
                findings.setdefault(subdomain, set()).add(category)

    # Maps each subdomain to the set of categories seen in its captured paths
    return findings
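The language side of the fingerprinting can be sketched in the same passive style. This example assumes each CDX record may carry a comma-separated `languages` field (for example "eng,fra"), as newer CommonCrawl index records appear to; older records may lack it, so the code skips records without one. The function name `tally_languages` is hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def tally_languages(cdx_records):
    # Count how often each language code appears per hostname.
    per_host = defaultdict(Counter)
    for record in cdx_records:
        host = urlparse(record.get("url", "")).hostname
        langs = record.get("languages")  # assumed field; absent on some records
        if not host or not langs:
            continue
        for code in langs.split(","):
            per_host[host][code.strip()] += 1
    # Most frequent language first for each host
    return {host: counts.most_common() for host, counts in per_host.items()}
```

A host whose captures are dominated by an unexpected language can hint at a regional office or an outsourced environment worth a closer look.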

The temporal querying capability is particularly powerful. You can scope searches to specific years or datasets, enabling targeted reconnaissance. For instance, if you know a company underwent a major infrastructure migration in 2019, you can query the 2018-2019 datasets to identify old infrastructure that might still be accessible. Alternatively, you can start with a quick pass over only the most recent year, then expand further back if the initial results are promising.
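Scoping by year comes down to filtering dataset identifiers, since names of the form CC-MAIN-YYYY-WW embed the crawl year. The helper below is a hypothetical sketch of that filtering step, not a CCrawlDNS function. It assumes you already hold a list of dataset ids, for instance taken from the collection list CommonCrawl publishes at https://index.commoncrawl.org/collinfo.json.

```python
import re

# Matches the leading crawl year in ids such as CC-MAIN-2019-04
_YEAR = re.compile(r"^CC-MAIN-(\d{4})")

def datasets_for_years(dataset_ids, first_year, last_year):
    # Keep ids whose crawl year falls in [first_year, last_year],
    # preserving the input order; ids that do not match are skipped.
    selected = []
    for dataset_id in dataset_ids:
        match = _YEAR.match(dataset_id)
        if match and first_year <= int(match.group(1)) <= last_year:
            selected.append(dataset_id)
    return selected

ids = ["CC-MAIN-2020-05", "CC-MAIN-2019-51", "CC-MAIN-2018-05", "CC-MAIN-2017-51"]
print(datasets_for_years(ids, 2018, 2019))
# ['CC-MAIN-2019-51', 'CC-MAIN-2018-05']
```

The resulting list can be fed one id at a time to a query function such as `query_commoncrawl`.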

Gotcha

CCrawlDNS's fundamental limitation is that it's entirely dependent on CommonCrawl's crawling decisions. If a subdomain was never discovered by CommonCrawl's crawlers—perhaps because it was never publicly linked, was behind robots.txt restrictions, or existed only between CommonCrawl's monthly snapshot windows—it won't appear in results. Internal subdomains, development environments without public links, and infrastructure added in the last few weeks simply won't be there. You're viewing the internet through CommonCrawl's lens, not conducting active reconnaissance.

Performance is the second major constraint. Querying comprehensive historical data (all available years and datasets) for a popular domain can take hours due to API rate limiting and the sheer volume of data. The tool appears to be single-threaded, processing one API request at a time. For reconnaissance campaigns targeting dozens or hundreds of domains, this becomes impractical, so you'll need to choose between speed (query only recent datasets) and comprehensiveness (query all history). Deduplication also happens only after retrieval: if a subdomain appears in 50 different monthly snapshots, its records are still fetched and parsed 50 times, even if the database layer ultimately stores it once. For large enterprises with an extensive web presence, the result sets can become unwieldy without post-processing and filtering.

Verdict

Use if: You're conducting thorough attack surface assessments on mature organizations where historical infrastructure matters, performing red team reconnaissance where discovering forgotten staging/dev environments could provide initial access, doing bug bounty work on established programs where other hunters are likely only using modern CT logs, or building long-term intelligence on a target where understanding infrastructure evolution over years provides strategic value. CCrawlDNS excels in scenarios where stealth and historical depth trump speed.

Skip if: You need real-time subdomain discovery for recently launched domains (use certificate transparency via crt.sh or Subfinder instead), you're under time pressure requiring results in minutes not hours (Amass with its multi-source approach is faster for comprehensive current enumeration), you're targeting infrastructure with strong security practices that decommissions and properly shuts down old systems (the tool's value prop disappears), or you need active validation that subdomains currently resolve (CCrawlDNS only tells you what existed historically, not what's live now; pair it with massdns or httpx for validation).
