Mining the Internet's DNS Records: How DNSpop Reveals What Subdomains Actually Exist

Hook

The subdomain 'www' appears on 37% of domains, but what comes second? Most security researchers guess wrong—and that gap in knowledge costs them during reconnaissance.

Context

When penetration testers enumerate subdomains, they typically rely on wordlists cobbled together from various sources or generate permutations algorithmically. The problem? These approaches are either based on outdated assumptions or pure speculation. A tester might try 'dev.target.com' and 'staging.target.com' because those seem logical, but what if most organizations actually use 'uat' or 'preprod' instead?

This is where empirical data becomes invaluable. Rapid7's Project Sonar publishes Forward DNS (FDNS) datasets: the responses to DNS queries for the hostnames Sonar has collected from sources such as reverse DNS records and TLS certificates. Each dataset contains billions of DNS records, but the raw data, hundreds of gigabytes compressed, is unwieldy and difficult to analyze. DNSpop bridges this gap by processing these massive datasets to extract actionable intelligence: which subdomains actually appear most frequently across the real internet. Instead of guessing what subdomains might exist, security researchers can now reference actual usage patterns derived from billions of observed records.

Technical Insight

DNSpop's architecture is deceptively simple but thoughtfully designed. At its core, the main shell script subpop.sh orchestrates a pipeline that downloads compressed FDNS data, decompresses it on-the-fly, and processes each DNS record through a series of transformations. The elegance lies in how it handles scale: rather than loading entire datasets into memory, it uses stream processing with standard Unix tools.

The critical challenge in this analysis is distinguishing between true subdomains and the complexity of public suffixes. A naive approach might treat 'example.co.uk' as having a subdomain 'example' under 'co.uk', when 'co.uk' is actually the public suffix and 'example' is the registered domain. This is where the Python helper script suffix_strip.py becomes essential. It leverages the Public Suffix List (PSL) to correctly identify the boundary between registered domains and their actual subdomains:

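import sys
import publicsuffix2

# Load the Public Suffix List once, before reading any input
psl = publicsuffix2.PublicSuffixList()

for line in sys.stdin:
    # Normalize: trim whitespace, drop a trailing root dot, lowercase
    domain = line.strip().rstrip('.').lower()
    if not domain:
        continue
    try:
        # Get the registered domain (e.g., 'example.co.uk')
        registered = psl.get_public_suffix(domain)
    except Exception:
        # Skip names the library cannot parse, but let Ctrl-C propagate
        continue
    # Extract everything before the registered domain
    if registered and domain != registered and domain.endswith('.' + registered):
        print(domain[:-len(registered) - 1])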

This script acts as a filter in the pipeline, taking fully-qualified domain names and outputting only the subdomain portion. When 'mail.example.co.uk' enters, 'mail' comes out. When 'www.shop.example.com' enters, 'www.shop' emerges—capturing multi-level subdomain patterns that are surprisingly common.
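To make the boundary logic concrete without installing anything, here is a minimal sketch that uses a three-entry toy suffix set in place of the real Public Suffix List. The names `TOY_SUFFIXES` and `split_subdomain` are hypothetical and not part of DNSpop, and the sketch ignores the wildcard and exception rules that the real list contains.

```python
# Toy stand-in for the Public Suffix List (assumption: a real run would
# load the full list, including its wildcard and exception rules).
TOY_SUFFIXES = {"com", "uk", "co.uk"}


def split_subdomain(fqdn, suffixes=TOY_SUFFIXES):
    """Return (subdomain, registered_domain), or None if no split applies."""
    labels = fqdn.strip().rstrip(".").lower().split(".")
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the name is itself a public suffix
            # The registered domain is the suffix plus one label to its left.
            registered = ".".join(labels[i - 1:])
            subdomain = ".".join(labels[:i - 1])
            return subdomain, registered
    return None  # no known suffix matched
```

Under these assumptions, 'mail.example.co.uk' splits at 'example.co.uk' rather than at 'co.uk', which is exactly the mistake the naive approach makes. A name with no labels in front of the registered domain yields an empty subdomain, which a filter like the one above would drop.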

The shell script then aggregates these subdomains using classic Unix text processing. After extracting subdomains, it pipes them through sort | uniq -c | sort -rn to count occurrences and rank by frequency. This approach, while not as performant as modern big data frameworks like Spark, has the advantage of transparency and minimal dependencies:

# Simplified excerpt of the pipeline concept
zcat fdns_*.json.gz | \
  jq -r '.name' | \
  python3 suffix_strip.py | \
  sort | \
  uniq -c | \
  sort -rn > subdomain_popularity.txt

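The actual implementation includes additional filtering to handle malformed records, exclude certain patterns, and manage the massive data volume efficiently. Most stages process one record at a time, so memory use stays flat regardless of input size. The sort stage is the exception, since it must see every line before emitting any, but GNU sort spills to temporary files on disk rather than holding everything in RAM. Together this lets DNSpop handle datasets exceeding available RAM, which is critical when working with 100GB+ compressed files that expand to a terabyte or more.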
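The same streaming idea can be sketched in Python, assuming each input line is a JSON object with a 'name' field, as the jq filter in the excerpt implies. The functions `iter_names` and `rank` are hypothetical helpers for illustration, not DNSpop's code.

```python
import gzip
import json
from collections import Counter


def iter_names(path):
    """Lazily yield the 'name' field from each JSON line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting the run
            name = record.get("name") if isinstance(record, dict) else None
            if isinstance(name, str) and name:
                yield name


def rank(items):
    """Count items from any iterable and return (item, count), most common first."""
    counts = Counter()
    for item in items:
        counts[item] += 1
    return counts.most_common()
```

One design difference is worth noting: the generator keeps only one record in memory at a time, but the `Counter` holds one entry per distinct item. For inputs with a very large number of distinct subdomains, an on-disk sort may scale further than this in-memory tally.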

What makes this particularly valuable is that it captures real-world behavior, not theoretical possibilities. The results reveal that after 'www', subdomains like 'mail', 'webmail', 'smtp', and 'ftp' dominate—but also surface less obvious patterns like 'cpanel', 'webdisk', and 'autodiscover' that reflect actual hosting infrastructure. Multi-level subdomains also appear in the data, showing patterns like 'mail.admin' or 'api.v1' that wouldn't surface in simple wordlist generation.
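A ranked list like this can be turned into enumeration candidates with a few lines of code. The sketch below assumes the input is in the 'COUNT LABEL' layout that uniq -c emits. The layout of the repository's published result files may differ, and `parse_ranked` and `candidates` are hypothetical names.

```python
def parse_ranked(lines):
    """Yield (label, count) pairs from 'COUNT LABEL' lines, skipping anything else."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            yield parts[1], int(parts[0])


def candidates(lines, target, top_n):
    """Build hostnames for `target` from the `top_n` most frequent labels."""
    # sorted() is stable, so labels with equal counts keep their input order.
    ranked = sorted(parse_ranked(lines), key=lambda pair: -pair[1])
    return [f"{label}.{target}" for label, _ in ranked[:top_n]]
```

Called with an open file handle and a target such as 'example.test', this returns the hostnames to try first. Multi-level entries such as 'www.shop' pass through unchanged, since the label is treated as an opaque string.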

Gotcha

The primary limitation is accessibility. Running DNSpop yourself requires downloading Project Sonar's FDNS datasets, which are typically 80-200GB compressed and expand to well over a terabyte. You'll need significant bandwidth, storage, and processing time—expect the analysis to take hours or days depending on your hardware. For most users, this is impractical, which is why the real value lies in the pre-computed results already in the repository rather than the analysis code itself.

There's also a temporal problem: DNS patterns evolve. The subdomain popularity you derive from a dataset is a snapshot of when that scan was performed. New technologies emerge, hosting patterns shift, and security practices change. A list generated from 2022 data might, for instance, under-rank a label like 'nextcloud' if it gained popularity afterwards, or still rank 'svn' highly even as teams continue migrating to Git. The analysis is only as fresh as your source data, and re-running it regularly to maintain current results requires sustained access to Rapid7's datasets and the infrastructure to process them. For most practical applications in security research, results that are 6-12 months old remain useful, but they shouldn't be treated as comprehensive ground truth.
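If you do have two snapshots, a quick way to gauge how much the list has drifted is to compare their top-N sets. This is a minimal sketch: `drift` is a hypothetical helper, and it assumes each snapshot is already a list of labels ordered from most to least frequent.

```python
def drift(old_ranked, new_ranked, n):
    """Return (entered, left): labels that joined or dropped out of the top n."""
    old_top = set(old_ranked[:n])
    new_top = set(new_ranked[:n])
    entered = sorted(new_top - old_top)
    left = sorted(old_top - new_top)
    return entered, left
```

A small or empty result for a generous N suggests the older list is still serviceable, while a long 'entered' list is a sign that a refresh would pay off.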

Verdict

Use DNSpop if you're conducting security research that benefits from empirically derived subdomain lists, building reconnaissance tooling, or studying internet infrastructure patterns at scale. The pre-computed results are immediately usable for penetration testing and bug bounty hunting: grab the output files and integrate them into your enumeration workflow. The methodology is also valuable if you're working with any large-scale DNS dataset and need a reference implementation for subdomain extraction with proper public suffix handling.

Skip it if you need real-time or frequently updated DNS data, lack the infrastructure to process 100GB+ datasets yourself, or want an automated API-based solution rather than static lists.

Most developers should treat DNSpop as a source of high-quality wordlists rather than a tool to run directly. Download the results, not the code, unless you're specifically researching DNS patterns or validating the methodology.
