Inside pzb/TLDs: The Repository That Captures DNS Data for America's Most Restricted Domains

Hook

The .gov and .mil top-level domains serve millions of users daily, yet unlike most generic TLDs, you can't simply request their zone files to see what's inside.

Context

In the DNS ecosystem, transparency is the norm for generic TLDs. ICANN's Centralized Zone Data Service (CZDS) gives approved researchers and organizations access to zone files for more than a thousand generic top-level domains: complete snapshots of every delegated domain within a TLD. But there are exceptions to this rule: certain restricted TLDs, including .gov, .mil, .edu, .int, and .arpa, don't participate in this system. These domains are run by specific entities, such as the Cybersecurity and Infrastructure Security Agency (.gov, which the U.S. General Services Administration administered until 2021) or the Department of Defense (.mil), which have operational security reasons for not publishing complete zone files.

This creates a significant blind spot for security researchers, DNS operators, and anyone building comprehensive internet infrastructure databases. If you're conducting security research on government infrastructure, analyzing DNS patterns, or building tools that need awareness of these domains, you're left to reconstruct this data from fragmented sources. The pzb/TLDs repository exists to fill this gap, serving as a manually curated collection that aggregates DNS and WHOIS data for these five restricted TLDs from multiple public sources including government repositories, academic security scanning projects, and internet-wide surveys.

Technical Insight

[Figure: System architecture (auto-generated). Diagram labels: Data Sources (GSA/data.gov, Educause DB, DoD Listings, Internet Scans, zone recon); TLDs Repository (arpa/, gov/, edu/, mil/, and int/ directories); Data Files (Domain Lists, DNS Records, NSEC3 Hashed Records, isolated to reduce churn).]

At its core, pzb/TLDs is a data archaeology project rather than a traditional software repository. The architecture is deliberately simple: separate directories for each TLD (arpa/, edu/, gov/, int/, mil/) containing flat text files with domain listings and associated DNS records. What makes this repository valuable isn't sophisticated code—it's the thoughtful curation and aggregation strategy.
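The layout described above lends itself to a very small loader. The following is a minimal sketch, assuming each TLD directory holds `*.txt` files with one name per line; the file naming and the `load_tld_lists` helper are illustrative assumptions, not something the repository documents.

```ruby
require 'set'

# The five TLD directories named in the article.
TLDS = %w[arpa edu gov int mil].freeze

# Returns { "gov" => Set[...], ... } for every TLD directory found under root.
# Assumption: domain lists are *.txt files with one name per line.
def load_tld_lists(root)
  TLDS.each_with_object({}) do |tld, lists|
    dir = File.join(root, tld)
    next unless Dir.exist?(dir)

    names = Dir.glob(File.join(dir, '*.txt')).sort.flat_map do |path|
      File.readlines(path, chomp: true)
    end
    # Normalize case and drop blank lines before deduplicating.
    lists[tld] = names.map { |n| n.strip.downcase }.reject(&:empty?).to_set
  end
end

# Usage against a local clone (hypothetical path):
#   lists = load_tld_lists('path/to/TLDs')
#   lists.each { |tld, names| puts "#{tld}: #{names.size} names" }
```

Returning a hash of sets keeps membership checks cheap and makes it obvious which TLD directories were actually present in the clone.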

One organizational decision stands out: NSEC3 hashed records are isolated into separate files. DNSSEC uses NSEC3 records for authenticated denial of existence, and each record's owner name is a hash computed from a domain name, a salt, and an iteration count. When a zone operator changes the salt, every hash in the zone changes, even if no domain was added or removed. By separating NSEC3 data, the repository prevents routine DNSSEC parameter updates from creating massive diffs that would obscure meaningful changes in actual domain registrations. If you're tracking the repository or building systems that consume this data, you can focus on the primary domain lists without noise from cryptographic housekeeping.
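To see why a salt change touches every NSEC3 record, it helps to compute the hash directly. The sketch below follows the iterated SHA-1 construction from RFC 5155 (hash the wire-format name plus salt, then re-hash the digest plus salt for each extra iteration) and encodes the result in base32hex. The domain name, salts, and iteration count are made-up inputs for illustration.

```ruby
require 'digest'

BASE32HEX = '0123456789abcdefghijklmnopqrstuv'.freeze

# Encode a domain name in DNS wire format: length-prefixed lowercase labels,
# terminated by the zero-length root label.
def wire_name(name)
  labels = name.downcase.chomp('.').split('.')
  labels.map { |l| [l.bytesize].pack('C') + l.b }.join + "\x00".b
end

# Base32hex without padding (a 160-bit SHA-1 digest maps to 32 characters).
def base32hex(bytes)
  bytes.unpack1('B*').scan(/.{5}/).map { |chunk| BASE32HEX[chunk.to_i(2)] }.join
end

# NSEC3 hash per RFC 5155: one initial hash plus `iterations` re-hashes,
# appending the salt each time.
def nsec3_hash(name, salt_hex, iterations)
  salt = [salt_hex].pack('H*')
  digest = Digest::SHA1.digest(wire_name(name) + salt)
  iterations.times { digest = Digest::SHA1.digest(digest + salt) }
  base32hex(digest)
end

# Same name, two salts: the hashed owner name is completely different.
%w[aabbccdd 11223344].each do |salt|
  puts "salt=#{salt} #{nsec3_hash('example.gov', salt, 5)}"
end
```

Because the salt feeds into every hash, a single parameter rollover rewrites every line of an NSEC3 file, which is the churn the separate files keep away from the domain lists.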

The data compilation process relies on multiple upstream sources. For .gov domains, the repository draws from the GSA's government repositories and data.gov datasets. For .edu, it aggregates information from Educause and academic institution databases. The .mil data comes from Department of Defense public listings. For domains that appear in internet-wide scans, the repository incorporates data from the University of Michigan's security scanning projects and certificate data from Censys.io.
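The aggregation step can be pictured as a normalize-and-merge over several name lists. This is a minimal sketch of that idea, not the repository's actual tooling; the source labels (`gsa`, `scan`) and the `merge_sources` helper are hypothetical.

```ruby
# Lowercase, trim whitespace, and drop the trailing root dot so that
# "Example.GOV." and "example.gov" compare equal.
def normalize(name)
  name.strip.downcase.chomp('.')
end

# sources: { "label" => [names] }
# Returns { domain => [labels that listed it] }, sorted by domain.
def merge_sources(sources)
  merged = Hash.new { |h, k| h[k] = [] }
  sources.each do |label, names|
    names.each do |raw|
      domain = normalize(raw)
      next if domain.empty?

      merged[domain] << label unless merged[domain].include?(label)
    end
  end
  merged.sort.to_h
end

merged = merge_sources(
  'gsa'  => ['Example.GOV.', 'agency.gov'],
  'scan' => ['example.gov']
)
merged.each { |domain, labels| puts "#{domain}: #{labels.join(', ')}" }
```

Tracking which sources listed each name is a design choice worth considering: a domain seen only in scan data may deserve less trust than one that also appears in an official listing.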

Here's a practical example of how you might work with this data. Suppose you're building a security tool that needs to validate whether a domain claiming to be a U.S. government site is legitimate:

require 'set'

# Load .gov domains from the repository.
# A constant is used because Ruby methods cannot see top-level local variables.
GOV_DOMAINS = File.readlines('gov/domains.txt', chomp: true)
                  .map { |line| line.strip.downcase }
                  .reject(&:empty?)
                  .to_set

def verify_gov_domain(domain)
  # Extract the registered domain from a full hostname
  # e.g., "subdomain.agency.gov" -> "agency.gov"
  registered_domain = domain.downcase.chomp('.').split('.').last(2).join('.')

  if GOV_DOMAINS.include?(registered_domain)
    puts "✓ #{domain} is registered under a listed .gov domain"
    true
  else
    puts "✗ WARNING: #{domain} claims .gov but is not in the list"
    false
  end
end

verify_gov_domain("www.whitehouse.gov")  # ✓ if whitehouse.gov is in the list
verify_gov_domain("phishing.gov")        # ✗ if phishing.gov is not in the list

This simple validation becomes powerful when you realize that the alternative would require either scraping numerous government websites, paying for commercial DNS intelligence services, or somehow gaining access to restricted zone data. The repository provides a single reference dataset that you can clone, pin to a known commit, and integrate into your security infrastructure.

For DNS researchers, the repository enables trend analysis that would otherwise be impossible. You can clone the repository at different points in time and diff the data to understand domain registration patterns, track government digital infrastructure expansion, or identify domains that have been retired. Since Git preserves the full history, you're essentially getting a time-series database of restricted TLD registrations:

# Compare .gov domains between two commits to find new registrations
git diff HEAD~10 HEAD -- gov/domains.txt | grep '^+' | grep -v '^+++'

# Count total .mil domains over time
# (tformat ends every entry with a newline, so the loop also sees the last commit)
git log --all --pretty=tformat:'%H %ai' -- mil/domains.txt | \
  while read -r commit date time tz; do
    count=$(git show "$commit:mil/domains.txt" | wc -l)
    echo "$date: $count domains"
  done
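The same comparison can be done in code once two snapshots are loaded as lists. The sketch below computes additions and removals with set difference; the `snapshot_diff` helper and the sample names are hypothetical, and in practice each list would come from something like `git show <commit>:gov/domains.txt`.

```ruby
require 'set'

# Compare two snapshots of a domain list and report what changed.
def snapshot_diff(old_names, new_names)
  old_set = old_names.to_set
  new_set = new_names.to_set
  {
    added:   (new_set - old_set).sort,  # present only in the newer snapshot
    removed: (old_set - new_set).sort   # present only in the older snapshot
  }
end

older = %w[agency.gov retired.gov]
newer = %w[agency.gov newoffice.gov]
changes = snapshot_diff(older, newer)
puts "added:   #{changes[:added].join(', ')}"
puts "removed: #{changes[:removed].join(', ')}"
```

Unlike a line-based diff, set difference ignores reordering, so a re-sorted file does not show up as a wave of spurious changes.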

The data format itself is intentionally simple—plain text files with one domain per line, sometimes supplemented with DNS record data in zone file format. This simplicity is a feature, not a limitation. Any programming language can parse these files without specialized libraries, making the repository maximally accessible whether you're working in Python, Go, JavaScript, or shell scripts.
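As a rough illustration of how little parsing is needed, the sketch below accepts either a bare domain name or a simple single-line zone-file record (`name [ttl] [class] type rdata`). It is a simplification under stated assumptions: it does not handle multi-line records, `$ORIGIN` directives, or semicolons inside quoted TXT data, and the `Record` struct is a hypothetical container.

```ruby
Record = Struct.new(:name, :type, :rdata)

# Parse one line that is either a bare domain or a single-line zone record.
# Returns nil for blank lines and comment-only lines.
def parse_line(line)
  text = line.sub(/;.*/, '').strip          # drop zone-file comments
  return nil if text.empty?

  fields = text.split(/\s+/)
  name = fields.first.downcase.chomp('.')
  return Record.new(name, nil, nil) if fields.size == 1

  rest = fields.drop(1)
  # Skip the optional TTL and class fields, in whichever order they appear.
  rest.shift while rest.first && (rest.first =~ /\A\d+\z/ || rest.first.upcase == 'IN')
  type = rest.shift
  Record.new(name, type&.upcase, rest.join(' '))
end

[
  'Example.GOV',
  "agency.gov.\t86400\tIN\tNS\tns1.agency.gov.",
  '; comment only'
].each { |line| p parse_line(line) }
```

Returning the same struct for both line shapes lets downstream code treat a plain domain list and a record dump uniformly.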

Gotcha

The elephant in the room is data completeness and freshness. This repository is fundamentally a best-effort reconstruction of data that isn't meant to be public in aggregate form. Unlike the official CZDS service where you're getting authoritative zone files directly from TLD operators, pzb/TLDs is assembling a puzzle from pieces found in various public places. There will be gaps. Newly registered domains might not appear until the next upstream data source publishes, which could be days or weeks. Domains that are registered but not actively used (no public-facing web servers, no certificate transparency logs) might be entirely invisible to the data collection methodology.

The update frequency is another limitation. Looking at the repository's commit history reveals irregular updates—sometimes weeks or months apart. If you're building a system that requires up-to-the-minute accuracy for security decisions, this isn't your solution. You're essentially trading timeliness for accessibility. The repository also offers no programmatic API or automated update mechanism. You're expected to clone the repository and handle your own refresh strategy, which means building Git polling into your infrastructure or settling for whatever freshness your manual update schedule provides.
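Since refresh is left to the consumer, one modest safeguard is to refuse to trust a clone that is older than some threshold. The sketch below is one way to express that check; the 30-day limit and the `stale?` helper are arbitrary assumptions chosen for illustration.

```ruby
SECONDS_PER_DAY = 86_400

# True when the newest commit is older than max_age_days.
# Both timestamps are Unix epoch seconds.
def stale?(last_commit_epoch, now_epoch, max_age_days)
  (now_epoch - last_commit_epoch) > max_age_days * SECONDS_PER_DAY
end

# In a real checkout the timestamp could come from git, for example:
#   last = `git -C path/to/TLDs log -1 --format=%ct`.to_i
#   warn 'domain data may be out of date' if stale?(last, Time.now.to_i, 30)

now = Time.now.to_i
puts stale?(now - 45 * SECONDS_PER_DAY, now, 30)  # 45 days old against a 30-day limit
puts stale?(now - 5 * SECONDS_PER_DAY, now, 30)   # 5 days old against a 30-day limit
```

Passing the current time in as an argument keeps the check easy to test and avoids hidden dependence on the system clock.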

There's also a conceptual limitation worth understanding: this repository only covers five TLDs. If your use case involves comprehensive domain intelligence across the internet, you'll need to combine this with other data sources like CZDS for standard TLDs, country-code TLD zone files (which have varying availability policies), and commercial DNS intelligence services. The repository is solving a specific visibility gap, not providing universal coverage.

Verdict

Use pzb/TLDs if you're conducting security research on U.S. government or educational infrastructure, building domain validation systems that need to verify .gov or .mil authenticity, or analyzing DNS patterns in these restricted TLDs where official zone file access is unavailable. The repository provides unique value as a curated, version-controlled dataset that would be extremely time-consuming to assemble independently. It's particularly valuable for academic research, security tooling, and compliance systems that need to maintain awareness of these domains without commercial data subscriptions.

Skip if you need real-time domain registration data, require SLA-backed accuracy guarantees for production security decisions, or are working primarily with standard TLDs that already provide zone file access through ICANN's official channels. This is a research dataset and reference resource, not a production-grade service, and should be treated accordingly in your architecture.
