Mining Certificate Transparency Logs for Subdomain Intelligence

Hook

Every hour, thousands of SSL certificates are issued across the internet—and each one broadcasts subdomain patterns that reveal how companies structure their infrastructure. One developer turned this public data stream into a continuously updated intelligence feed.

Context

Before browsers began enforcing Certificate Transparency for newly issued certificates in 2018, subdomain discovery was a cat-and-mouse game. Security researchers and penetration testers relied on DNS brute-forcing (hammering nameservers with wordlists), zone transfer attempts (rarely successful), or scraping search engines and web archives. These approaches were noisy, incomplete, or computationally expensive.

Certificate Transparency changed the game by requiring Certificate Authorities to publicly log every SSL/TLS certificate they issue. What started as an accountability mechanism to prevent mis-issuance became an unexpected goldmine for reconnaissance. When a company provisions a new subdomain—staging.api.example.com, internal-vpn.example.com, dev-database.example.com—and secures it with HTTPS, that subdomain gets permanently recorded in CT logs. The internetwache/CT_subdomains project recognized that by continuously monitoring these logs and aggregating subdomain patterns across the entire internet, you could build a frequency-ranked dataset of how organizations actually name their infrastructure. Instead of guessing common patterns, you'd have empirical data about what subdomains appear most often in the wild.

Technical Insight

The architecture is elegantly simple: a listener process subscribes to the certstream WebSocket feed, which aggregates CT log entries from multiple log servers in near real-time. When a certificate is issued for "*.staging.example.com" or "api-v2.production.example.com", certstream broadcasts that event. The listener parses the Subject Alternative Name (SAN) fields from each certificate—where all covered domains live—extracts subdomains, and increments counters in a database.

The parsing logic needs to handle wildcard certificates carefully. A certificate for "*.example.com" doesn't tell you specific subdomains, but a certificate for "admin.staging.example.com" reveals three pieces of intelligence: the "admin" prefix is used, "staging" environments exist, and they're nested. The frequency accumulation is key—if "api" appears in 50,000 certificates and "dev-legacy-backup" appears in 3, the ranking reflects actual deployment patterns across thousands of organizations.

Here's how you'd implement a basic version of the listener yourself using certstream:

import certstream
from collections import Counter

subdomain_counts = Counter()
certs_seen = 0

def extract_subdomains(domains):
    """Return the subdomain labels found in a list of certificate names."""
    subdomains = []
    for domain in domains:
        # Skip wildcard entries; they name no specific host
        if domain.startswith('*.'):
            continue
        parts = domain.lower().split('.')
        if len(parts) > 2:
            # Naive split: treat the last two labels as the root domain,
            # e.g., "api.staging.example.com" -> ["api", "staging"].
            # Multi-label suffixes such as "co.uk" are miscounted.
            subdomains.extend(parts[:-2])
    return subdomains

def cert_callback(message, context):
    global certs_seen
    if message['message_type'] != 'certificate_update':
        return
    all_domains = message['data']['leaf_cert']['all_domains']
    subdomain_counts.update(extract_subdomains(all_domains))
    certs_seen += 1

    # Print the top 10 every 1000 certificates
    if certs_seen % 1000 == 0:
        print(subdomain_counts.most_common(10))

certstream.listen_for_events(cert_callback, url='wss://certstream.calidog.io/')

This script would likely start revealing patterns within minutes: "www" dominates, followed by "mail", "webmail", "ftp", and "cpanel" for hosting providers. You'd see environment indicators like "dev", "staging", "prod", "test". Numbered hosts such as "www2" and "www3" hint at load balancing. Technology stack hints appear too: "jenkins", "gitlab", "jira", "confluence".

The export mechanism runs hourly, querying the database for the most frequently observed subdomains and publishing lists at different tiers: top 100 for the most common patterns, top 1K for broader reconnaissance, and top 100K for comprehensive coverage. These lists get committed to the GitHub repository, creating a historical record. You can diff commits to see which new subdomain patterns entered the top rankings, potentially indicating emerging technologies or naming trends.
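The tiered export can be sketched in a few lines. This is a minimal illustration, not the project's actual exporter: it assumes the counts live in an in-memory Counter rather than a database, and the function name export_tiers, the file-name pattern, and the one-label-per-line format are all hypothetical.

```python
from collections import Counter
from pathlib import Path

# Tier sizes named in the article; the file names below are assumptions.
TIERS = (100, 1_000, 100_000)

def export_tiers(counts, out_dir, tiers=TIERS):
    """Write one file per tier, most frequent label first."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Sort by count descending, then by label, so output is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    written = []
    for n in tiers:
        path = out / f"top-{n}.txt"
        path.write_text("".join(f"{label}\n" for label, _ in ranked[:n]))
        written.append(path)
    return written
```

Because the ordering is deterministic, two consecutive exports differ only where the rankings actually changed, which keeps commit diffs readable.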

The frequency-based ranking serves a specific purpose for bug bounty hunters and pentesters. When testing "example.com", instead of trying random subdomain guesses, you start with empirically validated patterns: if "api" appears in 10% of all certificates on the internet, it's worth testing "api.example.com" first. The ranked lists become optimized wordlists, front-loaded with high-probability candidates. This transforms reconnaissance from guesswork into data-driven testing.
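As a rough sketch of that workflow, the snippet below turns a ranked list into candidate hostnames for one target and checks each with a DNS lookup. The helper names candidate_hosts and resolves are illustrative, and the labels are assumed to be already loaded into a Python list in rank order.

```python
import socket

def candidate_hosts(ranked_labels, target, limit=100):
    """Yield hostnames for the target, highest-ranked label first."""
    for label in ranked_labels[:limit]:
        label = label.strip().lower()
        if label:
            yield f"{label}.{target}"

def resolves(hostname):
    """Return True if the hostname resolves (performs a live DNS lookup)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

A typical loop would be `for host in candidate_hosts(labels, "example.com"): print(host, resolves(host))`, stopping early once the hit rate drops off further down the list.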

One architectural limitation worth noting: the system doesn't maintain parent domain relationships. When "staging" gets counted, there's no record of whether it came from "staging.shopify.com" or "staging.example.com". This makes the dataset useful for pattern discovery but not for targeted enumeration against a specific domain. For that use case, you'd query CT logs directly with domain filters via services like crt.sh or Facebook's CT API.
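For the targeted case, a query against crt.sh might look like the sketch below. It assumes crt.sh's JSON output mode (the `output=json` parameter) and a `name_value` field holding newline-separated names in each row; verify both against the live service before relying on them. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

def crtsh_url(domain):
    """Build a crt.sh JSON query for every name under the domain."""
    query = urllib.parse.urlencode({"q": f"%.{domain}", "output": "json"})
    return f"https://crt.sh/?{query}"

def names_from_entries(entries, domain):
    """Collect unique hostnames under the domain from crt.sh result rows."""
    suffix = "." + domain.lower()
    found = set()
    for entry in entries:
        # Assumption: name_value may hold several names separated by newlines
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # keep the parent of a wildcard entry
            if name.endswith(suffix):
                found.add(name)
    return sorted(found)

def fetch_subdomains(domain, timeout=30):
    """Query crt.sh over the network and return the parsed hostnames."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=timeout) as resp:
        return names_from_entries(json.load(resp), domain)
```

Unlike the ranked lists, every result here stays attached to the domain that was queried, which is exactly the relationship the aggregated dataset discards.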

Gotcha

The most significant blind spot is coverage: this approach only sees subdomains that receive certificates. Internal tools behind VPNs, development servers with self-signed certificates, or services that exclusively use wildcard certificates ("*.internal.example.com" covers everything without revealing specifics) never appear. Wildcard certificates are increasingly common for microservices architectures, which means modern cloud-native infrastructure is less visible in CT logs than traditional hosting setups.

The frequency ranking has a temporal problem—it's cumulative, not windowed. A subdomain that was extremely popular in 2019 but is now obsolete (perhaps a discontinued service or deprecated technology) remains highly ranked because historical observations never decay. The lists reflect "most frequently observed since data collection began", not "most actively used right now". For current reconnaissance, this means you might waste time testing patterns that were common for old WordPress hosting but irrelevant for modern SaaS infrastructure. The hourly update frequency also introduces lag: if a company spins up "emergency-patch.example.com" right now, it won't appear in your downloaded lists for at least 60 minutes, and won't rank highly until enough observations accumulate. For time-sensitive security work, this isn't truly real-time intelligence.
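One way to address the cumulative-count problem in your own pipeline is to let old observations fade. The sketch below uses exponential decay with a configurable half-life; it is an alternative technique, not something the project implements, and the class name DecayingCounter is hypothetical.

```python
import math

class DecayingCounter:
    """Scores that halve every `half_life` seconds without new sightings."""

    def __init__(self, half_life):
        self.half_life = half_life
        self._state = {}  # label -> (score, time of last update)

    def _decayed(self, label, now):
        score, last = self._state.get(label, (0.0, now))
        return score * math.pow(0.5, (now - last) / self.half_life)

    def add(self, label, now):
        """Record one sighting of the label at time `now` (seconds)."""
        self._state[label] = (self._decayed(label, now) + 1.0, now)

    def top(self, n, now):
        """Return the n highest-scoring labels as of time `now`."""
        scores = {label: self._decayed(label, now) for label in self._state}
        return sorted(scores, key=lambda k: (-scores[k], k))[:n]
```

With a half-life of, say, 90 days, a label that was common in 2019 but has not been seen since would fall out of the top tiers on its own, while a plain Counter would keep it there indefinitely.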

Verdict

Use if: You're doing broad reconnaissance across multiple targets and need a curated wordlist of empirically validated subdomain patterns to test, you're researching internet infrastructure trends and naming conventions at scale, or you want a low-effort way to discover common subdomain patterns without running your own CT log monitoring infrastructure. It's particularly valuable when you're starting bug bounty work on a new target and want to quickly test high-probability subdomains before investing in comprehensive enumeration.

Skip if: You need complete subdomain enumeration for a specific target (use active tools like Amass or query crt.sh directly with domain filters), require real-time alerting when new certificates are issued for domains you're monitoring, need subdomains mapped to their parent domains for targeted analysis, or are testing organizations that heavily use wildcard certificates or internal infrastructure. For production security monitoring, build your own certstream pipeline with domain-specific filtering rather than relying on generic ranked lists.
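A domain-specific filter for such a pipeline could start from something like the following. It is a minimal sketch: the WATCHED set and function names are hypothetical, and the message layout is the same one the listener shown earlier reads from certstream.

```python
WATCHED = {"example.com", "example.org"}  # hypothetical domains to monitor

def matching_names(all_domains, watched=WATCHED):
    """Return certificate names equal to or beneath a watched domain."""
    hits = []
    for name in all_domains:
        lowered = name.lower()
        # Compare wildcard entries by their parent domain
        bare = lowered[2:] if lowered.startswith("*.") else lowered
        if any(bare == d or bare.endswith("." + d) for d in watched):
            hits.append(name)
    return hits

def alert_callback(message, context):
    """certstream-style callback that reports only watched-domain certificates."""
    if message.get("message_type") != "certificate_update":
        return
    hits = matching_names(message["data"]["leaf_cert"]["all_domains"])
    if hits:
        print("new certificate:", ", ".join(hits))
```

Passing alert_callback to certstream.listen_for_events in place of the counting callback gives per-certificate alerts instead of aggregate rankings. Matching on a leading dot avoids false positives such as "notexample.com".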
