Subsearch: When Subdomain Enumeration Met Scala's Type Safety

Hook

While modern bug bounty hunters chase certificate transparency logs, subsearch tackled subdomain enumeration in 2016 with a technique many tools still skip: recursively scanning the CNAME and MX records it discovers.

Context

Before 2015, subdomain enumeration was a manual slog through DNS brute-forcing tools that treated each domain in isolation. Security researchers would run a wordlist against example.com, get a list of subdomains, then manually notice that api.example.com pointed to api-gateway.internal.example.com and realize they needed to scan internal.example.com separately. This manual recursion was error-prone and time-consuming.

Subsearch emerged in this gap as one of the first tools to automate recursive subdomain discovery. Built in Scala when most security tools were Python or C, it represented a bet that JVM concurrency primitives and functional programming patterns could handle the complex state management of tracking discovered domains, managing resolver pools, and coordinating multiple data sources. The tool combined passive reconnaissance (VirusTotal, DNS Dumpster) with active brute-forcing, all while maintaining awareness of rate limits and resolver blacklisting—problems that plagued earlier tools and got pentesters blocked mid-scan.

Technical Insight

Subsearch's core architectural insight was treating subdomain enumeration as a priority queue problem rather than a linear scan. When it discovers api.example.com, it doesn't just record the finding. It immediately parses the DNS response for CNAME, MX, NS, and SRV records, extracts any new subdomains from those records, and queues them for scanning. The effect is a queue-driven, roughly breadth-first traversal of DNS space: every name found at one level is scanned before the names it leads to.
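The traversal described above can be sketched in a few lines. This is a minimal illustration, not subsearch's source: DnsAnswer, discover, and the lookup function are hypothetical names, lookup stands in for a real resolver call, and a plain FIFO queue is used in place of a priority queue to keep the example short.

```scala
import scala.collection.mutable

object RecursiveDiscovery {
  // Hypothetical model: a resolved host plus the names its
  // CNAME/MX/NS/SRV records point at.
  final case class DnsAnswer(host: String, targets: List[String])

  // Breadth-first traversal over DNS names, starting from `seed`.
  // `lookup` is an assumed stand-in for an actual DNS query.
  def discover(seed: String,
               apex: String,
               lookup: String => Option[DnsAnswer]): List[String] = {
    val seen  = mutable.LinkedHashSet(seed) // names already queued
    val queue = mutable.Queue(seed)         // names waiting to be resolved
    val found = mutable.ListBuffer.empty[String]

    while (queue.nonEmpty) {
      val host = queue.dequeue()
      lookup(host).foreach { answer =>
        found += answer.host
        answer.targets
          .map(_.stripSuffix(".").toLowerCase)               // normalise FQDNs
          .filter(t => t == apex || t.endsWith("." + apex)) // stay in scope
          .foreach { t => if (seen.add(t)) queue += t }      // queue each name once
      }
    }
    found.toList
  }
}
```

The seen set is what keeps the recursion from looping when two records point at each other, and the scope filter keeps the scan from wandering into third-party domains such as CDNs.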

The resolver pool management demonstrates thoughtful concurrent design. Rather than hammering a single DNS server, subsearch accepts a list of resolvers and distributes queries across them. The --single-request flag implements a particularly clever anti-blacklisting strategy: it ensures each resolver only processes one request at a time, spreading your fingerprint thin enough that no single server sees suspicious traffic patterns:

// Simplified resolver pool architecture
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{ExecutionContext, Future}

case class ResolverPool(resolvers: Seq[String], singleRequest: Boolean)
                       (implicit ec: ExecutionContext) {
  private val queue = new LinkedBlockingQueue[String]()
  resolvers.foreach(queue.add)

  def withResolver[T](f: String => T): Future[T] = {
    val resolver = queue.take() // Blocks while every resolver is checked out
    if (singleRequest) {
      // Hold the resolver until its query completes: one in-flight request per server
      Future(f(resolver)).andThen { case _ => queue.add(resolver) }
    } else {
      queue.add(resolver) // Return to pool immediately so queries rotate freely
      Future(f(resolver))
    }
  }
}

This pattern means high-throughput scans rotate through resolvers continuously, while stealth scans (--single-request) check each resolver out until its query finishes, so callers block once every resolver is busy. It's rate limiting enforced by resolver availability rather than by timers.

The wordlist streaming implementation (fixed in v0.1.1) shows another pragmatic choice. Early versions loaded entire wordlists into memory, causing OOM errors on large dictionaries. The fix streams lines lazily:

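import scala.io.Source

// Lazily yields one candidate host name per wordlist line
val wordlistStream = Source.fromFile(wordlistPath)
  .getLines() // already an Iterator[String]; nothing is read until it is consumed
  .map(word => s"$word.$domain")

// Combines wordlist with discovered subdomains
val candidates = wordlistStream ++ discoveredQueue.iterator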

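Because the pipeline is an Iterator, a 10GB wordlist is read one line at a time (plus a small I/O buffer) instead of being loaded whole. Combined with the priority queue for discovered subdomains, this keeps the wordlist's share of memory constant regardless of its size; only the set of discovered subdomains grows as the scan proceeds.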

The CSV output format is deceptively simple but reveals careful thought about downstream tooling integration:

subdomain,ip,source,record_type
api.example.com,192.0.2.1,wordlist,A
api-gateway.internal.example.com,192.0.2.5,recursive_cname,A
mail.example.com,192.0.2.10,virustotal,MX

The source field tracks provenance—whether each subdomain came from brute-forcing, recursive discovery, or third-party APIs. This metadata is critical for pentesters documenting their methodology or researchers analyzing DNS infrastructure patterns. Modern tools often omit this, making it impossible to distinguish active scanning results from passive reconnaissance.
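As a rough illustration of why the source column matters downstream, the sketch below parses the four-column layout shown above and separates passive findings from active ones. Finding, parse, splitByMethod, and the contents of passiveSources are assumptions made for this example (including the "dnsdumpster" label); they are not part of subsearch.

```scala
object Provenance {
  // One row of the CSV layout shown in the article.
  final case class Finding(subdomain: String, ip: String, source: String, recordType: String)

  // Parses the simple four-column format; assumes no quoted fields or embedded commas.
  def parse(csv: String): List[Finding] =
    csv.split("\n").toList
      .map(_.trim)
      .filter(_.nonEmpty)
      .drop(1) // skip the header row
      .flatMap { line =>
        line.split(",", -1) match {
          case Array(sub, ip, src, rt) => Some(Finding(sub, ip, src, rt))
          case _                       => None // ignore malformed rows
        }
      }

  // Assumed labels for sources that never query the target's own infrastructure.
  val passiveSources: Set[String] = Set("virustotal", "dnsdumpster")

  // Returns (passive findings, active findings).
  def splitByMethod(findings: List[Finding]): (List[Finding], List[Finding]) =
    findings.partition(f => passiveSources.contains(f.source))
}
```

With a split like this, a report can state which hosts were learned without sending any traffic to the target and which required active probing.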

Scala's type system enforces DNS record type safety throughout. Rather than passing strings around, subsearch models DNS responses as algebraic data types, making invalid states unrepresentable. You can't accidentally treat an MX record as a CNAME because the compiler prevents it.
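A sealed trait with one case class per record type is the usual way to get this guarantee in Scala. The sketch below shows the idea; the type and field names are hypothetical and are not taken from subsearch's code.

```scala
object DnsModel {
  // Hypothetical ADT: each record type carries only the fields that make sense for it.
  sealed trait Record { def name: String }
  final case class A(name: String, address: String)                 extends Record
  final case class CName(name: String, target: String)              extends Record
  final case class Mx(name: String, priority: Int, exchange: String) extends Record
  final case class Ns(name: String, nameServer: String)             extends Record

  // Host names worth queueing for further scanning. Because Record is sealed,
  // the compiler warns if a record type is left out of this match.
  def followUps(record: Record): List[String] = record match {
    case A(_, _)            => Nil
    case CName(_, target)   => List(target)
    case Mx(_, _, exchange) => List(exchange)
    case Ns(_, nameServer)  => List(nameServer)
  }
}
```

An Mx value has no target field and a CName has no priority, so code that confuses the two fails to compile instead of failing mid-scan.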

Gotcha

The 2016 release date is the elephant in the room. VirusTotal's API is now on v3 with mandatory API keys and different endpoints, meaning subsearch's integration is almost certainly broken. DNS Dumpster's scraping logic likely fails against modern Cloudflare protections. The tool also predates Let's Encrypt's ubiquity and the resulting explosion of certificate transparency logs, which are now the single richest source of subdomain data and are completely absent from subsearch's approach.

Performance comparisons are sobering. Modern Go tools like subfinder scan thousands of subdomains per second using goroutines. Subsearch's JVM threads and Scala abstractions add overhead that was acceptable in 2016 but feels sluggish against 2024 competition. The cold start time alone (JVM initialization, class loading, JIT warmup) can exceed the total runtime of equivalent Go tools on small jobs. And requiring a Java 8+ installation is friction that Go's static binaries eliminate entirely. If you're integrating subdomain enumeration into CI/CD or serverless workflows, that startup penalty matters.

Verdict

Use if: You're maintaining a JVM-based security platform and need subdomain enumeration without shelling out to external processes, you're studying recursive DNS enumeration techniques and want readable source code that prioritizes clarity over performance, or you're operating in an environment where Scala's type safety prevents the kind of operational errors that can expose your scanning infrastructure. The recursive discovery architecture and resolver pool patterns remain instructive even if the implementation is dated.

Skip if: You need production-ready subdomain enumeration for pentesting or bug bounties (use subfinder or amass instead), you require certificate transparency integration or other modern passive sources, you're starting a new project and have any choice in language (Go or Rust tools are 10-100x faster), or you can't accept the maintenance risk of depending on an abandoned codebase with broken third-party integrations. The seven-year gap since the last commit isn't just technical debt—it's technical bankruptcy.
