Building a Terabyte-Scale Internet Intelligence Pipeline with inetdata

Hook

Every month, organizations like Rapid7 scan the entire IPv4 address space and publish terabytes of raw data for free. The problem? That data is nearly useless until you can query it, correlate it, and normalize it into something actionable.

Context

Security researchers and threat intelligence teams need comprehensive visibility into internet infrastructure: which domains resolve to which IPs, what certificates are being issued, how networks are organized. While sources like Rapid7's Project Sonar, Certificate Transparency logs, and DNS zone files provide this data freely, each source has different formats, update schedules, and access methods. A researcher investigating a phishing campaign might need to correlate passive DNS data from Sonar with certificate transparency logs and WHOIS records—but doing this manually means downloading hundreds of gigabytes, writing custom parsers for each format, and figuring out how to query datasets that don't fit in memory.

This is the gap inetdata fills. Created by HD Moore (founder of Metasploit), it's a production-grade data pipeline that handles the unglamorous work of downloading, parsing, and indexing internet-scale reconnaissance data. Rather than building yet another scanning tool, inetdata orchestrates the acquisition and normalization of existing data sources into a unified format optimized for fast lookups. It's designed for the reality of internet data collection: multi-hour downloads, heterogeneous formats, rate limits, and monthly data volumes measured in terabytes. The result is a queryable database of internet infrastructure that updates automatically and supports both historical and real-time analysis.

Technical Insight

At its core, inetdata is a two-phase pipeline: acquisition and normalization. The acquisition phase uses source-specific downloaders that understand the peculiarities of each data provider, such as Sonar's directory structure, Censys's API pagination, and Certificate Transparency's log servers. The normalization phase transforms everything into two output formats: CSV for human readability and MTBL (immutable sorted string tables) for efficient querying.

The MTBL format is the architectural centerpiece. Traditional databases struggle with datasets that grow by hundreds of gigabytes monthly. MTBL files are immutable, sorted key-value stores that support binary search without loading entire datasets into memory. This makes them ideal for the "lookup" pattern common in threat intelligence: given a domain, find all associated IPs; given an IP, find all domains that resolved to it. Here's how you'd query an MTBL file for reverse DNS lookups:

require 'mtbl'

# Open the MTBL database (memory-mapped, no full load)
reader = MTBL::Reader.new('sonar-rdns.mtbl')

# Lookup all domains for an IP
ip = '8.8.8.8'
reader.get(ip) do |key, values|
  # Values are sorted and deduplicated
  values.each do |domain|
    puts "#{ip} -> #{domain}"
  end
end

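# Or iterate through a range. Keys are compared as byte strings, so this
# range is lexicographic rather than numeric: '8.8.8.8' sorts after
# '8.8.8.255' and would be skipped. A prefix scan on '8.8.8.' covers a /24
# more reliably.
reader.get_range('8.8.8.0', '8.8.8.255') do |ip, domains|
  puts "Found #{domains.length} domains for #{ip}"
end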
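
The lookup pattern above depends on keys being stored once, in sorted byte order. The following is a minimal sketch of that idea in plain Ruby, not the mtbl gem's API: `SortedTable` is a hypothetical class, an in-memory array stands in for the on-disk file, and the `example.test` hostnames are made up. It shows how a binary search finds the first matching key, and why a string range over dotted-decimal IPs behaves differently from a prefix scan.

```ruby
# Illustrative stand-in for an immutable sorted key-value table.
# This is NOT the mtbl gem; names and methods here are assumptions.
class SortedTable
  def initialize(pairs)
    # Sort once by [key, value] in bytewise string order, then freeze.
    @entries = pairs.sort.freeze
  end

  # All values stored under an exact key.
  def get(key)
    scan(key) { |k| k == key }.map { |_, value| value }
  end

  # All [key, value] pairs whose key starts with the prefix.
  def get_prefix(prefix)
    scan(prefix) { |k| k.start_with?(prefix) }
  end

  # All [key, value] pairs with first_key <= key <= last_key (string order).
  def get_range(first_key, last_key)
    scan(first_key) { |k| k <= last_key }
  end

  private

  # Binary-search for the first key >= start, then read forward while the
  # caller's condition holds.
  def scan(start)
    first = @entries.bsearch_index { |k, _| k >= start }
    return [] if first.nil?

    @entries[first..].take_while { |k, _| yield k }
  end
end

table = SortedTable.new([
  ['8.8.8.8', 'dns.google'],
  ['8.8.8.3', 'a.example.test'],
  ['8.8.8.10', 'b.example.test'],
  ['8.8.4.4', 'dns.google']
])

p table.get('8.8.8.8')                                  # ["dns.google"]
p table.get_prefix('8.8.8.').map(&:first)               # ["8.8.8.10", "8.8.8.3", "8.8.8.8"]
p table.get_range('8.8.8.0', '8.8.8.255').map(&:first)  # ["8.8.8.10"]
```

The range query returns only one of the three addresses in the /24 because '8.8.8.3' and '8.8.8.8' compare greater than '8.8.8.255' as strings. Storing IPs in a fixed-width or packed form would make range queries numeric, but whether inetdata does so is not stated here.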

The configuration-driven approach means adding a new data source is mostly declarative. Each source has a JSON config specifying download URLs, credentials, file patterns, and normalization scripts. The sonar_fdns (forward DNS) source, for example, downloads compressed JSON files from Rapid7's Open Data API, decompresses them, and pipes records through an inetdata-parsers tool that outputs standardized CSV:

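require 'shellwords'

# Simplified example of the processing chain. download_files, csv_output, and
# mtbl_builder are illustrative helpers, and the parser name is illustrative
# too; the actual binaries ship with the inetdata-parsers project.
source_url = 'https://opendata.rapid7.com/sonar.fdns_v2/'
download_files(source_url, pattern: /\.json\.gz$/) do |file|
  # Decompress and normalize in a streaming pipeline; escape the path so
  # unusual file names cannot break the shell command
  cmd = "gunzip -c #{Shellwords.escape(file)} | inetdata-parsers-sonar-fdns-json"
  IO.popen(cmd) do |parsed|
    parsed.each_line do |line|
      # Line format: timestamp,domain,type,value
      # Split into at most 4 fields so commas inside the value survive
      timestamp, domain, record_type, value = line.chomp.split(',', 4)

      # Write to both CSV and MTBL builder
      csv_output << line
      mtbl_builder.add(domain, value) # Forward lookup
      mtbl_builder.add(value, domain) # Reverse lookup
    end
  end
end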
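
To make the "mostly declarative" claim concrete, here is a sketch of what loading one source definition might look like. The JSON field names (`name`, `base_url`, `file_pattern`, `parser`) and the `SourceConfig` and `load_source` helpers are assumptions for illustration, not inetdata's actual schema or code.

```ruby
require 'json'

# Hypothetical source definition; field names are illustrative only.
SOURCE_JSON = <<~'JSON'
  {
    "name": "sonar_fdns",
    "base_url": "https://opendata.rapid7.com/sonar.fdns_v2/",
    "file_pattern": "\\.json\\.gz$",
    "parser": "inetdata-parsers-sonar-fdns-json"
  }
JSON

SourceConfig = Struct.new(:name, :base_url, :file_pattern, :parser, keyword_init: true) do
  # True when a remote file name should be downloaded for this source.
  def matches?(filename)
    file_pattern.match?(filename)
  end
end

# Parse and validate one source definition, failing early on missing keys.
def load_source(json_text)
  raw = JSON.parse(json_text)
  missing = %w[name base_url file_pattern parser] - raw.keys
  raise ArgumentError, "missing keys: #{missing.join(', ')}" unless missing.empty?

  SourceConfig.new(
    name: raw['name'],
    base_url: raw['base_url'],
    file_pattern: Regexp.new(raw['file_pattern']),
    parser: raw['parser']
  )
end

source = load_source(SOURCE_JSON)
puts "#{source.name}: #{source.base_url}"
```

Validating the config at load time means a typo in a source definition surfaces before a multi-hour download starts rather than after it.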

Resource management is critical at this scale. The repository documentation explicitly calls for raising file descriptor limits to 20,000+ because processing involves opening hundreds of compressed files simultaneously for streaming. Memory pressure comes from MTBL building—the gem needs to sort entries before writing, which for Sonar's billion-record datasets means careful batching. The recommended 16GB+ RAM isn't a suggestion; it's a requirement for monthly full-runs.
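
A preflight check can catch a low file descriptor limit before a long run begins. The sketch below uses Ruby's standard `Process.getrlimit` and `Process.setrlimit`; the 20,000 threshold is the figure quoted above, and the helper names are hypothetical.

```ruby
# Threshold taken from the figure quoted above; adjust for your environment.
REQUIRED_NOFILE = 20_000

# Report the current soft and hard limits on open file descriptors.
def nofile_status(required = REQUIRED_NOFILE)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  { soft: soft, hard: hard, ok: soft >= required, raisable: hard >= required }
end

# Try to lift the soft limit to `required` without exceeding the hard limit.
# Returns true when the limit is sufficient afterwards, false otherwise.
def try_raise_nofile(required = REQUIRED_NOFILE)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  return true if soft >= required
  return false if hard < required

  Process.setrlimit(Process::RLIMIT_NOFILE, required, hard)
  true
rescue Errno::EPERM, Errno::EINVAL
  # Some platforms refuse the change even below the reported hard limit.
  false
end

status = nofile_status
puts "open files: soft=#{status[:soft]} hard=#{status[:hard]} ok=#{status[:ok]}"
```

When `try_raise_nofile` returns false, the limit has to be raised outside the process, for example through the shell's `ulimit -n` or the system's limits configuration.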

The architecture also handles operational realities like credential management. Many sources require API keys or premium accounts. Rather than hardcoding credentials, inetdata uses environment variables and config files in ~/.inetdata/, keeping secrets out of the codebase. Rate limiting is baked into downloaders—Certificate Transparency logs, for instance, will aggressively throttle bulk downloads without careful request spacing.
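
The two ideas in this paragraph, credentials kept outside the code and spaced-out requests, can be sketched as follows. The file name `credentials.json`, its key layout, and the `lookup_credential` and `RequestSpacer` names are assumptions for illustration; only the `~/.inetdata/` directory comes from the text above.

```ruby
require 'json'

# Look up a credential: environment variable first, then a JSON file.
# The file name and flat key layout are hypothetical.
def lookup_credential(name, env: ENV,
                      config_path: File.join(Dir.home, '.inetdata', 'credentials.json'))
  value = env[name]
  return value unless value.nil? || value.empty?
  return nil unless File.file?(config_path)

  JSON.parse(File.read(config_path))[name]
end

# Enforces a minimum interval between consecutive requests.
class RequestSpacer
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_at = nil
  end

  # Blocks until the interval since the previous call has elapsed.
  # Returns the number of seconds it waited.
  def wait
    now = @clock.call
    delay = @last_at ? [@last_at + @min_interval - now, 0].max : 0
    @sleeper.call(delay) if delay.positive?
    @last_at = now + delay
    delay
  end
end

spacer = RequestSpacer.new(0.01)
3.times { spacer.wait } # each call after the first waits up to 10 ms
```

Injecting the clock and sleeper keeps the spacing logic testable without real delays; a downloader would call `spacer.wait` before each request.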

What makes this production-ready is the separation of concerns. The main inetdata repository orchestrates workflow; the separate inetdata-parsers project (written in Go for performance) handles the CPU-intensive parsing. This means you can parallelize parsing across multiple cores while a single Ruby process manages downloads. For a full Sonar dataset, you might see 8+ parser processes running simultaneously, each consuming a different compressed file, all feeding into the same MTBL builder through IPC.
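
The fan-out described here, one coordinating Ruby process with several external parser processes feeding a single consumer, might look roughly like the sketch below. It is an assumption about the shape of the design, not inetdata's code: `parse_in_parallel` is a hypothetical helper, pipes stand in for the IPC, and the demo command simply upcases a file in place of a real parser.

```ruby
require 'open3'
require 'rbconfig'
require 'tmpdir'

# Run `command` (an argv array) once per input file, at most `workers` at a
# time, and yield every output line to the caller from a single thread.
def parse_in_parallel(files, command, workers: 8)
  jobs = Queue.new
  files.each { |file| jobs << file }
  jobs.close

  # Bounded queue so fast parsers cannot outrun the single consumer.
  lines = SizedQueue.new(1_000)

  threads = Array.new(workers) do
    Thread.new do
      while (file = jobs.pop)
        Open3.popen2(*command, file) do |stdin, stdout, _wait_thr|
          stdin.close
          stdout.each_line { |line| lines << line.chomp }
        end
      end
    end
  end

  # Close the output queue once every worker has finished (or failed).
  closer = Thread.new do
    begin
      threads.each(&:join)
    ensure
      lines.close
    end
  end

  while (line = lines.pop)
    yield line
  end
  closer.join
end

# Demo with a stand-in "parser": a Ruby one-liner that upcases its input file.
Dir.mktmpdir do |dir|
  files = %w[a b].map do |name|
    path = File.join(dir, "#{name}.txt")
    File.write(path, "#{name}.example.test\n")
    path
  end
  stand_in = [RbConfig.ruby, '-e', 'print File.read(ARGV[0]).upcase']
  parse_in_parallel(files, stand_in, workers: 2) { |line| puts line }
end
```

Output order across files is not deterministic, which is one reason the consumer in a real pipeline would need to sort or buffer records before writing a sorted table.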

Gotcha

The infrastructure requirements aren't just recommendations—they're hard barriers. Attempting to run inetdata on a typical developer laptop will result in out-of-memory errors during MTBL building or file descriptor exhaustion mid-download. You need dedicated infrastructure: 1TB+ storage monthly, 16GB+ RAM, and multiple cores. Cloud instances work, but at ~$200-500/month for suitable specs, this isn't a tool you spin up casually.

Some sources are deliberately excluded from automatic daily runs because they're simply too massive. Certificate Transparency logs and Censys datasets can take 12-24+ hours just to download, let alone process. This means you can't have a "set and forget" pipeline that captures everything—you'll need to manually trigger these sources and probably run them on weekends. The documentation also warns that some commercial sources like WhoisXMLAPI and PremiumDrops require paid subscriptions, and at internet scale, those costs add up quickly. There's also the inetdata-parsers dependency, which must be compiled separately and kept in your PATH—not complex, but an additional maintenance burden that can break if versions drift.
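
Because the parsers are a separate install, a quick check that the expected binaries are on the PATH can save a failed run. This is a small sketch with hypothetical helper names; the tool name in the usage line is the one from the processing-chain example and may not match the binaries your inetdata-parsers build provides.

```ruby
# Return the full path of an executable found on the PATH, or nil.
def find_in_path(name, path: ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Names from the list that cannot be found on the PATH.
def missing_tools(names, path: ENV.fetch('PATH', ''))
  names.reject { |name| find_in_path(name, path: path) }
end

missing = missing_tools(%w[gunzip inetdata-parsers-sonar-fdns-json])
warn "missing tools: #{missing.join(', ')}" unless missing.empty?
```

This only confirms that a binary exists; it does not detect the version drift mentioned above, which would need a per-tool version check.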

Verdict

Use if: You're building a comprehensive threat intelligence platform that needs regular, queryable access to internet-wide reconnaissance data; you have dedicated infrastructure (16GB+ RAM, 1TB+ monthly storage); you need to correlate multiple data sources (passive DNS + certificates + WHOIS); or you're conducting academic research requiring historical internet data at scale. This tool shines when you need "show me every domain that resolved to this IP in the last 90 days" answered in seconds, not hours.

Skip if: You only need current snapshots (just use Censys/Shodan APIs directly); you're working on a laptop or resource-constrained environment; you need data from a single source (use that provider's tools); or you want ad-hoc queries without managing infrastructure. For occasional lookups, services like SecurityTrails or PassiveTotal will be faster to start and cheaper overall. inetdata is for organizations serious about building their own internet intelligence capability, not for weekend projects.
