ZAnnotate: Enriching Billions of IP Addresses Without Crushing Your Pipeline

Hook

When you’re processing the results of a full IPv4 scan (4.3 billion addresses), even a 100ms API call per IP, made sequentially, would take more than 13 years to complete. That bottleneck is why Internet measurement researchers built ZAnnotate.

Context

Internet-scale research generates ungodly amounts of raw IP addresses. Security researchers scanning for vulnerable hosts, academic projects measuring Internet adoption, and network operators analyzing traffic patterns all face the same problem: an IP address like 185.220.101.45 tells you almost nothing without context. Is it a Tor exit node in Germany? A misconfigured IoT device in Brazil? A datacenter in AWS?

Traditionally, enriching IP data meant choosing between slow API-based services (fine for hundreds of IPs, catastrophic for millions) or writing custom code to query MaxMind databases, parse BGP routing tables, perform reverse DNS lookups, and stitch everything together. The ZMap Project—the team behind the blazingly fast network scanner that can probe the entire IPv4 space in under an hour—built ZAnnotate to solve this exact bottleneck. It’s designed as the second stage in a measurement pipeline: ZMap finds the hosts, ZAnnotate tells you what they actually are.

Technical Insight

System architecture (auto-generated diagram): line-delimited IPs or JSON arrive on stdin, pass through the input parser, fan out to the annotation engines (GeoIP2 via MMDB, BGP routing via MRT files, reverse DNS lookups, and IPInfo.io databases), and exit through the JSON enricher to stdout as enriched JSON, ready to pipe to the next stage.

ZAnnotate’s architecture is deceptively simple: it’s a streaming processor that reads newline-delimited data from stdin, enriches each record with metadata from local databases or DNS queries, and writes JSON to stdout. This Unix-philosophy design means it composes naturally with other tools and never loads entire datasets into memory.

The tool supports multiple annotation modes that can be chained together. Here’s a practical example enriching scan results with both GeoIP and ASN data:

# Basic usage: pipe IPs, get JSON annotations
echo "8.8.8.8" | zannotate --annotator=geoip2 \
  --geoip2-database=/var/lib/GeoLite2-City.mmdb

# Output:
{"ip":"8.8.8.8","geoip2":{"city":"Mountain View","country":"US",
"location":{"latitude":37.386,"longitude":-122.0838},
"postal_code":"94035"}}

# Chain multiple annotators for comprehensive enrichment
cat scan_results.txt | \
  zannotate --annotator=geoip2 --geoip2-database=GeoLite2-City.mmdb | \
  zannotate --input-format=json --annotator=routing \
    --routing-mrt-file=routeviews.mrt | \
  zannotate --input-format=json --annotator=rdns 

The --input-format=json flag is crucial for chaining: instead of replacing input data, ZAnnotate injects a zannotate field into existing JSON objects. This means you can start with structured data from ZMap (which already includes ports, banners, etc.) and progressively enrich it without losing information.
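To make that concrete, here is what slicing fields out of a chained record might look like, assuming jq is installed. The sample record is illustrative (the nesting under a zannotate key follows the behavior described above, but exact field layout may vary by version):

```shell
# Illustrative enriched record: original ZMap fields (ip, port)
# plus annotations injected under a "zannotate" key
record='{"ip":"93.184.216.34","port":443,"zannotate":{"geoip2":{"country":"US"},"routing":{"asn":15133}}}'

# Extract just the IP, country, and origin ASN as tab-separated values
echo "$record" | jq -r '[.ip, .zannotate.geoip2.country, .zannotate.routing.asn] | @tsv'
```

Because each record stays a single line of JSON, this kind of jq post-processing slots anywhere into the pipeline without buffering the whole dataset.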

Under the hood, the GeoIP annotator uses MaxMind’s native MMDB format through their Go library, keeping the entire database memory-mapped for fast lookups. The routing annotator parses MRT (Multi-Threaded Routing Toolkit) files—binary dumps of BGP routing tables from RouteViews or RIPE RIS collectors. These files contain prefix-to-ASN mappings and AS path information:

# Download a BGP routing snapshot
wget http://archive.routeviews.org/route-views4/bgpdata/2024.01/RIBS/rib.20240115.0000.bz2
bzip2 -d rib.20240115.0000.bz2

# Annotate with AS origin information
cat targets.txt | zannotate --annotator=routing \
  --routing-mrt-file=rib.20240115.0000 \
  --routing-annotate-origin-as

# Result includes ASN, AS name, and prefix
{"ip":"185.220.101.45","routing":{"asn":16276,
"as_name":"OVH SAS","prefix":"185.220.101.0/24"}}

The reverse DNS annotator deserves special attention because it’s the only one that makes network requests. To avoid overwhelming DNS infrastructure, ZAnnotate implements configurable rate limiting and timeout controls. For massive datasets, you’d typically run this last and potentially in parallel batches:

# Rate-limited reverse DNS (10 queries/sec)
cat ips.txt | zannotate --annotator=rdns \
  --rdns-threads=10 --rdns-timeout=5s

One elegant design choice: ZAnnotate emits empty fields rather than errors when data is missing. If an IP has no reverse DNS entry, you get {"rdns":""} rather than a broken pipeline. This fault-tolerance is essential when processing messy real-world data where not every IP will have GeoIP records or AS assignments.
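That empty-field convention also makes post-filtering trivial. A sketch, assuming jq is installed and using the rdns shape shown above, that keeps only records which actually resolved:

```shell
# Two sample annotated records: one resolved, one with an empty rdns field
printf '%s\n' \
  '{"ip":"8.8.8.8","rdns":"dns.google"}' \
  '{"ip":"203.0.113.7","rdns":""}' |
jq -c 'select(.rdns != "")'
```

Unresolved IPs drop out silently, which is usually what you want when the next stage expects hostnames.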

The tool also supports IPInfo.io’s MMDB databases and custom metadata CSV files, making it extensible for organization-specific enrichment (like marking internal IP ranges or threat intelligence feeds).

Gotcha

ZAnnotate’s offline-first design is both its greatest strength and biggest operational burden. Those MaxMind and IPInfo databases? You need to download them manually, store them locally, and update them regularly—GeoIP data goes stale as IP blocks are reassigned, and BGP routing tables change constantly. The tool has zero built-in update mechanisms, so you’re responsible for building cron jobs or automation to keep data fresh. A six-month-old routing table will give you outdated ASN assignments, potentially skewing research results.
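Since there is no built-in updater, a freshness check is worth automating. A minimal sketch (hypothetical paths, arbitrary 30-day threshold) that could run from cron alongside MaxMind's geoipupdate utility, which handles the download side:

```shell
# Warn when a local database file is missing or older than a threshold.
# Paths and threshold are examples; adjust for your deployment.
check_freshness() {
  db="$1"; max_days="$2"
  if [ ! -f "$db" ]; then
    echo "MISSING: $db"
  elif [ -n "$(find "$db" -mtime +"$max_days" 2>/dev/null)" ]; then
    echo "STALE: $db (older than $max_days days)"
  else
    echo "OK: $db"
  fi
}

check_freshness /var/lib/GeoLite2-City.mmdb 30
```

Run this before each pipeline invocation and fail loudly on STALE output, so a forgotten update surfaces as a broken job rather than as quietly skewed ASN or geolocation data.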

Reverse DNS annotation, while powerful, can become a bottleneck. Even with rate limiting and parallelism, performing actual DNS queries for millions of IPs takes real time. If you’re processing a fresh ZMap scan of 50 million responsive hosts, the rDNS phase alone could run for hours or days. And since it requires network connectivity, you can’t do this step offline during a transatlantic flight like you can with local database lookups. There’s also no retry logic for transient DNS failures—if a query times out, you get an empty field and need to handle re-processing yourself. For truly massive datasets, you’ll want to architect around this by batching IPs, parallelizing across multiple machines, or accepting that reverse DNS coverage will be partial.
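The batching approach can be sketched with standard split and xargs plumbing. Here `cat` stands in for the real per-batch zannotate invocation (e.g. `zannotate --annotator=rdns < batch > batch.out`), and the input is placeholder lines rather than real IPs:

```shell
# Parallel batching pattern for the rDNS phase (sketch).
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/ips.txt"            # demo input: 1000 placeholder lines

# Split the input into batches of 250 lines each
split -l 250 "$workdir/ips.txt" "$workdir/batch_"

# Process up to 4 batches concurrently; swap `cat` for zannotate in practice
ls "$workdir"/batch_* | xargs -P 4 -I{} sh -c 'cat "$1" > "$1.out"' _ {}

# Merge per-batch results back into one stream
cat "$workdir"/batch_*.out > "$workdir/annotated.txt"
```

Each batch fails or succeeds independently, so a timed-out batch can be re-queued without reprocessing the other 49 million hosts.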

Verdict

Use ZAnnotate if you’re processing large-scale Internet measurement data (thousands to billions of IPs), need reproducible offline enrichment for research, or are building analysis pipelines that combine GeoIP, ASN, and routing metadata. It’s particularly valuable for security researchers analyzing scan results, network operators investigating traffic sources, and academics publishing Internet measurement studies that need to cite specific database versions. The streaming architecture makes it a natural fit alongside ZMap, Masscan, or custom scanners.

Skip it if you’re doing ad-hoc lookups of a few hundred IPs (just use MaxMind’s web interface or IPInfo’s API), need real-time data with automatic updates (look at API-based services instead), require Windows support (it’s primarily tested on Linux), or don’t want the operational overhead of managing local database files. For casual geolocation needs in web applications, stick with hosted APIs: ZAnnotate is infrastructure for researchers, not a drop-in library.
