ZAnnotate: Building a Multi-Source IP Intelligence Pipeline for Internet-Scale Research
Hook
When the ZMap team scans the entire IPv4 address space in 45 minutes, they probe roughly 4 billion addresses and are left with a list of responsive hosts and zero context. ZAnnotate transforms those raw IPs into actionable intelligence by orchestrating a dozen metadata sources in a single pass.
Context
Security researchers and internet measurement scientists face a recurring problem: IP addresses are meaningless without context. Is 203.0.113.42 a cloud provider in Singapore, a residential ISP in Brazil, or a known malicious actor? Answering these questions manually is impossible at scale.
Traditionally, enriching IP datasets meant writing custom scripts that query MaxMind for geolocation, dig for reverse DNS, whois for registration data, and various APIs for threat intelligence. Each data source has its own format, authentication mechanism, and rate limits. Processing millions of IPs requires orchestrating these heterogeneous sources while handling failures, respecting API quotas, and merging results into a coherent output. The ZMap Project, which built the eponymous network scanner used by researchers worldwide, created ZAnnotate to solve this exact workflow bottleneck: a composable pipeline that treats metadata enrichment as a first-class problem in internet measurement.
Technical Insight
ZAnnotate's architecture centers on a plugin-based annotator system in which each metadata source implements a common interface. The core pipeline reads input line by line, dispatches concurrent queries to the enabled annotators, and merges the results into JSON output. Input parsing, annotation logic, and output formatting remain independent components.
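To make the pattern concrete, here is a minimal Python sketch of a common annotator interface feeding a concurrent, line-oriented pipeline. It is an illustration only, not ZAnnotate's code (the real tool is written in Go); the names `Annotator`, `StaticAnnotator`, `annotate_ip`, and `run_pipeline` are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Protocol


class Annotator(Protocol):
    """Hypothetical common interface that every metadata source implements."""
    name: str

    def annotate(self, ip: str) -> dict: ...


class StaticAnnotator:
    """Toy annotator that returns the same metadata for every IP."""

    def __init__(self, name: str, value: dict):
        self.name = name
        self.value = value

    def annotate(self, ip: str) -> dict:
        return dict(self.value)


def annotate_ip(ip: str, annotators: List[Annotator]) -> dict:
    """Run every enabled annotator on one IP and nest each result under its name."""
    record = {"ip": ip}
    for annotator in annotators:
        record[annotator.name] = annotator.annotate(ip)
    return record


def run_pipeline(lines: Iterable[str], annotators: List[Annotator],
                 threads: int = 4) -> Iterator[str]:
    """Read IPs line by line, annotate them concurrently, yield JSON lines in input order."""
    ips = [line.strip() for line in lines if line.strip()]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for record in pool.map(lambda ip: annotate_ip(ip, annotators), ips):
            yield json.dumps(record)
```

Because the pipeline only sees the interface, adding a new source in this sketch means writing one more class with a `name` and an `annotate` method.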
A basic invocation combines local databases with live lookups. Here's enriching a list of IPs with geolocation and ASN data:
cat ip_list.txt | zannotate \
  --annotator=geoip2-city \
  --geoip2-city-database=/path/to/GeoLite2-City.mmdb \
  --annotator=routing \
  --routing-mrt-file=/path/to/latest-bview.gz \
  --output-file=enriched.json
This produces JSON records with nested annotations:
{
  "ip": "8.8.8.8",
  "geoip2_city": {
    "city": "Mountain View",
    "country": "United States",
    "latitude": 37.386,
    "longitude": -122.0838
  },
  "routing": {
    "asn": 15169,
    "as_name": "GOOGLE",
    "prefix": "8.8.8.0/24"
  }
}
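Records shaped like this are easy to post-process. The sketch below tallies records per origin ASN; it assumes the output file holds one JSON object per line and uses the `routing` and `asn` field names from the sample above. The helper name `count_by_asn` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_asn(json_lines: Iterable[str]) -> Counter:
    """Count records per ASN; records without routing data are counted under None."""
    counts: Counter = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        routing = record.get("routing") or {}
        counts[routing.get("asn")] += 1
    return counts
```

In practice you would pass an open file handle, for example `count_by_asn(open("enriched.json"))`, and inspect `most_common()` on the result.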
The real power emerges when processing existing structured data. ZAnnotate can inject annotations into JSON or CSV files without destroying original fields. If you have scan results from ZGrab with protocol metadata, you can enrich them in place:
cat zgrab_results.json | zannotate \
  --input-format=json \
  --input-field=ip \
  --annotator=geoip2-city \
  --geoip2-city-database=/path/to/GeoLite2-City.mmdb \
  --annotator=rdns \
  --threads=10
The --input-field=ip directive tells ZAnnotate where to find IP addresses in your JSON objects, and annotations are merged into the existing structure. The --threads parameter controls global concurrency, but individual annotators support fine-grained tuning.
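The merge itself is simple to picture. A minimal sketch, assuming annotations are added as new top-level keys named after each annotator; the function `enrich_line` and its collision check are illustrative choices, not ZAnnotate's documented behavior.

```python
import json
from typing import Callable, Dict


def enrich_line(line: str, ip_field: str,
                annotators: Dict[str, Callable[[str], dict]]) -> str:
    """Parse one JSON line, add annotations, and keep every original field."""
    record = json.loads(line)
    ip = record[ip_field]  # the field named by the input-field option
    for name, annotate in annotators.items():
        if name in record:
            # Refuse to overwrite a field the scan tool already produced.
            raise KeyError(f"annotation name collides with existing field: {name}")
        record[name] = annotate(ip)
    return json.dumps(record)
```

The original `port`, `data`, or other protocol fields pass through untouched; only new keys are added.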
Concurrency management is where the design earns its keep. Different data sources have vastly different performance characteristics: MaxMind lookups read a local database file with no network round trip (thousands per second), while Censys API calls are rate-limited to one concurrent request on free tiers. ZAnnotate handles this with per-annotator thread pools:
cat large_dataset.txt | zannotate \
  --annotator=geoip2-city \
  --geoip2-city-database=/path/to/GeoLite2-City.mmdb \
  --annotator=censys \
  --censys-api-id="$API_ID" \
  --censys-api-secret="$API_SECRET" \
  --censys-threads=1 \
  --annotator=rdns \
  --rdns-threads=100 \
  --threads=200
Here, global concurrency is 200 (processing 200 IPs simultaneously), but Censys operations are throttled to 1 thread while reverse DNS—which is fast and doesn't enforce strict rate limits—can run 100 parallel queries. This prevents API quota exhaustion while maximizing throughput for unrestricted sources.
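The idea of a global worker count combined with per-source caps can be sketched with semaphores. This is a simplified illustration under my own assumptions, not how ZAnnotate implements its thread pools; `LimitedAnnotator` and `annotate_all` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class LimitedAnnotator:
    """Wrap an annotator so that at most `limit` calls run at the same time."""

    def __init__(self, fn: Callable[[str], dict], limit: int):
        self.fn = fn
        self._sem = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest concurrency observed, useful for checking the cap

    def __call__(self, ip: str) -> dict:
        with self._sem:  # blocks while `limit` calls are already in flight
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return self.fn(ip)
            finally:
                with self._lock:
                    self.active -= 1


def annotate_all(ips: List[str], annotators: Dict[str, LimitedAnnotator],
                 threads: int) -> List[dict]:
    """Process up to `threads` IPs at once; each annotator enforces its own cap."""

    def work(ip: str) -> dict:
        return {"ip": ip, **{name: a(ip) for name, a in annotators.items()}}

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(work, ips))
```

One limitation of this sketch: a worker waiting on the throttled annotator also delays the fast annotators for that same IP. Separate queues per annotator would avoid that at the cost of more code.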
The routing annotator deserves special attention. It ingests MRT (Multi-Threaded Routing Toolkit) files—snapshots of BGP routing tables from RouteViews or RIPE RIS—to provide authoritative ASN and prefix information without external API calls. This is crucial for reproducible research: instead of querying Team Cymru's real-time service (which returns current routing state), you can use historical MRT files to annotate datasets with the routing topology that existed when the data was collected. For a dataset from 2022, you'd download the corresponding MRT snapshot:
wget http://archive.routeviews.org/route-views.wide/bgpdata/2022.01/RIBS/rib.20220101.0000.bz2
bunzip2 rib.20220101.0000.bz2
zannotate --annotator=routing --routing-mrt-file=rib.20220101.0000 < ips.txt
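Picking the snapshot that matches a collection date can be scripted. A minimal sketch, assuming the archive keeps the path layout of the URL above; the helper name `routeviews_rib_url` is hypothetical, and the code does not check that a snapshot exists for the requested timestamp.

```python
from datetime import datetime


def routeviews_rib_url(ts: datetime, collector: str = "route-views.wide") -> str:
    """Build a RouteViews RIB URL following the layout of the example URL.

    Assumption: <collector>/bgpdata/YYYY.MM/RIBS/rib.YYYYMMDD.HHMM.bz2
    """
    return (
        f"http://archive.routeviews.org/{collector}/bgpdata/"
        f"{ts:%Y.%m}/RIBS/rib.{ts:%Y%m%d.%H%M}.bz2"
    )
```

Calling `routeviews_rib_url(datetime(2022, 1, 1, 0, 0))` reproduces the URL used in the `wget` command. Collectors publish snapshots only at fixed times of day, so round the dataset's timestamp to a published snapshot before downloading.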
The WHOIS and RDAP annotators showcase the tool's pragmatism about data freshness versus scale. WHOIS lookups are synchronous, slow, and often rate-limited by regional registries. For small datasets where registration details matter, they work:
echo "1.1.1.1" | zannotate --annotator=whois
But for 100,000 IPs, WHOIS is impractical. Here, the architecture's modularity shines: disable slow annotators and rely on cached/bulk sources like MaxMind or pre-downloaded threat feeds. The design doesn't force a one-size-fits-all approach.
Gotcha
ZAnnotate's biggest friction point is setup overhead. Unlike SaaS tools where you paste an IP and get instant results, ZAnnotate requires provisioning data sources. MaxMind GeoIP2 databases need manual download (free GeoLite2 requires account signup; commercial GeoIP2 costs hundreds of dollars annually). MRT routing tables are 2-4GB compressed files from RouteViews mirrors. API-based annotators need registration and key management. The documentation lists requirements, but you'll spend 30-60 minutes configuring a full pipeline.
Performance degrades dramatically when mixing local and API-based sources at scale. Annotating 1 million IPs with only local databases (GeoIP, MRT routing) completes in minutes. Adding Censys API lookups—even with paid tier limits—balloons this to hours or days. The tool has no built-in caching layer, so re-annotating the same IPs requires re-querying APIs. For iterative research workflows, you'll want to wrap ZAnnotate in scripts that checkpoint progress and cache results externally. The lack of resume functionality means a failure at record 900,000 of 1,000,000 forces reprocessing from scratch unless you've built your own pipeline safeguards.
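A resume wrapper does not need much code. The sketch below works out which IPs still need processing after an interrupted run, assuming one JSON object per output line with the address under an `ip` key; `completed_ips` and `remaining_ips` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, Set


def completed_ips(output_lines: Iterable[str], ip_field: str = "ip") -> Set[str]:
    """Collect IPs that already have a complete JSON record in the output."""
    done: Set[str] = set()
    for line in output_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a crash can leave a truncated last line; treat it as not done
        if isinstance(record, dict) and ip_field in record:
            done.add(record[ip_field])
    return done


def remaining_ips(input_lines: Iterable[str], done: Set[str]) -> Iterator[str]:
    """Yield input IPs that have no finished record yet, preserving input order."""
    for line in input_lines:
        ip = line.strip()
        if ip and ip not in done:
            yield ip
```

Write the remaining IPs to a new input file, run ZAnnotate on that file, and append its output to the earlier results. The same `done` set can double as a crude cache key when re-annotating overlapping datasets.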
API error handling is transparent but not sophisticated. If Censys returns a 429 rate limit error, ZAnnotate logs it and continues, leaving that annotation empty. There's no exponential backoff or automatic retry logic. For production pipelines processing valuable datasets, you'll need wrapper scripts to detect partial failures and reprocess gaps. The tool assumes you're running batch jobs where some missing annotations are acceptable, not mission-critical ETL where every record must be complete.
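Two small helpers cover most of that wrapper logic: one finds records whose annotation came back empty, the other retries a failing call with exponential backoff. Both are sketches under assumptions: an "empty" annotation is taken to mean a missing key, `null`, or an empty object, and the names `ips_missing_annotation` and `retry_with_backoff` are hypothetical.

```python
import json
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def ips_missing_annotation(json_lines: Iterable[str], annotator: str,
                           ip_field: str = "ip") -> Iterator[str]:
    """Yield IPs whose record has no usable result for the given annotator."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not record.get(annotator):  # missing key, null, or empty object
            yield record[ip_field]


def retry_with_backoff(fn: Callable[[], T], attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise ValueError("attempts must be at least 1")
```

The gap list can be fed back through ZAnnotate with only the failed annotator enabled, ideally after the rate-limit window has passed.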
Verdict
Use if: You're processing thousands to millions of IP addresses for security research, threat intelligence, or internet measurement studies, and need to combine multiple metadata sources (GeoIP + ASN + reverse DNS + threat feeds) in a reproducible pipeline. It's particularly valuable if you're already using ZMap/ZGrab for scanning and have budget for commercial databases or API access. The concurrent annotator architecture and MRT routing support make it ideal for academic research requiring historical accuracy and citation-grade reproducibility.
Skip if: You're doing exploratory analysis with fewer than 1,000 IPs; web UIs like IPInfo or Shodan are faster for ad-hoc queries. Avoid it for real-time streaming enrichment since there's no state management or caching. If you need only a single annotation source, simpler tools (MaxMind's geoip2 CLI, ipinfo command-line tool) have less overhead. Also skip if you can't invest in setup: downloading multi-gigabyte databases and wrangling API keys is non-negotiable. For teams wanting managed enrichment pipelines with SLAs, commercial solutions like Recorded Future or Anomali are better fits.