Project Sonar: Mining Rapid7's Internet-Wide Scanning Datasets for Security Research

Hook

On a recurring schedule, Rapid7 scans the public IPv4 address space (roughly 3.7 billion routable addresses out of the 4.3 billion in the full 32-bit space) and publishes the results for free. Most security researchers don't know this dataset exists.

Context

Before Project Sonar launched in 2013, comprehensive internet-wide scanning data was either prohibitively expensive or simply unavailable to independent security researchers. Organizations like universities or small security firms couldn't afford the infrastructure to scan billions of hosts across hundreds of ports, nor did they have the bandwidth, legal resources, or distributed architecture required for such operations. This created a significant barrier to large-scale security research: understanding vulnerability prevalence, tracking the adoption of security protocols like TLS 1.3, or measuring the deployment of new services required either institutional backing or commercial data subscriptions.

Project Sonar emerged from Rapid7's recognition that democratizing internet measurement data would accelerate security research across the industry. Rather than keeping their scanning infrastructure's output proprietary, they began publishing comprehensive datasets covering forward and reverse DNS, SSL/TLS certificates, HTTP/HTTPS responses, and various application protocols. The project represents an ongoing commitment to open data, with regular scanning cycles producing fresh datasets that researchers can download and analyze without conducting their own internet-wide scans—an approach that sidesteps the ethical, legal, and technical complexity of running large-scale active reconnaissance.

Technical Insight


Project Sonar's value proposition centers on its datasets rather than scanning tools. The repository itself primarily contains documentation, metadata schemas, and usage examples rather than the scanning infrastructure code. Researchers access the data through AWS S3 buckets or direct downloads, with datasets organized by scan type and date. A typical Sonar dataset includes JSON or CSV files containing billions of records, each representing a single host's response to a specific probe.

The data structure varies by scan type, but SSL/TLS certificate datasets exemplify the depth available. Each certificate record includes not just the certificate chain, but also cipher suite negotiation results, protocol versions supported, and timing metadata. Here's a typical workflow for analyzing certificate data using Python; the field names (chain, issuer, not_after) and the file name are illustrative, so check them against the schema of the dataset you download:

import json
import gzip
from collections import Counter
from datetime import datetime

# Sonar data comes compressed; process line-by-line to manage memory
def analyze_certificate_authorities(sonar_file):
    ca_distribution = Counter()
    expired_certs = 0
    
    with gzip.open(sonar_file, 'rt') as f:
        for line in f:
            record = json.loads(line)
            
            # Use the issuer of the last certificate in the presented chain
            # as a proxy for the issuing CA
            chain = record.get('chain')
            if chain:
                issuer = chain[-1].get('issuer', {}).get('common_name', 'Unknown')
                ca_distribution[issuer] += 1
            
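            # Check validity period (assumes an ISO 8601 timestamp string)
            not_after = record.get('not_after')
            if not_after:
                # fromisoformat() rejects a trailing 'Z' before Python 3.11
                expires = datetime.fromisoformat(not_after.replace('Z', '+00:00'))
                # Compare like with like: now() is timezone-aware only when
                # the parsed timestamp is, which avoids a TypeError
                if expires < datetime.now(expires.tzinfo):
                    expired_certs += 1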
    
    return ca_distribution.most_common(10), expired_certs

# Process dataset
top_cas, expired = analyze_certificate_authorities('2024_01_ssl_certs.json.gz')
print("Top 10 Certificate Authorities:")
for ca, count in top_cas:
    print(f"  {ca}: {count:,} certificates")
print(f"\nExpired certificates found: {expired:,}")

The datasets enable research questions that would otherwise require months of scanning infrastructure development. For instance, measuring the impact of a newly disclosed SSL vulnerability requires identifying how many internet-facing hosts support the affected cipher suite. With Sonar data, you download the latest SSL scan results and filter by cipher suite—a few hours of processing instead of weeks of scanning. The historical depth is equally valuable: comparing datasets from 2020 and 2024 reveals TLS 1.3 adoption curves, certificate authority market share shifts, or the retirement timeline for deprecated protocols.
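
As a rough illustration of the longitudinal comparison described above, the sketch below computes each protocol version's share in two snapshots and the change between them. The `tls_version` field name, the helper names, and the sample records are assumptions made for the example; real Sonar records may name or nest this value differently, so check the dataset's schema first.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def protocol_share(records: Iterable[Mapping], field: str = "tls_version") -> Dict[str, float]:
    """Return the fraction of records reporting each protocol version."""
    counts = Counter(r.get(field, "unknown") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {version: n / total for version, n in counts.items()}


def share_delta(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Change in share (new minus old) for every version seen in either snapshot."""
    versions = sorted(set(old) | set(new))
    return {v: new.get(v, 0.0) - old.get(v, 0.0) for v in versions}


# Tiny in-memory stand-ins for two snapshots (made-up records, not Sonar data).
snapshot_2020 = [{"tls_version": "TLSv1.2"}] * 3 + [{"tls_version": "TLSv1.3"}]
snapshot_2024 = [{"tls_version": "TLSv1.2"}] + [{"tls_version": "TLSv1.3"}] * 3

delta = share_delta(protocol_share(snapshot_2020), protocol_share(snapshot_2024))
for version, change in delta.items():
    print(f"{version}: {change:+.0%}")
```

With real data, each snapshot would be a stream of parsed records from one scan file rather than an in-memory list; both helpers accept any iterable, so nothing needs to be loaded at once.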

One particularly powerful use case involves DNS reconnaissance. Sonar's forward and reverse DNS datasets contain billions of hostname-to-IP and IP-to-hostname mappings. Security researchers use these to identify infrastructure patterns, track hosting provider usage, or discover subdomains associated with specific organizations. The reverse DNS data, in particular, exposes hosting patterns that aren't visible through traditional DNS queries:

import gzip
import json
import re

def find_infrastructure_patterns(rdns_file, pattern):
    """
    Find hosting patterns in reverse DNS data
    Example: Identify all hosts belonging to a cloud provider's IP space
    """
    matches = []
    pattern_re = re.compile(pattern, re.IGNORECASE)
    
    with gzip.open(rdns_file, 'rt') as f:
        for line in f:
            record = json.loads(line)
            ip = record.get('ip')
            hostnames = record.get('names', [])
            
            for hostname in hostnames:
                if pattern_re.search(hostname):
                    matches.append({'ip': ip, 'hostname': hostname})
    
    return matches

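# Find hosts whose PTR record ends in amazonaws.com (AWS's default reverse DNS);
# the optional trailing dot covers fully qualified names
aws_hosts = find_infrastructure_patterns(
    '2024_01_rdns.json.gz',
    r'amazonaws\.com\.?$'
)
print(f"Found {len(aws_hosts):,} hosts with AWS default reverse DNS names")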
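
To turn a flat list of matches into something that looks like a hosting pattern, one option is to group the matched addresses by network block. The sketch below assumes a list of `{'ip', 'hostname'}` dictionaries shaped like the return value of `find_infrastructure_patterns` above; `count_by_network`, the /24 grouping, and the sample addresses (taken from documentation-reserved ranges) are choices made for the example, not anything Sonar prescribes.

```python
import ipaddress
from collections import Counter
from typing import Iterable, Mapping


def count_by_network(matches: Iterable[Mapping], prefix_len: int = 24) -> Counter:
    """Count matched hosts per enclosing network of the given prefix length."""
    counts: Counter = Counter()
    for match in matches:
        try:
            # strict=False lets us pass a host address and get its network back
            network = ipaddress.ip_network(f"{match['ip']}/{prefix_len}", strict=False)
        except (KeyError, ValueError):
            continue  # skip records with a missing or unparseable address
        counts[str(network)] += 1
    return counts


# Made-up matches using documentation-reserved address ranges.
sample_matches = [
    {"ip": "192.0.2.10", "hostname": "host-10.example.com"},
    {"ip": "192.0.2.200", "hostname": "host-200.example.com"},
    {"ip": "198.51.100.7", "hostname": "host-7.example.net"},
    {"ip": "not-an-ip", "hostname": "broken.example.org"},
]

top_networks = count_by_network(sample_matches).most_common(5)
for network, hosts in top_networks:
    print(f"{network}: {hosts} host(s)")
```

Dense blocks tend to point at a provider's or organization's allocated ranges, while isolated single-host blocks are more likely incidental matches.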

The data processing requirements are substantial. A single month's SSL certificate dataset might contain 100+ million records and occupy 50GB compressed. Researchers typically use distributed processing frameworks like Apache Spark for large-scale analysis, or stream-process data line-by-line as shown above to avoid loading entire datasets into memory. Rapid7 provides JSON Schema definitions for each dataset type, which is critical for parsing the nested structures reliably.
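
One way to keep memory flat while tolerating dirty input is a generator that yields one checked record at a time. This is a minimal sketch rather than Rapid7 tooling: `iter_records` and its `required` parameter are hypothetical names, the key check is a lightweight stand-in for full schema validation, and the demo writes a tiny made-up gzip file so the block runs on its own.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator, Sequence


def iter_records(path: str, required: Sequence[str] = ()) -> Iterator[dict]:
    """Yield one JSON object per line, skipping blank, malformed, or incomplete lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate truncated or corrupt lines
            if isinstance(record, dict) and all(key in record for key in required):
                yield record


# Demo on a small made-up file (not real Sonar data).
sample_lines = [
    '{"ip": "192.0.2.1", "names": ["a.example.com"]}',
    '{"ip": "192.0.2.2"}',  # missing "names", so it is skipped
    'not json at all',      # malformed, so it is skipped
    '',
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write("\n".join(sample_lines) + "\n")
    kept = list(iter_records(sample_path, required=("ip", "names")))

print(f"{len(kept)} usable record(s)")
```

Because the function is a generator, downstream code can filter or aggregate as it iterates, and only one line is held in memory at a time regardless of file size.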

What sets Sonar apart from real-time services like Shodan is the completeness and temporal consistency. Each scan represents a point-in-time snapshot of the entire IPv4 space, enabling true longitudinal studies. You're not sampling or querying—you're analyzing census data. This makes Sonar ideal for academic research, vulnerability impact assessment, and understanding internet-scale trends rather than real-time threat hunting.

Gotcha

The biggest limitation is that Project Sonar is purely a data distribution initiative. The GitHub repository won't help you build your own internet scanner—it's documentation for consuming existing datasets. If you need to scan custom ports, perform stateful probes beyond what Sonar covers, or require real-time data, you'll need different tools. The scanning infrastructure itself isn't open-sourced, which means you can't customize scan parameters or add new probe types. You get what Rapid7 decides to scan, on their schedule.

Dataset freshness varies significantly by scan type. Some services like HTTP are scanned monthly, while others have longer intervals or have been discontinued. There's no SLA or guarantee of update frequency, and historical datasets sometimes go offline as storage priorities shift. The raw data also requires substantial processing before it's useful—expect to write significant ETL code to filter, normalize, and enrich the datasets for your specific research questions. A single analysis might require downloading hundreds of gigabytes and processing for hours or days. Finally, the ethical implications of using this data deserve consideration: these are observations of production systems without explicit consent from operators, and researchers must handle findings responsibly, particularly when discovering vulnerabilities or misconfigurations affecting identifiable organizations.
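
The normalization step of such an ETL pass might look like the sketch below. It assumes records carry an `ip` string and a `names` list, as in the reverse DNS example earlier; `normalize_record` and the output layout are hypothetical, and the cleaning rules (drop non-global addresses, lowercase and de-duplicate hostnames) are one reasonable choice rather than a required procedure.

```python
import ipaddress
from typing import Optional


def normalize_record(raw: dict) -> Optional[dict]:
    """Return a cleaned copy of a record, or None if it should be dropped."""
    try:
        ip = ipaddress.ip_address(str(raw.get("ip", "")).strip())
    except ValueError:
        return None  # missing or unparseable address
    if not ip.is_global:
        return None  # private, loopback, and other reserved ranges

    names = set()
    for name in raw.get("names") or []:
        # Lowercase and drop the trailing dot so equivalent names compare equal
        cleaned = str(name).strip().rstrip(".").lower()
        if cleaned:
            names.add(cleaned)

    return {"ip": str(ip), "version": ip.version, "names": sorted(names)}


# Made-up inputs to show the behavior.
print(normalize_record({"ip": " 8.8.8.8 ", "names": ["DNS.Google.", "dns.google"]}))
print(normalize_record({"ip": "10.0.0.1", "names": ["internal.example"]}))
print(normalize_record({"ip": "garbage"}))
```

Returning None for rejected records keeps the function easy to chain: a caller can filter out the None values in a single comprehension before enrichment or storage.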

Verdict

Use Project Sonar if you're conducting security research requiring comprehensive internet measurement data, studying protocol adoption trends across the public internet, analyzing certificate ecosystem changes, measuring vulnerability prevalence post-disclosure, or performing academic research on internet infrastructure evolution. The datasets are invaluable for questions requiring complete coverage and historical depth that would be impossible to obtain otherwise.

Skip this if you need real-time threat intelligence, want to scan private networks or custom IP ranges, require interactive querying rather than bulk analysis, need scanning tools you can self-host and customize, or lack the storage and processing infrastructure to handle multi-terabyte datasets. This is a research data resource, not an operational scanning platform, so plan for significant data engineering work to extract insights.
