Project Sonar: Mining Rapid7's Internet-Scale Security Datasets Without Building Your Own Crawler


Hook

Every month, Rapid7 scans nearly 4 billion IPv4 addresses across multiple protocols and publishes the raw data for free. Most security researchers still don’t know this goldmine exists.

Context

Conducting internet-wide security research used to require either massive infrastructure investment or reliance on expensive commercial platforms like Shodan. A single comprehensive scan of the IPv4 space across multiple protocols could cost tens of thousands of dollars in cloud compute and bandwidth, not to mention the legal and ethical complexities of scanning networks you don’t own. Academic researchers and independent security analysts were largely locked out of large-scale internet measurement studies.

Project Sonar emerged from Rapid7’s security research initiatives as a public good: regular, methodical scans of the entire IPv4 address space across protocols like HTTP, HTTPS, DNS, SSL/TLS, SSH, and more. Rather than gatekeeping this data behind paywalls, Rapid7 publishes the raw datasets openly, updated monthly or quarterly depending on the protocol. This democratizes internet-scale security research, enabling anyone to study vulnerability distributions, track certificate deployments, analyze DNS configurations, or measure the adoption of security protocols without scanning a single IP address themselves.

Technical Insight

System architecture (auto-generated diagram): distributed scanners send protocol probes (HTTP, HTTPS, DNS, SSL/TLS) across the IPv4 address space; raw responses are collected, anonymized, and validated; the processed data is published by protocol and scan date as compressed JSONL datasets in a public repository for security researchers.

Project Sonar operates on a scan-collect-publish pipeline that produces protocol-specific datasets, each with distinct schemas and update frequencies. The data is distributed as compressed JSON files, typically organized by scan date and protocol type. Understanding how to efficiently work with these massive datasets is key to extracting value.

The most commonly used datasets include forward DNS (A/AAAA record enumeration), reverse DNS (PTR records), SSL/TLS certificates, HTTP/HTTPS responses, and various UDP protocol responses. Each dataset follows a line-delimited JSON format (JSONL), where each line represents a single scan result. For example, an SSL certificate scan result might look like this:

(Pretty-printed here for readability; in the dataset each record occupies a single line.)

{
  "ip": "93.184.216.34",
  "port": 443,
  "timestamp": "2024-01-15T08:23:17Z",
  "data": {
    "tls": {
      "version": "TLSv1.3",
      "cipher_suite": "TLS_AES_256_GCM_SHA384",
      "certificate": {
        "subject": {"common_name": "www.example.org"},
        "issuer": {"common_name": "DigiCert TLS RSA SHA256 2020 CA1"},
        "validity": {
          "not_before": "2023-09-01T00:00:00Z",
          "not_after": "2024-09-30T23:59:59Z"
        },
        "serial_number": "0A3B4C5D6E7F8A9B0C1D2E3F4A5B6C7D",
        "signature_algorithm": "sha256WithRSAEncryption",
        "subject_alternative_names": ["example.org", "www.example.org"]
      }
    }
  }
}

Processing these datasets efficiently requires streaming parsers rather than loading entire files into memory. A typical Sonar dataset can be 50-200GB compressed, expanding to several hundred gigabytes uncompressed. Here’s a Python example for analyzing SSL certificate expirations:

import gzip
import json
from collections import Counter
from datetime import datetime, timezone

def analyze_cert_expiration(sonar_file):
    expiring_soon = 0
    expired = 0
    key_algorithms = Counter()
    now = datetime.now(timezone.utc)

    with gzip.open(sonar_file, 'rt') as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines

            cert = record.get('data', {}).get('tls', {}).get('certificate', {})
            if not cert:
                continue

            # Track key algorithms even when the validity dates are missing
            key_algorithms[cert.get('public_key_algorithm', 'unknown')] += 1

            # Parse expiration; map the trailing 'Z' to a UTC offset so
            # fromisoformat() yields a timezone-aware datetime
            not_after_raw = cert.get('validity', {}).get('not_after')
            if not not_after_raw:
                continue
            try:
                not_after = datetime.fromisoformat(
                    not_after_raw.replace('Z', '+00:00')
                )
            except ValueError:
                continue

            days_until_expiry = (not_after - now).days
            if days_until_expiry < 0:
                expired += 1
            elif days_until_expiry < 30:
                expiring_soon += 1

    return {
        'expired': expired,
        'expiring_soon': expiring_soon,
        'key_algorithms': dict(key_algorithms.most_common(10))
    }

The data access pattern is deliberately simple: datasets are hosted on Amazon S3 and accessible via direct HTTPS downloads. Rapid7 publishes manifests listing available datasets with metadata about scan dates, file sizes, and checksums. This allows you to build automated pipelines that fetch new datasets as they’re published and perform differential analysis against previous scans.
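A fetch step for such a pipeline might look like the sketch below. Note that the manifest URL and the `files`/`name`/`url`/`fingerprint` field names are illustrative assumptions of mine, not Rapid7’s actual schema; consult the published manifests for the real layout before relying on this.

```python
import hashlib
import json
import urllib.request

# Placeholder URL -- substitute the real manifest location for the
# protocol and scan date you care about.
MANIFEST_URL = "https://example.invalid/sonar/ssl/manifest.json"

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 so multi-gigabyte downloads never sit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fetch_new_datasets(seen_fingerprints, download_dir="."):
    """Yield local paths for manifest entries not processed in a previous run.

    Assumes a manifest of the form {"files": [{"name": ..., "url": ...,
    "fingerprint": <sha1 hex>}, ...]} -- an assumption, not Rapid7's schema.
    """
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)

    for entry in manifest.get("files", []):
        if entry["fingerprint"] in seen_fingerprints:
            continue  # already fetched and analyzed
        path = f"{download_dir}/{entry['name']}"
        urllib.request.urlretrieve(entry["url"], path)
        # Verify against the published checksum before trusting the file
        if sha1_of_file(path) != entry["fingerprint"]:
            raise ValueError(f"checksum mismatch for {entry['name']}")
        seen_fingerprints.add(entry["fingerprint"])
        yield path
```

Persisting `seen_fingerprints` between runs (a small JSON file is enough) is what turns this into an incremental pipeline rather than a repeated full download.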

One powerful but underutilized capability is cross-protocol correlation. Because Sonar scans the same IP space across multiple protocols, you can identify patterns like servers running both vulnerable SSH versions and outdated TLS configurations. This multi-dimensional view reveals clusters of poorly maintained infrastructure that single-protocol scans would miss.
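A minimal sketch of that correlation follows: collect the IPs with dated SSH banners from one dataset, then stream the TLS dataset and intersect. The `data.banner` and `data.tls.version` paths, and the specific banner/version strings, are illustrative assumptions; check the schema of the datasets you actually download.

```python
import gzip
import json

def correlate_weak_hosts(ssh_file, tls_file):
    """Return IPs that show both an old OpenSSH banner and a legacy TLS version.

    Field paths and version strings are assumptions for illustration,
    not a documented Sonar schema.
    """
    # Pass 1: IPs advertising older OpenSSH releases
    weak_ssh = set()
    with gzip.open(ssh_file, "rt") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            banner = rec.get("data", {}).get("banner", "")
            if "OpenSSH_6." in banner or "OpenSSH_7." in banner:
                weak_ssh.add(rec.get("ip"))

    # Pass 2: of those, which also negotiate legacy TLS/SSL versions
    weak_both = set()
    with gzip.open(tls_file, "rt") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            version = rec.get("data", {}).get("tls", {}).get("version", "")
            if rec.get("ip") in weak_ssh and version in ("SSLv3", "TLSv1.0", "TLSv1.1"):
                weak_both.add(rec["ip"])
    return weak_both
```

Holding only a set of IP strings in memory keeps the first pass cheap; the second pass stays fully streaming.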

The historical archive is equally valuable for longitudinal studies. You can track how quickly a vulnerability gets patched across the internet by comparing scan results before and after disclosure. For instance, analyzing SSL/TLS datasets from late 2014 through 2015 would show the global response to the POODLE vulnerability and SSLv3 deprecation. This temporal dimension transforms static snapshots into dynamic stories about internet security evolution.
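The before/after comparison can be sketched as a set difference over two scans: hosts that matched a vulnerability predicate in the earlier scan but no longer match in the later one are counted as remediated. The predicate shown in the docstring (checking `data.tls.version`) is an assumed field path, not a documented schema.

```python
import gzip
import json

def vulnerable_ips(sonar_file, predicate):
    """Collect the set of IPs in one scan whose record matches `predicate`."""
    ips = set()
    with gzip.open(sonar_file, "rt") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue
            if predicate(rec):
                ips.add(rec.get("ip"))
    return ips

def remediation_rate(before_file, after_file, predicate):
    """Fraction of hosts matching `predicate` in the earlier scan that no
    longer match in the later one.

    Example predicate (field path is an assumption):
        lambda rec: rec.get("data", {}).get("tls", {}).get("version") == "SSLv3"
    """
    before = vulnerable_ips(before_file, predicate)
    after = vulnerable_ips(after_file, predicate)
    if not before:
        return 0.0
    return len(before - after) / len(before)
```

One caveat worth building in for real studies: hosts that simply went offline between scans also disappear from the later set, so a production version should distinguish "patched" from "no longer responding".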

Gotcha

The most significant limitation is that Project Sonar is a data source, not a platform. There’s no query interface, no API, no real-time updates. You download gigantic files and process them yourself. If you need to quickly answer ‘which IPs are running service X right now,’ Shodan’s search interface will get you results in seconds, while Sonar requires downloading and parsing potentially hundreds of gigabytes of data. The data freshness also varies significantly by protocol—some datasets update monthly, others quarterly, making the information potentially weeks or months out of date.

The repository itself is surprisingly minimal, essentially serving as a landing page pointing to external documentation rather than a collection of tools or scanning infrastructure. Don’t expect reusable scanning code or data processing utilities. You’ll need to build your own parsers, analysis pipelines, and storage solutions. The bandwidth and storage costs can also add up quickly if you’re regularly downloading full datasets rather than working with diffs. Additionally, certain types of analysis require contextual data that Sonar doesn’t provide—geolocation, ASN ownership, service banners beyond basic protocol responses—meaning you’ll need to enrich the data with other sources for comprehensive analysis.

Verdict

Use if: You’re conducting security research requiring internet-scale data (vulnerability prevalence, protocol adoption, certificate ecosystem studies), building threat intelligence pipelines that need historical baselines, performing academic research where reproducibility and authoritative datasets matter, or you want to avoid the legal and ethical complexity of scanning networks yourself. The cost savings alone make it worthwhile for any analysis that would otherwise require commercial platforms.

Skip if: You need real-time or near-real-time data for incident response, want an interactive search interface rather than batch processing, require custom scan configurations beyond standard protocol probes, or lack the storage and compute resources to process multi-hundred-gigabyte datasets. For quick one-off queries or production security monitoring, commercial alternatives like Shodan or Censys provide better user experiences despite the cost.
