Harvesting Certificate Transparency Logs at Scale with Axeman

Hook

Over 2 billion SSL/TLS certificates are logged publicly every year, and most security teams have no idea which ones reference their domains. Axeman turns this transparency firehose into actionable CSV data.

Context

Certificate Transparency was introduced in 2013 to combat fraudulent SSL certificates after high-profile incidents in which certificate authorities mistakenly or maliciously issued certificates for domains they shouldn't have. Google went on to require CT logging for publicly trusted certificates in Chrome, which produced public, append-only logs that record the certificates issued by participating CAs. This transparency is powerful for security, but it created a new problem: how do you actually retrieve and analyze these massive datasets?

The CT log ecosystem consists of dozens of independent log servers operated by Google, Cloudflare, DigiCert, and others. Each log exposes an HTTP API for retrieving certificate entries, but downloading millions of certificates serially is prohibitively slow. Security researchers need this data for several purposes: discovering shadow IT, monitoring for phishing domains that target their brands, identifying certificate expiration risks, or conducting SSL/TLS security research. Axeman emerged as a specialized tool for the batch download problem, using Python's concurrency features to cut a job that would otherwise take days down to hours.

Technical Insight

Axeman's architecture revolves around three core components: a CT log discovery mechanism, a concurrent download engine, and a certificate parser that extracts structured data. The tool supports two operational modes—targeting a specific CT log or enumerating all known logs from the certificate-transparency.org registry.

The concurrency model is where Axeman shines. CT logs expose their data through sequential integer-indexed entries, which makes them embarrassingly parallel. Axeman divides the entry range into chunks and processes multiple chunks simultaneously using Python's multiprocessing module for CPU-bound parsing work and likely asyncio for I/O-bound HTTP requests. This hybrid approach maximizes throughput: async handles the network waiting, while multiprocessing spreads parsing across CPU cores.
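The chunking step is easy to sketch. What follows is a minimal illustration rather than Axeman's actual code: the function name `chunk_ranges` and the chunk size of 256 are assumptions made for the example. The one detail taken from the CT API itself is that get-entries treats both the start and end index as inclusive.

```python
from typing import Iterator, Tuple


def chunk_ranges(
    tree_size: int, chunk_size: int = 256, start: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) index pairs covering [start, tree_size - 1].

    chunk_ranges is a hypothetical helper, not part of Axeman.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for lo in range(start, tree_size, chunk_size):
        # get-entries bounds are inclusive, hence the -1 on the upper end.
        yield lo, min(lo + chunk_size, tree_size) - 1


chunks = list(chunk_ranges(tree_size=1000, chunk_size=256))
print(chunks)  # [(0, 255), (256, 511), (512, 767), (768, 999)]
```

Because each pair is independent of the others, the pairs can be handed to workers in any order. A log may return fewer entries than a chunk asks for, so a real downloader has to check how many entries came back and request the remainder.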

Here's how you'd use Axeman to download certificates from Google's Pilot CT log:

# Basic usage - download from a specific CT log
axeman --log https://ct.googleapis.com/pilot \
       --output certificates.csv \
       --concurrency 10

# Download from all known CT logs
axeman --all \
       --output-dir ./ct_data/ \
       --concurrency 20

The --concurrency flag controls how many parallel workers process chunks simultaneously. Setting this too high overwhelms CT log servers and triggers rate limiting; too low and you're leaving performance on the table. Most CT logs can handle 10-20 concurrent connections comfortably.

Under the hood, Axeman makes requests to the CT log's /ct/v1/get-entries endpoint, which takes inclusive start and end indices and returns a JSON batch of entries. Each entry carries a base64-encoded leaf_input (the logged certificate or precertificate) and a base64-encoded extra_data field (the rest of the chain). Each entry must be decoded, parsed using a library like cryptography or pyOpenSSL, and then have the relevant fields extracted. The parsing stage is CPU-intensive because X.509 certificate ASN.1 decoding involves significant computation.
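To make the decoding step concrete, here is a sketch of unpacking one leaf_input value using only the standard library. The byte layout follows the MerkleTreeLeaf structure in RFC 6962; the names `LeafEntry` and `parse_leaf_input` are made up for this example and are not Axeman's. The demo feeds it a synthetic leaf rather than a real log entry.

```python
import base64
import struct
from typing import NamedTuple


class LeafEntry(NamedTuple):
    timestamp_ms: int
    entry_type: int  # 0 = x509_entry, 1 = precert_entry
    der: bytes       # certificate DER (x509) or TBSCertificate DER (precert)


def parse_leaf_input(leaf_input_b64: str) -> LeafEntry:
    """Decode a base64 leaf_input into its timestamp, type, and DER payload."""
    raw = base64.b64decode(leaf_input_b64)
    # Header: version (1 byte), leaf_type (1), timestamp (8), entry_type (2).
    header = ">BBQH"
    version, leaf_type, timestamp_ms, entry_type = struct.unpack_from(header, raw, 0)
    offset = struct.calcsize(header)
    if version != 0 or leaf_type != 0:
        raise ValueError("unsupported MerkleTreeLeaf version or leaf type")
    if entry_type == 1:
        offset += 32  # precert entries carry a 32-byte issuer_key_hash first
    elif entry_type != 0:
        raise ValueError(f"unknown entry type {entry_type}")
    # The payload is prefixed by a 3-byte big-endian length.
    length = int.from_bytes(raw[offset:offset + 3], "big")
    offset += 3
    der = raw[offset:offset + length]
    if len(der) != length:
        raise ValueError("truncated leaf_input")
    return LeafEntry(timestamp_ms, entry_type, der)


# Synthetic x509 leaf: header + length-prefixed payload + empty extensions.
fake_der = b"\x30\x03\x02\x01\x00"
leaf = (
    struct.pack(">BBQH", 0, 0, 1700000000000, 0)
    + len(fake_der).to_bytes(3, "big")
    + fake_der
    + b"\x00\x00"
)
entry = parse_leaf_input(base64.b64encode(leaf).decode())
print(entry.entry_type, entry.timestamp_ms, len(entry.der))
```

For x509 entries, the resulting DER bytes can be passed to a parser such as `cryptography.x509.load_der_x509_certificate`. For precert entries the payload is only the TBSCertificate; the full precertificate is carried in extra_data.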

The CSV output format includes critical certificate fields: common name, subject alternative names (SANs), issuer, validity period, and fingerprints. This flat structure makes it trivial to load into pandas, SQLite, or even Excel for quick analysis:

import pandas as pd

# Load Axeman output and find certificates for your domain
df = pd.read_csv('certificates.csv')
# regex=False matches the literal string, so '.' is not treated as a wildcard
target_certs = df[df['dns_names'].str.contains('example.com', regex=False, na=False)]
print(f"Found {len(target_certs)} certificates mentioning example.com")

The multiprocessing approach does introduce some operational complexity. Each worker process maintains its own HTTP connection pool and certificate parser state. Axeman needs to coordinate these workers, aggregate results, and handle failures gracefully. If a single chunk fails to download due to network issues, the tool should retry without blocking other workers—a pattern that requires careful exception handling and potentially a work queue implementation.
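One way to get that retry behavior is a shared work queue where a failed chunk is put back with an incremented attempt counter. The sketch below uses asyncio and is an illustration of the pattern, not Axeman's implementation: `harvest`, its parameters, and the simulated `flaky_fetch` are all hypothetical, and no network calls are made.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, Tuple

Chunk = Tuple[int, int]


async def harvest(
    chunks: Iterable[Chunk],
    fetch: Callable[[Chunk], Awaitable[list]],
    workers: int = 4,
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Tuple[Dict[Chunk, list], List[Chunk]]:
    """Fetch every chunk, re-queueing failures up to max_attempts times."""
    queue: asyncio.Queue = asyncio.Queue()
    for chunk in chunks:
        queue.put_nowait((chunk, 1))
    results: Dict[Chunk, list] = {}
    failed: List[Chunk] = []

    async def worker() -> None:
        while True:
            chunk, attempt = await queue.get()
            try:
                results[chunk] = await fetch(chunk)
            except Exception:
                if attempt < max_attempts:
                    # Exponential backoff; only this worker waits.
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))
                    queue.put_nowait((chunk, attempt + 1))
                else:
                    failed.append(chunk)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # returns once every queued item is marked done
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results, failed


async def _demo() -> Tuple[Dict[Chunk, list], List[Chunk]]:
    calls: Dict[Chunk, int] = {}

    async def flaky_fetch(chunk: Chunk) -> list:
        # Simulated downloader: one chunk recovers, one never does.
        calls[chunk] = calls.get(chunk, 0) + 1
        if chunk == (256, 511) and calls[chunk] < 3:
            raise ConnectionError("simulated transient failure")
        if chunk == (512, 767):
            raise ConnectionError("simulated permanent failure")
        return [f"entry-{i}" for i in range(chunk[0], chunk[0] + 2)]

    return await harvest([(0, 255), (256, 511), (512, 767)], flaky_fetch)


results, failed = asyncio.run(_demo())
print(sorted(results), failed)
```

Chunks that exhaust their attempts are collected in `failed` rather than raised, so one bad range never stops the run and can be retried later.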

One clever design choice is writing output incrementally rather than buffering everything in memory. When processing logs containing tens of millions of certificates, holding all parsed data in RAM would require dozens of gigabytes. By streaming parsed certificates directly to CSV files, Axeman keeps memory usage bounded regardless of dataset size. This streaming architecture is critical for long-running harvesting jobs that might take hours or days.
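The streaming idea can be shown with a generator feeding `csv.DictWriter`: rows are written as they are produced, so only one record is held in memory at a time. This is a sketch under assumptions. The column names in `FIELDS` are hypothetical (only `dns_names` appears in the pandas example above), and `stream_to_csv` is not an Axeman function.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

# Hypothetical column layout for the example.
FIELDS = ["cert_index", "common_name", "dns_names", "not_before", "not_after"]


def stream_to_csv(
    records: Iterable[Dict[str, str]], out: TextIO, flush_every: int = 1000
) -> int:
    """Write records one at a time; return how many rows were written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
        if count % flush_every == 0:
            out.flush()  # push buffered rows to disk periodically
    out.flush()
    return count


def fake_parsed_certs(n: int) -> Iterator[Dict[str, str]]:
    """Stand-in for the parser stage: yields records lazily."""
    for i in range(n):
        yield {
            "cert_index": str(i),
            "common_name": f"host{i}.example.com",
            "dns_names": f"host{i}.example.com",
            "not_before": "2024-01-01T00:00:00",
            "not_after": "2024-04-01T00:00:00",
        }


buffer = io.StringIO()  # a real run would use open(path, "w", newline="")
written = stream_to_csv(fake_parsed_certs(2500), buffer)
print(written)
```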

Gotcha

The CSV-only output format becomes a serious limitation when working with CT data at scale. A single busy CT log can contain 50+ million certificates, resulting in multi-gigabyte CSV files that are slow to parse and impossible to query efficiently. You can't easily ask questions like "show me all certificates issued in the last 30 days for *.example.com" without loading the entire dataset and filtering in memory. For production use cases, you'll quickly find yourself writing additional tooling to import Axeman's CSV output into PostgreSQL, Elasticsearch, or another queryable data store.
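As a sketch of what that extra tooling might look like, the snippet below loads a CSV into SQLite in batches and then runs the kind of date-plus-domain query described above. The column names (`common_name`, `dns_names`, `not_before`) and the ISO 8601 date format are assumptions for the example, not a description of Axeman's actual output, and `load_csv_into_sqlite` is a hypothetical helper.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import TextIO


def load_csv_into_sqlite(
    conn: sqlite3.Connection, csv_file: TextIO, batch_size: int = 10_000
) -> int:
    """Insert CSV rows in batches so the whole file is never held in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS certs "
        "(common_name TEXT, dns_names TEXT, not_before TEXT)"
    )
    reader = csv.DictReader(csv_file)
    total = 0
    while True:
        batch = [
            (row["common_name"], row["dns_names"], row["not_before"])
            for row in islice(reader, batch_size)
        ]
        if not batch:
            break
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO certs VALUES (?, ?, ?)", batch)
        total += len(batch)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_not_before ON certs (not_before)")
    return total


sample = io.StringIO(
    "common_name,dns_names,not_before\n"
    "www.example.com,www.example.com example.com,2024-05-01T00:00:00\n"
    "old.example.com,old.example.com,2020-01-01T00:00:00\n"
    "other.test,other.test,2024-05-02T00:00:00\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_csv_into_sqlite(conn, sample)

# ISO 8601 timestamps sort lexicographically, so a string comparison works.
rows = conn.execute(
    "SELECT common_name FROM certs WHERE not_before >= ? AND dns_names LIKE ?",
    ("2024-04-15", "%.example.com%"),
).fetchall()
print(loaded, rows)
```

The index on `not_before` lets the date filter avoid a full table scan. The substring `LIKE` still scans matching rows, so a production schema would more likely store one row per DNS name.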

Another significant limitation is the lack of incremental update support. CT logs are append-only, constantly growing as new certificates are issued. Axeman appears designed for one-shot bulk downloads rather than keeping a local dataset synchronized with upstream logs. If you run Axeman today and again next week, you'll re-download millions of certificates you already have. There's no built-in mechanism to track the highest entry index you've previously fetched and only retrieve newer entries. For continuous monitoring scenarios, this creates wasted bandwidth and processing time. You'd need to build your own bookmarking system on top of Axeman or switch to a streaming-based tool.
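A bookmarking layer of this kind can be quite small. The sketch below keeps a JSON file that maps each log URL to the next entry index to fetch. All names here (`load_checkpoint`, `save_checkpoint`, `next_range`, the file name, the placeholder log URL) are hypothetical, and the tree sizes are hard-coded stand-ins for the value a log reports from its /ct/v1/get-sth endpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple


def load_checkpoint(path: Path) -> Dict[str, int]:
    """Return {log_url: next_index_to_fetch}, or an empty dict on first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def save_checkpoint(path: Path, state: Dict[str, int]) -> None:
    """Write to a temp file, then swap it in so a crash cannot corrupt it."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True))
    os.replace(tmp, path)


def next_range(
    state: Dict[str, int], log_url: str, tree_size: int
) -> Optional[Tuple[int, int]]:
    """Inclusive range of entries not yet fetched, or None if up to date."""
    start = state.get(log_url, 0)
    if start >= tree_size:
        return None
    return start, tree_size - 1


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = Path(tmpdir) / "checkpoint.json"
    log = "https://ct.example.invalid/log"  # placeholder URL

    state = load_checkpoint(ckpt)
    first = next_range(state, log, tree_size=1000)
    state[log] = first[1] + 1  # record progress after a successful download
    save_checkpoint(ckpt, state)

    # A later run sees a larger tree and fetches only the new tail.
    second = next_range(load_checkpoint(ckpt), log, tree_size=1200)
    third = next_range(load_checkpoint(ckpt), log, tree_size=1000)

print(first, second, third)
```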

Performance tuning also requires some trial and error. The optimal concurrency level varies dramatically based on your network bandwidth, CPU capabilities, and which CT log you're querying. Google's logs can handle higher concurrency than smaller operators. There's no auto-tuning—you need to experiment and monitor for HTTP 429 rate limit responses or timeouts. Additionally, some CT logs have been deprecated or are no longer accepting new entries, but Axeman doesn't automatically filter these out when using the --all flag, leading to wasted effort querying dead logs.
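Since Axeman offers no auto-tuning, one option for a wrapper script is an additive-increase, multiplicative-decrease rule: raise the limit slowly while requests succeed and halve it when the log pushes back. The `ConcurrencyTuner` class below is a hypothetical sketch of that rule and is not part of Axeman; the thresholds are arbitrary.

```python
class ConcurrencyTuner:
    """Additive-increase / multiplicative-decrease concurrency limit."""

    def __init__(self, initial: int = 10, minimum: int = 1, maximum: int = 50):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self._successes = 0

    def record(self, status_code: int) -> int:
        """Update the limit from one response status and return the new limit."""
        if status_code == 429 or status_code >= 500:
            # The log is pushing back: halve the limit, never below minimum.
            self.limit = max(self.minimum, self.limit // 2)
            self._successes = 0
        elif 200 <= status_code < 300:
            self._successes += 1
            # After a full window of successes, probe one step higher.
            if self._successes >= self.limit:
                self.limit = min(self.maximum, self.limit + 1)
                self._successes = 0
        return self.limit


tuner = ConcurrencyTuner(initial=8, minimum=1, maximum=10)
for _ in range(8):
    tuner.record(200)
after_successes = tuner.limit      # 9
after_429 = tuner.record(429)      # 4
print(after_successes, after_429)
```

A caller would read `tuner.limit` before scheduling each batch. Timeouts can be fed in the same way as a 429 if they should also trigger a back-off.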

Verdict

Use Axeman if you need to perform bulk certificate harvesting for security research, one-time domain enumeration projects, or building an initial dataset for SSL/TLS analysis where CSV output meets your needs. It's ideal when you're running periodic snapshots (weekly or monthly) and can tolerate full re-downloads, or when you're targeting specific CT logs for focused investigation. The tool excels at turning a complex distributed system into a simple CSV file you can grep, load into pandas, or import into your analysis pipeline.

Skip Axeman if you need real-time or near-real-time CT monitoring; use CertStream instead for streaming updates. Also skip it if you're building a production certificate tracking system that requires a database backend, complex queries, or incremental updates. In those cases, you'll end up rewriting half the tool anyway to add the missing pieces. Finally, skip it if you're working in environments that require sophisticated error handling, observability, or integration with existing data pipelines. Axeman is a focused batch download tool, not an enterprise data integration platform.
