Hunting Secrets in 50 Billion Web Pages: Inside Troll-A's WARC Processing Engine

Hook

Common Crawl's March 2024 dataset contains 3.6 billion web pages. Processing all of them for leaked secrets would take a single-threaded tool approximately 8 years. Troll-A can do it in weeks.

Context

Web archives are a goldmine for security researchers. The Internet Archive and Common Crawl have collectively preserved petabytes of historical web data in WARC (Web ARChive) format—snapshots of websites, forums, pastebins, and documentation that may contain API keys, credentials, or tokens that were accidentally exposed and later removed. Traditional secret scanning tools like Gitleaks and TruffleHog excel at scanning Git repositories, but they're fundamentally mismatched for WARC files: they don't handle the format's structure, can't decode its various compression schemes, and weren't designed for the scale where a single dataset might contain billions of HTTP responses.

Troll-A emerged to fill this gap. It's a specialized Go-based CLI tool that treats WARC archives as first-class citizens: it understands their record structure, supports the compression formats found in public web archives (including ZStd with custom dictionaries, as used in megawarc files), and processes them with a concurrent architecture designed for long-running distributed jobs. Rather than reinventing secret detection, it leverages Gitleaks' battle-tested ruleset (166+ regex patterns for everything from AWS keys to Slack tokens) while adding the plumbing necessary to apply those rules at petabyte scale.

Technical Insight

The core of Troll-A's design is a layered concurrency model that decouples I/O from processing. At the bottom layer, it supports multiple input sources (HTTP/S URLs, S3 buckets, local filesystem paths, or STDIN), each with its own retry logic and timeout handling. The middle layer handles decompression, where Troll-A automatically detects and decodes GZip, BZip2, XZ, and ZStd formats, including the custom ZStd dictionaries used in megawarc files. The top layer distributes WARC records across configurable worker pools for parallel regex matching.
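To make the layering concrete, here is a minimal sketch of the general pattern: one goroutine feeds records into a channel while a fixed pool of workers runs the regex matching. It is not Troll-A's source; the `Record` type, the `scanAll` function, and the single `awsKey` rule are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"sync"
)

// Record is a hypothetical stand-in for one decoded WARC response record.
type Record struct {
	URL  string
	Body []byte
}

// awsKey is a single illustrative rule (AWS-style access key ID).
var awsKey = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

// scanAll fans records out to `jobs` workers and gathers every match.
// The feeder (I/O layer) and the workers (matching layer) only talk
// through channels, so slow I/O does not stall matching and vice versa.
func scanAll(records []Record, jobs int, pattern *regexp.Regexp) []string {
	in := make(chan Record)
	out := make(chan string)

	// I/O layer: a real tool would stream and decompress an archive here.
	go func() {
		defer close(in)
		for _, r := range records {
			in <- r
		}
	}()

	// Matching layer: a fixed-size worker pool.
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range in {
				for _, m := range pattern.FindAll(r.Body, -1) {
					out <- r.URL + ": " + string(m)
				}
			}
		}()
	}

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var findings []string
	for f := range out {
		findings = append(findings, f)
	}
	sort.Strings(findings) // worker order is nondeterministic; sort for stable output
	return findings
}

func main() {
	records := []Record{
		{URL: "https://example.com/a", Body: []byte("key=AKIAABCDEFGHIJKLMNOP")},
		{URL: "https://example.com/b", Body: []byte("nothing to see here")},
	}
	for _, f := range scanAll(records, 4, awsKey) {
		fmt.Println(f)
	}
}
```

The worker count plays the role of the `--jobs` flag: raising it helps only while matching, rather than download or decompression, is the bottleneck.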

Here's a typical invocation for scanning a Common Crawl segment with URL filtering:

troll-a \
  --input "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/segments/xxx.warc.gz" \
  --jobs 16 \
  --url-regex "(pastebin\.com|github\.com|gitlab\.com)" \
  --preset secret \
  --format json \
  --output findings.jsonl

The --url-regex flag is perhaps Troll-A's most critical performance feature. Instead of running expensive regex patterns against every record in a WARC file (which might include CSS files, images, or irrelevant pages), you filter before scanning. In practice, this reduces processing time by 80-90% when hunting for specific leak sources. The pattern is applied to the WARC record's target URL, not the content, so it's a single cheap match against a short string that happens before the heavyweight Gitleaks rules execute.
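The filter-then-scan order can be sketched as follows. This is an illustration of the idea under simplifying assumptions, not Troll-A's implementation: `Record`, `scan`, and the single `secretRule` are hypothetical, and only the URL pattern is taken from the invocation above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Record is a hypothetical stand-in for a WARC record: target URL plus payload.
type Record struct {
	TargetURL string
	Body      []byte
}

// urlFilter mirrors the pattern passed to --url-regex in the invocation above.
var urlFilter = regexp.MustCompile(`(pastebin\.com|github\.com|gitlab\.com)`)

// secretRule is one illustrative detection rule (AWS-style access key ID).
var secretRule = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

// scan returns matches only for records whose URL passes the filter.
// The URL check runs first, so filtered-out bodies are never scanned.
func scan(records []Record) (findings []string, skipped int) {
	for _, r := range records {
		if !urlFilter.MatchString(r.TargetURL) {
			skipped++
			continue
		}
		for _, m := range secretRule.FindAll(r.Body, -1) {
			findings = append(findings, string(m))
		}
	}
	return findings, skipped
}

func main() {
	records := []Record{
		{TargetURL: "https://pastebin.com/raw/abc", Body: []byte("AKIAABCDEFGHIJKLMNOP")},
		{TargetURL: "https://example.com/style.css", Body: []byte("body { color: red }")},
	}
	findings, skipped := scan(records)
	fmt.Println("findings:", findings, "skipped:", skipped)
}
```

The saving scales with the number of rules: each skipped record avoids one pass per rule over a body that may be hundreds of kilobytes, at the cost of one match against a URL.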

The real performance multiplier comes from the optional RE2 integration. Go's standard library regexp package is itself a non-backtracking engine with a linear-time guarantee, but its implementation is considerably slower than RE2, the C++ library from Google on which it is modeled, and the gap adds up when every record is matched against a large ruleset. When compiled with CGO_ENABLED=1 and go build -tags re2, Troll-A swaps engines:

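// Simplified version of the regex matching logic; types and helper are illustrative.
import "bytes"

// matcher is satisfied by *regexp.Regexp or a drop-in replacement engine.
type matcher interface {
    FindAllSubmatchIndex(b []byte, n int) [][]int
}

type Rule struct {
    ID      string
    Pattern matcher
}

type Finding struct {
    RuleID, Match string
    Line          int
}

type Scanner struct{}

func (s *Scanner) matchSecrets(content []byte, rules []Rule) []Finding {
    var findings []Finding
    for _, rule := range rules {
        // With -tags re2, this uses github.com/moovweb/go-re2
        // Without, falls back to stdlib regexp
        for _, match := range rule.Pattern.FindAllSubmatchIndex(content, -1) {
            findings = append(findings, Finding{
                RuleID: rule.ID,
                Match:  string(content[match[0]:match[1]]), // whole-match byte offsets
                Line:   countLines(content, match[0]),
            })
        }
    }
    return findings
}

// countLines returns the 1-based line number containing byte offset off.
func countLines(content []byte, off int) int {
    return bytes.Count(content[:off], []byte("\n")) + 1
}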

Making RE2 optional rather than mandatory is pragmatic but frustrating. Prebuilt binaries use stdlib regex for portability (no C dependencies) but run 5-10x slower. To unlock Troll-A's advertised 34,000 pages/second throughput, you must compile from source with libre2-dev installed. It's a classic convenience-versus-performance tradeoff.

Troll-A's preset system controls which Gitleaks rules activate. The secret preset (recommended) includes only high-confidence patterns like AWS keys and JWT tokens. The most preset adds medium-confidence rules, while all includes everything, dramatically increasing false positives. You can also supply custom Gitleaks TOML configurations:

[[rules]]
id = "custom-api-key"
description = "Internal API key pattern"
regex = '''api_key_([a-zA-Z0-9]{32})'''

[[rules.entropies]]
Min = "4.5"
Max = "8"
Group = "1"

The entropy check is particularly clever—it measures the randomness of captured groups to reduce false positives. A string like api_key_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa has low entropy and likely isn't a real secret, while api_key_9kF2mN8pQ1xR7vL4sT6wH3jD5zC0yB1e has high entropy and warrants investigation.
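As a rough illustration of what such a check computes, the sketch below assumes plain Shannon entropy over characters, measured on the 32-character portion after the api_key_ prefix. The `shannonEntropy` function is a hypothetical helper written for this article, not Gitleaks' or Troll-A's code.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// shannonEntropy returns the Shannon entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	const minEntropy = 4.5 // threshold from the TOML example above
	candidates := []string{
		strings.Repeat("a", 32),
		"9kF2mN8pQ1xR7vL4sT6wH3jD5zC0yB1e",
	}
	for _, s := range candidates {
		h := shannonEntropy(s)
		fmt.Printf("%s  entropy=%.4f  keep=%v\n", s, h, h >= minEntropy)
	}
}
```

The repeated-letter string scores 0 bits and is discarded, while the mixed string scores about 4.94 bits and clears the 4.5 threshold. A 32-character string can reach at most 5 bits per character (every character distinct), so a threshold of 4.5 only admits values with very little repetition.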

Gotcha

The prebuilt binary trap catches nearly everyone. You download the release, run it, and wonder why it's processing 3,000 pages per second instead of the advertised 34,000. The answer is buried in the README: you need RE2 support, which requires compiling from source with CGO enabled and libre2-dev installed. On Ubuntu, that's apt-get install libre2-dev, then CGO_ENABLED=1 go build -tags re2. Miss this step and you're running at one-tenth the performance, which matters enormously when a single Common Crawl segment might contain 100GB of compressed data.
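To see what that factor means in wall-clock terms, here is a back-of-the-envelope estimate using only figures quoted in this article (3.6 billion pages, 3,000 and 34,000 pages/second). It covers pure matching time and deliberately ignores download and decompression, which a real run adds on top; `scanHours` is a hypothetical helper.

```go
package main

import "fmt"

// scanHours estimates pure matching time in hours for a page count and a
// sustained throughput in pages per second. Download and decompression
// time are not included.
func scanHours(pages, pagesPerSecond float64) float64 {
	return pages / pagesPerSecond / 3600
}

func main() {
	const pages = 3.6e9 // page count quoted in the article's opening
	for _, rate := range []float64{3000, 34000} {
		h := scanHours(pages, rate)
		fmt.Printf("%6.0f pages/s: %.1f hours (%.1f days)\n", rate, h, h/24)
	}
}
```

Under these assumptions the prebuilt binary needs roughly 14 days of matching for one crawl, against a little over a day for the RE2 build.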

Even with optimal configuration, processing entire Common Crawl datasets is a marathon, not a sprint. A single monthly crawl contains roughly 400 segments of ~100GB each compressed. At 34,000 pages/second, the 3.6 billion pages cited above work out to roughly 30 hours of pure matching on one machine, and closer to two weeks at the 3,000 pages/second of a prebuilt binary; downloading and decompressing the archives comes on top of that, so plan for days to weeks on a single powerful machine and longer on modest hardware. Troll-A is designed for distributed deployment (spin up multiple instances processing different segments in parallel), but it doesn't provide the orchestration layer. You need to build that yourself with Kubernetes jobs, AWS Batch, or similar infrastructure. The tool also produces raw findings without validation; regex-based detection inherently generates false positives, especially with broader presets. Budget time for post-processing, deduplication, and manual review of results.
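The simplest form of that orchestration is static sharding: every instance receives the same list of archive paths plus its own index, and processes only its share. The sketch below shows one way to do that; the `shard` function and the file names are hypothetical, and nothing here is provided by Troll-A itself.

```go
package main

import "fmt"

// shard returns the paths that instance `index` (0-based) out of `total`
// instances should process, assigning paths round-robin by position.
func shard(paths []string, index, total int) []string {
	var mine []string
	for i, p := range paths {
		if i%total == index {
			mine = append(mine, p)
		}
	}
	return mine
}

func main() {
	// Placeholder names; a real job would read these from the crawl's path listing.
	paths := []string{
		"segment-000.warc.gz",
		"segment-001.warc.gz",
		"segment-002.warc.gz",
		"segment-003.warc.gz",
		"segment-004.warc.gz",
	}
	const total = 2
	for index := 0; index < total; index++ {
		fmt.Printf("instance %d: %v\n", index, shard(paths, index, total))
	}
}
```

Round-robin assignment needs no coordination between instances, which suits Kubernetes indexed jobs or AWS Batch array jobs where each task already knows its index. The tradeoff is that a failed instance leaves its whole share unprocessed, so a queue-based scheduler is the more robust choice for very long runs.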

Verdict

Use if: You're hunting for leaked credentials in web archives at scale (Common Crawl, Internet Archive), have infrastructure for long-running distributed jobs, and can invest 30 minutes compiling from source to unlock RE2 performance. It's purpose-built for this use case and nothing else comes close.

Skip if: You're scanning source code repositories (just use Gitleaks directly), need real-time or CI/CD secret scanning (this is for batch analysis), can't stomach compilation requirements for decent performance, or are working with small datasets where general-purpose tools suffice. This is a power tool for a specific job: embrace the specialization or look elsewhere.
