Back to Articles

Building a Private Breach Search Engine: How LeakDB Queries Terabytes of Credentials in Milliseconds

[ View on GitHub ]

Building a Private Breach Search Engine: How LeakDB Queries Terabytes of Credentials in Milliseconds

Hook

Searching through 10 billion leaked credentials in under 100 milliseconds sounds impossible on a single server. Yet LeakDB does exactly that using algorithms most developers learned in CS101 but never scaled to terabytes.

Context

When massive credential breaches occur—and they happen constantly—security teams face an uncomfortable dilemma. They need to know if their users' credentials are compromised, but sending employee emails to third-party services like Have I Been Pwned creates its own security concerns. Even with k-anonymity protections, some organizations cannot legally or ethically transmit user data externally, especially in healthcare, finance, or government sectors.

The alternative—downloading breach compilations and running grep against multi-terabyte text files—is laughably impractical. A single breach dump might contain billions of credentials across hundreds of gigabytes. You need something that can ingest these massive datasets once, index them intelligently, and then serve queries fast enough for real-time API usage. That's the problem LeakDB solves: providing infrastructure to run your own breach lookup service with query speeds comparable to commercial offerings, but with complete data sovereignty.

Technical Insight

LeakDB's architecture reveals how classic computer science fundamentals scale to modern big data problems when applied thoughtfully. The system has two distinct phases: the curator (indexing) and the server (querying). Understanding both illuminates why certain design decisions matter at scale.

The curator's job is to transform chaotic breach dumps into sorted, deduplicated indexes that support binary search. The challenge? These datasets are often larger than available RAM, and they contain massive duplication—the same email/password pair appears across dozens of breaches. LeakDB solves the first problem with external merge sort, a decades-old algorithm that processes data in chunks. Raw breach files are split into blocks that fit in memory, each sorted using parallel quicksort, then written to disk. A k-way merge reads from all these sorted chunks simultaneously, producing a final sorted index without ever loading everything into RAM.

The deduplication problem is more interesting. Naively, you'd need to compare every credential against every other—an O(n²) nightmare. Instead, LeakDB uses bloom filters, probabilistic data structures that answer "have we seen this before?" with no false negatives and controllable false positives. Here's how you might use it during ingestion:

// Initialize bloom filter sized for expected dataset
filter := bloom.NewWithEstimates(10000000000, 0.001) // 10B items, 0.1% false positive rate

// During ingestion
for credential := range breachData {
    credHash := hash(credential)
    
    if filter.TestAndAdd(credHash) {
        // Probably a duplicate, skip it
        stats.Duplicates++
        continue
    }
    
    // First time seeing this, add to index
    index.Add(credential)
}

This approach reduces index size dramatically. In real breach compilations, overlap can be 40-60% across datasets. The bloom filter catches most duplicates with minimal memory—even a filter for 10 billion items uses only about 2GB of RAM at 0.1% false positive rate.

Once indexed, the server performs binary search on the sorted credential files. This is where the sorting investment pays off. Binary search provides O(log n) lookup time, meaning searching 10 billion records takes only about 33 comparisons. LeakDB's indexes are sorted by email/username, so a query like "check if user@example.com is breached" becomes a straightforward binary tree walk:

func (idx *Index) Search(query string) ([]Credential, error) {
    file, err := os.Open(idx.FilePath)
    if err != nil {
        return nil, err
    }
    defer file.Close()
    
    // Binary search for first occurrence
    left, right := int64(0), idx.TotalRecords
    
    for left < right {
        mid := (left + right) / 2
        
        // Seek to record position (fixed-width records make this O(1))
        file.Seek(mid * idx.RecordSize, io.SeekStart)
        
        record := readRecord(file)
        
        if record.Email < query {
            left = mid + 1
        } else {
            right = mid
        }
    }
    
    // Collect all matching records (same email, different passwords)
    return collectMatches(file, left, query)
}

The key insight? Fixed-width records (or offset indexes) make seeking to arbitrary positions instant. You're not scanning sequentially; you're jumping directly to memory positions. This is why LeakDB achieves sub-100ms queries on datasets that would take grep hours to scan.

LeakDB also offers a serverless mode using Google BigQuery. Instead of maintaining local indexes, the curator uploads sorted data to BigQuery tables. Queries become SQL:

SELECT email, password, breach_name 
FROM credentials.leaks 
WHERE email = 'user@example.com'
LIMIT 1000

BigQuery's columnar storage and distributed query engine provide similar performance without managing infrastructure. The tradeoff? You're uploading breach data to Google's cloud, which defeats the data sovereignty purpose for many organizations. But for teams that already use GCP and trust Google's security model, it eliminates the operational burden of index maintenance and server scaling.

Gotcha

The elephant in the room: LeakDB doesn't provide any breach data. You need to source credential dumps yourself, which exists in a legally and ethically gray area. While possessing breach data for legitimate security research is generally legal, downloading from certain sources, redistributing data, or using it improperly can violate computer fraud laws or privacy regulations like GDPR. Organizations must have legal counsel review their use case before deploying LeakDB.

Performance also degrades predictably under certain conditions. The binary search assumes data fits on fast local storage—preferably NVMe SSDs. If your indexes live on network-attached storage or spinning disks, seek times balloon from microseconds to milliseconds, destroying query performance. A terabyte index might require 100+ seeks per query, turning sub-100ms lookups into multi-second slogs. Additionally, the curator phase is genuinely slow for initial indexing. Processing a terabyte of breach data can take 24-48 hours even on powerful hardware due to sorting and deduplication overhead. This isn't something you run on-demand; it's a batch job that requires planning. Finally, LeakDB only searches plaintext credentials. If you need to check password hashes (common for validating user-chosen passwords against breach databases), you'll need to supplement with Troy Hunt's Pwned Passwords API or maintain separate hash indexes.

Verdict

Use if: You're a security team at an organization with strict data sovereignty requirements and legal access to breach datasets. You need to provide internal breach lookup APIs for user notification, password policy enforcement, or security monitoring without sending data to external services. You have the infrastructure to maintain sorted indexes on fast local storage and the patience for initial curator runs. The operational control and query speed justify the setup complexity. Skip if: You don't already have legitimate access to breach compilations, can use Have I Been Pwned's API within your compliance requirements, or need hash-based password checking rather than plaintext credential searches. The legal ambiguity around obtaining breach data and the operational overhead of maintaining multi-terabyte indexes make commercial alternatives more practical for most teams. Also skip if your storage is slow—network-attached storage or HDDs will sabotage LeakDB's performance guarantees.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/data-knowledge/moloch-leakdb.svg)](https://starlog.is/api/badge-click/data-knowledge/moloch-leakdb)