Building a Bulk Email Breach Checker with Python and Have I Been Pwned

Hook

New data breaches surface constantly, and the pool of exposed credentials keeps growing. If you're still checking email addresses one at a time through the Have I Been Pwned website, you're doing security audits at 1990s speed.

Context

Troy Hunt's Have I Been Pwned (HIBP) has become the de facto standard for checking whether email addresses or passwords have appeared in known data breaches. With over 12 billion compromised accounts indexed across 600+ breaches, it's an essential tool for security teams, compliance officers, and anyone conducting user account audits. But the HIBP web interface is designed for individual lookups—one email at a time, with manual CAPTCHA challenges for repeated queries.

This becomes a bottleneck when you're dealing with real-world scenarios: auditing employee accounts across your organization, checking customer databases during security reviews, or verifying user imports from acquired companies. You need automation, but the HIBP API has strict rate limits and requires careful implementation to avoid getting blocked. The houbbit/haveibeenpwned tool emerged to solve this specific problem: batch processing email breach checks while respecting API boundaries and providing immediate visual feedback through color-coded console output.

Technical Insight

[Figure: system architecture, auto-generated. Flow: CLI Input Handler (single email or file path) → Input Parser (email address) → API Request Wrapper (HTTP GET with API key) → Have I Been Pwned API. HTTP 429 goes to the Rate Limit Handler, which waits Retry-After seconds and retries. HTTP 200 with breach data or HTTP 404 (clean) goes to the Result Processor, which sends color-coded output to Console Output before the next email is read from the file.]

The architecture of houbbit/haveibeenpwned is deliberately minimal—a single Python script that wraps the HIBP API with intelligent rate limiting. The key design decision here is treating rate limits not as errors but as expected behavior that requires adaptive response. Unlike naive implementations that simply fail when hitting HTTP 429 responses, this tool implements a backoff-and-retry pattern that makes it production-ready for bulk operations.

The core logic centers around a simple but effective request wrapper. When you invoke the script with either a single email (python haveibeenpwned.py user@example.com) or a file containing multiple addresses (python haveibeenpwned.py emails.txt), it processes each request through the HIBP API with automatic pause detection:

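import sys
import time
from urllib.parse import quote

import requests
from colorama import Fore, Style

# API_KEY is read from the HIBP_API_KEY environment variable at startup.

def check_email(email, retries_left=5):
    """Return True if the address is pwned, False if clean; raise otherwise."""
    # URL-encode the address so characters such as '+' survive in the path.
    url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{quote(email)}"
    headers = {"hibp-api-key": API_KEY, "user-agent": "haveibeenpwned-checker"}

    # A timeout keeps one stalled connection from hanging the whole batch.
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 429:
        if retries_left <= 0:
            raise RuntimeError(f"Still rate limited while checking {email}")
        # Fall back to 2 seconds only when the header is missing.
        retry_after = int(response.headers.get('Retry-After', 2))
        print(f"Rate limited. Waiting {retry_after} seconds...")
        time.sleep(retry_after)
        return check_email(email, retries_left - 1)  # Recursive retry

    if response.status_code == 200:
        breaches = response.json()
        print(f"{Fore.RED}{email}: PWNED in {len(breaches)} breaches{Style.RESET_ALL}")
        return True
    elif response.status_code == 404:
        print(f"{Fore.GREEN}{email}: Clean{Style.RESET_ALL}")
        return False

    # Anything else (401 bad key, 5xx outage) is an error, not a result.
    response.raise_for_status()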

The recursive retry on HTTP 429 respects the Retry-After header provided by the API rather than using an arbitrary sleep interval, so you're working with the API's actual capacity signals instead of guessing at appropriate delays. The 2 in the code is only a fallback for a missing header. The real wait depends on the rate limit attached to your API key, and reading the header dynamically lets the script adapt to whatever delay the service asks for.
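The same idea can also be written as a bounded loop, which avoids deep recursion if the service keeps answering 429. The following is a minimal sketch rather than the tool's code: `parse_retry_after`, `fetch_with_retry`, and the `max_retries` parameter are hypothetical names, and the fake responses stand in for real HTTP calls so the example runs offline.

```python
import time
from types import SimpleNamespace


def parse_retry_after(headers, default=2.0):
    """Return the Retry-After delay in seconds, or `default` if absent or unparseable."""
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        # Retry-After may also be an HTTP date; this sketch does not parse that form.
        return default


def fetch_with_retry(fetch, max_retries=5, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(parse_retry_after(response.headers))
    raise RuntimeError(f"still rate limited after {max_retries} retries")


# Demo with a fake endpoint: two 429 responses, then a 404 ("clean").
_responses = iter([
    SimpleNamespace(status_code=429, headers={"Retry-After": "3"}),
    SimpleNamespace(status_code=429, headers={}),
    SimpleNamespace(status_code=404, headers={}),
])
waits = []
result = fetch_with_retry(lambda: next(_responses), sleep=waits.append)
print(result.status_code, waits)  # 404 [3.0, 2.0]
```

Passing `sleep` as a parameter is a testing convenience: the demo records the requested delays instead of actually waiting.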

The batch processing mode reads a file line-by-line, which keeps memory usage constant regardless of input size—you can process 10,000 emails with the same memory footprint as 10. The color-coding via the colorama library provides instant visual parsing: red for pwned accounts that need attention, green for clean addresses. This matters more than you'd think during live audits—when you're watching results stream by, pattern recognition through color is significantly faster than parsing text.
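Line-by-line reading can be sketched with a generator, which is what keeps memory flat: only one address is held at a time. This is an illustration under assumptions, not the tool's source. `iter_emails` is a hypothetical helper, and skipping blank lines and `#` comments is a choice made for this sketch.

```python
import io


def iter_emails(lines):
    """Yield cleaned addresses one at a time from any iterable of lines."""
    for line in lines:
        email = line.strip()
        # Skip blank lines and comment lines (an assumption of this sketch).
        if not email or email.startswith("#"):
            continue
        yield email


# An in-memory file stands in for emails.txt so the example runs anywhere.
sample = io.StringIO("alice@example.com\n\n# contractors\n  bob@example.com  \n")
emails = list(iter_emails(sample))
print(emails)  # ['alice@example.com', 'bob@example.com']

# With a real file, iterate the handle directly instead of building a list:
#     with open("emails.txt", encoding="utf-8") as handle:
#         for email in iter_emails(handle):
#             check_email(email)
```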

One subtle but important implementation detail: the tool requires an API key as an environment variable rather than hardcoding it or passing it as a command-line argument. This follows security best practices for credential management and makes it safer to use in shared environments or CI/CD pipelines where command-line arguments might be logged. The actual implementation checks for HIBP_API_KEY in the environment and fails fast if it's missing:

import os

API_KEY = os.environ.get('HIBP_API_KEY')
if not API_KEY:
    sys.exit("Error: HIBP_API_KEY environment variable not set")

The tradeoff here is simplicity versus features. There's no caching layer, no persistent storage of results, no detailed breach information beyond counts. These omissions are actually strategic—adding those features would complicate the codebase and introduce failure modes. For batch checking hundreds or thousands of emails, you want the tool to be robust and predictable, not feature-rich and fragile. You can always pipe the output to a file for persistence (python haveibeenpwned.py emails.txt > results.txt) and parse it later if needed.
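If you do redirect the output to a file, the color codes may be saved along with the text, depending on platform and on how colorama is initialized. A small parser can strip them and recover structured results. The sketch below assumes the two output formats shown earlier ("PWNED in N breaches" and "Clean"); `parse_result_line` is a hypothetical helper, not part of the tool.

```python
import re

# Matches ANSI color sequences such as "\x1b[31m" (red) and "\x1b[0m" (reset).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
LINE_RE = re.compile(
    r"^(?P<email>\S+): (?:PWNED in (?P<count>\d+) breaches|(?P<clean>Clean))$"
)


def parse_result_line(line):
    """Return (email, breach_count) for a result line, or None for any other line."""
    text = ANSI_RE.sub("", line).strip()
    match = LINE_RE.match(text)
    if match is None:
        return None  # e.g. "Rate limited. Waiting 2 seconds..."
    if match.group("clean"):
        return (match.group("email"), 0)
    return (match.group("email"), int(match.group("count")))


print(parse_result_line("\x1b[31malice@example.com: PWNED in 3 breaches\x1b[0m"))
print(parse_result_line("\x1b[32mbob@example.com: Clean\x1b[0m"))
print(parse_result_line("Rate limited. Waiting 2 seconds..."))
```

From here, writing the tuples out with the standard `csv` module gives a structured report without changing the tool itself.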

Gotcha

The most significant limitation is that this tool only checks email addresses against account breaches—it doesn't support password checking through HIBP's Pwned Passwords API, nor does it provide username lookups. If you're conducting a full security audit that includes verifying whether users are employing compromised passwords, you'll need additional tooling. This is a single-purpose instrument, not a security Swiss Army knife.

Rate limiting remains a real constraint even with intelligent handling. The HIBP API for email searches requires a paid API key (there is no free tier for account lookups), and the permitted request rate depends on the subscription level. For very large datasets, think tens of thousands of email addresses, you're looking at hours of runtime due to the enforced delays between requests. The tool doesn't implement concurrent requests or threading, so it is inherently sequential. That is a sensible choice because the rate limit is tied to the API key rather than to each connection, but it means you can't simply throw more compute resources at the problem to speed it up. Plan your batch operations accordingly, and don't expect to audit 50,000 accounts during your lunch break.

The tool also provides no progress indication for batch jobs, so you're left watching output stream by without knowing whether you're 10% or 90% complete, which is frustrating for large jobs.
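A quick back-of-the-envelope calculation makes the runtime concrete. The sketch below assumes a hypothetical limit of 10 requests per minute; substitute the rate that applies to your own key. `estimate_runtime` and `progress_line` are illustrative helpers, not features of the tool.

```python
from datetime import timedelta


def estimate_runtime(num_emails, requests_per_minute):
    """Lower bound on sequential runtime when every request counts against the limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return timedelta(minutes=num_emails / requests_per_minute)


def progress_line(done, total):
    """Format a simple progress marker for batch output."""
    percent = 100 * done / total if total else 100.0
    return f"[{done}/{total}] {percent:.1f}%"


# 50,000 addresses at an assumed 10 requests per minute.
eta = estimate_runtime(50_000, requests_per_minute=10)
print(eta)                          # 3 days, 11:20:00
print(progress_line(5000, 50_000))  # [5000/50000] 10.0%
```

Showing a percentage requires knowing the total up front, which means counting the lines in a first pass before starting the lookups.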

Verdict

Use if: You need to audit email addresses against breach databases in bulk, you're comfortable with command-line tools, and you value simplicity and reliability over feature richness. This tool excels for security teams doing periodic employee account audits, compliance checks before user imports, or one-time customer database reviews. Because it respects rate limits, it is safe to run unsupervised, and the minimal dependencies mean it works reliably across environments.

Skip if: You need comprehensive breach details, want to check passwords or usernames, require progress tracking for large batches, or need results stored in structured formats like JSON or CSV. Also skip if you're building this functionality into a larger application. In that case, use a dedicated client library (pwnedpasswords, for instance, targets the Pwned Passwords API) or integrate directly with the API for better control. For standalone email breach checking, though, this tool hits the sweet spot of doing one thing well without the complexity baggage of more ambitious projects.
