Catching Phishers with Certificate Transparency: Inside phishing_catcher's Real-Time Detection Engine

Hook

Every hour, thousands of SSL certificates are issued for domains that will never serve legitimate traffic—they're phishing infrastructure spinning up before anyone knows to block them. What if you could spot them the moment they request a certificate?

Context

Certificate Transparency (CT) logging became a practical requirement for publicly trusted SSL/TLS certificates in 2018, when Chrome began enforcing it for newly issued certificates. That created an unintended side effect: a near-real-time broadcast of every domain getting ready to serve HTTPS traffic. CT logs exist so that mis-issued or fraudulent certificates can be detected, but they have also become a goldmine for threat intelligence. Attackers building phishing infrastructure follow predictable patterns: they register domains like "paypa1-secure-login.com" or "apple-account-verify.net" and immediately request certificates to appear legitimate to victims.

Before tools like phishing_catcher, security teams relied on passive DNS databases, honeypots, and user reports—all reactive approaches that only caught phishing sites after victims were compromised. The innovation here is simple but powerful: watch the public CT logs via CertStream's WebSocket feed, score each domain against suspicious patterns, and alert on anything that smells like phishing. It's proactive threat hunting that requires no active scanning, no infrastructure beyond a Python script, and catches attackers at the moment they're setting up shop.

Technical Insight

The architecture of phishing_catcher is refreshingly straightforward: it opens a WebSocket connection to CertStream, receives a JSON stream of certificate issuances, extracts the domain names, and runs each one through a scoring engine whose rules live in YAML files. The core logic is in score_domain(), which checks each domain against keyword lists and sums the associated point values.

Here's how the scoring works in practice. The suspicious.yaml file defines patterns like this:

keywords:
  'login': 25
  'verify': 20
  'account': 15
  'secure': 10
  'update': 10
  'banking': 25
  'paypal': 60
  'apple': 50

suspicious_tlds:
  '.tk': 20
  '.ml': 20
  '.ga': 20

When a certificate for "apple-id-verify-secure.tk" appears, the scorer adds 50 (apple) + 20 (verify) + 10 (secure) + 20 (.tk TLD) = 100 points, which clears the 90-point bar for a "Suspicious" alert. Matching is plain substring matching with Python's in operator: a keyword counts if it appears anywhere in the domain, after hyphens and dots have been normalized.
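
To make the arithmetic concrete, here is a minimal sketch of keyword-plus-TLD scoring. It hard-codes the sample rules above instead of loading YAML, and sketch_score is a hypothetical name, not the tool's score_domain(). It also skips the hyphen/dot normalization and every other heuristic, so treat it as an illustration of the idea only.

```python
# Sample rules copied from the YAML excerpt above.
KEYWORDS = {
    'login': 25, 'verify': 20, 'account': 15, 'secure': 10,
    'update': 10, 'banking': 25, 'paypal': 60, 'apple': 50,
}
SUSPICIOUS_TLDS = {'.tk': 20, '.ml': 20, '.ga': 20}


def sketch_score(domain):
    """Hypothetical helper: sum TLD and keyword points for one domain."""
    domain = domain.lower()
    score = 0
    # TLD points apply only when the domain ends with the listed suffix.
    for tld, points in SUSPICIOUS_TLDS.items():
        if domain.endswith(tld):
            score += points
    # Keyword points apply when the keyword appears anywhere in the name.
    for keyword, points in KEYWORDS.items():
        if keyword in domain:
            score += points
    return score


print(sketch_score("apple-id-verify-secure.tk"))  # 100
print(sketch_score("paypa1-secure-login.com"))    # 35
```

The second call shows the limit of exact substring matching: "paypa1" with a digit one does not contain "paypal", so only "secure" and "login" contribute.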

The real power lies in the dual-file system. While suspicious.yaml contains generic phishing indicators, external.yaml lets you add organization-specific targets without modifying core detection logic:

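import yaml

# Generic indicators ship with the tool; external.yaml holds your own additions.
with open('suspicious.yaml', 'r') as f:
    suspicious = yaml.safe_load(f)
with open('external.yaml', 'r') as f:
    external = yaml.safe_load(f)

def keyword_score(domain):  # illustrative wrapper name
    # Add the points of every keyword, from either file, found in the domain.
    score = 0
    for rules in (suspicious, external):
        for keyword, points in rules['keywords'].items():
            if keyword in domain:
                score += points
    return score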

This separation means you can track phishing campaigns targeting your organization's name ("companyname-sso") or brand variations without polluting the upstream-maintainable suspicious patterns.
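
One subtlety worth sketching: the snippet above adds points from both files, so a keyword that appears in both is counted twice. Merging the two keyword tables first, with your own entries taking precedence, avoids that. The sketch below uses plain dicts in place of the parsed YAML; merge_keywords, score_with and the point values for "companyname-sso" are made up for illustration, and merging is one possible design choice, not a claim about how the tool behaves.

```python
def merge_keywords(base, extra):
    """Return a new dict where entries in `extra` override those in `base`."""
    merged = dict(base)
    merged.update(extra)
    return merged


def score_with(keywords, domain):
    """Sum the points of every keyword found as a substring of the domain."""
    return sum(points for keyword, points in keywords.items() if keyword in domain)


# Stand-ins for suspicious['keywords'] and external['keywords'].
suspicious_keywords = {'login': 25, 'verify': 20}
external_keywords = {'companyname-sso': 80, 'login': 30}  # hypothetical values

merged = merge_keywords(suspicious_keywords, external_keywords)
domain = "companyname-sso-login.com"

# Scoring each table separately counts 'login' twice (25 + 30).
separate = score_with(suspicious_keywords, domain) + score_with(external_keywords, domain)
combined = score_with(merged, domain)
print(separate, combined)  # 135 110
```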

The CertStream integration is minimal but effective. The tool uses the certstream Python library to handle WebSocket reconnection logic and certificate parsing:

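import certstream

# Public CertStream endpoint (swap in your own server's URL if you host one).
CERTSTREAM_URL = 'wss://certstream.calidog.io/'

def callback(message, context):
    # Ignore heartbeats and anything else that is not a new certificate.
    if message['message_type'] == "certificate_update":
        all_domains = message['data']['leaf_cert']['all_domains']
        for domain in all_domains:
            score = score_domain(domain)  # the scoring function described above
            if score >= 65:  # lowest ("Potential") threshold
                print(f"[{score}] {domain}")

# The library expects the feed URL alongside the callback.
certstream.listen_for_events(callback, url=CERTSTREAM_URL)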
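
If you want to exercise the callback logic without a live connection, you can feed it a hand-written message. The sketch below assumes only the fields the callback reads (message_type and data.leaf_cert.all_domains); real messages carry many more. extract_domains is a hypothetical helper, and stripping the "*." prefix from wildcard entries is a convenience added here so that a wildcard and its base name are scored once.

```python
def extract_domains(message):
    """Hypothetical helper: unique, lowercased names from one message."""
    if message.get("message_type") != "certificate_update":
        return []  # heartbeats and other message types carry no domains
    names = message["data"]["leaf_cert"]["all_domains"]
    unique = []
    for name in names:
        name = name.lower()
        if name.startswith("*."):
            name = name[2:]  # treat *.example.com as example.com
        if name not in unique:
            unique.append(name)
    return unique


# Hand-written sample, trimmed to the fields used above.
sample = {
    "message_type": "certificate_update",
    "data": {"leaf_cert": {"all_domains": [
        "*.example.com", "example.com", "Login.example.com",
    ]}},
}
heartbeat = {"message_type": "heartbeat"}

print(extract_domains(sample))     # ['example.com', 'login.example.com']
print(extract_domains(heartbeat))  # []
```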

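Beyond keywords, phishing_catcher applies further heuristics that are easy to conflate. Confusables handling maps lookalike Unicode characters back to their plain equivalents, so a domain that mixes character sets ("paypαl.com" with a Greek alpha) can still match the "paypal" keyword. Separately, the entropy() function computes the Shannon entropy of the domain string, on the premise that the random-looking names common among automatically generated phishing domains are rarer among legitimate ones. Domains with an unusual number of hyphens or dots are penalized as well.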
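
For reference, Shannon entropy over a string's characters takes only a few lines. This is the textbook formula, not necessarily how the tool's entropy() is written, and shannon_entropy is a name chosen for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits of entropy per character, based on character frequencies."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # Sum p * log2(1/p) over each distinct character.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("aaaa"))  # 0.0
print(shannon_entropy("abcd"))  # 2.0
# A random-looking label scores higher than a repetitive one.
print(shannon_entropy("x7k2q9vz") > shannon_entropy("aabbaabb"))  # True
```

A string made of one repeated character has zero entropy, while a string of all-distinct characters reaches the maximum for its length, which is why machine-generated labels tend to land higher.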

The threshold system (65/80/90 for Potential/Likely/Suspicious) provides triage levels. In practice, scores of 80 and above deserve immediate attention, scores from 65 to 79 need filtering through additional context, and anything below 65 is mostly noise. The color-coded terminal output uses ANSI escape codes to make high-priority alerts stand out during monitoring sessions.
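
The tiers translate into a small lookup. A sketch using the three cut-offs quoted above; triage and THRESHOLDS are hypothetical names, and the color handling is left out.

```python
# (minimum score, label), highest tier first.
THRESHOLDS = [(90, "Suspicious"), (80, "Likely"), (65, "Potential")]


def triage(score):
    """Return the highest tier the score reaches, or None if below all tiers."""
    for minimum, label in THRESHOLDS:
        if score >= minimum:
            return label
    return None


for value in (100, 85, 65, 64):
    print(value, triage(value))
```

Keeping the cut-offs in one ordered list means a change to a threshold touches a single line.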

Gotcha

The false positive problem is real and will test your patience. Run this tool for an hour and you'll see alerts for "secure-checkout.shopify.com" subdomains, "login.microsoft-partners.xyz" from legitimate partner programs, and every startup with "bank" in their name getting penalized. The keyword matching has no semantic understanding—it can't distinguish between "chase.com" (legitimate bank) and "chase-bank-verify.tk" (obvious phishing). You'll need additional filtering, manual review, or integration with domain reputation APIs to make this operationally useful.
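
One cheap first filter is an allowlist of parent domains you trust. The sketch below assumes a hand-maintained set; the entries and the name is_allowlisted are illustrative, and a production filter would more likely lean on the Public Suffix List or a reputation API.

```python
# Hypothetical allowlist of parent domains considered trustworthy.
TRUSTED_PARENTS = {"shopify.com", "microsoft.com"}


def is_allowlisted(domain):
    """True if the domain is a trusted parent or a subdomain of one."""
    domain = domain.lower().rstrip(".")
    return any(
        domain == parent or domain.endswith("." + parent)
        for parent in TRUSTED_PARENTS
    )


print(is_allowlisted("secure-checkout.shopify.com"))   # True
print(is_allowlisted("shopify.com.evil.tk"))           # False
print(is_allowlisted("login.microsoft-partners.xyz"))  # False
```

Matching on a leading dot matters: a bare endswith("shopify.com") would also accept "notshopify.com". The partner-program domain still needs manual review, since it is not under a trusted parent.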

More problematic: sophisticated phishing operations know about Certificate Transparency monitoring. They've adapted by using generic domain names ("update9472.com") that score low, only revealing phishing content through URL paths ("update9472.com/paypal/login") that never touch CT logs. Others abuse legitimate services—phishing sites hosted on compromised WordPress installations or serverless platforms use certificates for the parent domain, completely invisible to this detection approach. The tool catches lazy attackers beautifully but struggles with the professional tier of phishing infrastructure. It's also worth noting that CertStream itself occasionally drops connections or lags behind real-time by several minutes during high-volume periods, creating gaps in coverage.

Verdict

Use if: You're building a SOC monitoring dashboard and want an early-warning feed for potential phishing infrastructure, especially targeting specific brands or keywords relevant to your organization. It's perfect for threat intelligence teams who'll manually investigate alerts, security researchers studying phishing trends, or as one input to a SOAR platform that correlates with other signals. The customizable YAML files make it adaptable to specific use cases without code changes.

Skip if: You need a low-noise, automated blocking solution or expect this to catch sophisticated phishing operations. The false positive rate demands human review, making it unsuitable for automated takedown workflows. Also skip if you're only concerned with credential theft after sites are live. This catches infrastructure setup, not active campaigns, so you'll still need URL scanning and user reporting mechanisms as downstream detection layers.
