Hunting Phishing Kits at Scale: Inside Kitphishr's Concurrent Scraping Architecture

Hook

Every compromised phishing site has a 70% chance of exposing the attacker's source code in an unsecured directory—complete with their email addresses, testing logs, and victim credentials.

Context

When blue teams investigate phishing campaigns, the gold mine isn't just blocking the fake login page; it's finding the phishing kit source code. These kits, typically distributed as zip files on criminal forums, often contain careless mistakes: attacker email addresses in configuration files, logs of victim credentials sent during testing, hardcoded Telegram bot tokens, and breadcrumbs linking multiple campaigns to the same threat actor. The problem is scale. PhishTank alone catalogs 50,000+ active phishing URLs monthly, and manually checking each site for exposed directories is impractical.

Kitphishr emerged to close this automation gap. General-purpose scrapers like hakrawler can crawl websites, but they don't encode the patterns that phishing infrastructure typically follows: compromised WordPress sites with kits uploaded to /wp-content/uploads/, zip files left in /admin/, or entire directory listings exposed at /phishing/. Kitphishr combines these threat intelligence heuristics with Go's concurrency model to process thousands of URLs from live feeds, hunting for the exposed artifacts that turn a simple phishing report into actionable threat intelligence.

Technical Insight

Kitphishr's architecture centers on three core components: feed aggregation, concurrent HTTP traversal with path fuzzing, and artifact detection. The tool is built entirely in Go, leveraging goroutines and channels to handle hundreds of simultaneous HTTP requests without blocking.

The feed integration layer is elegantly simple. Rather than requiring users to manually compile lists of phishing URLs, Kitphishr pulls from four public sources: PhishTank, PhishStats, OpenPhish, and Phishing.Database. Each feed has its own parser since they return different formats—PhishTank uses JSON with an optional API key for higher rate limits, while OpenPhish returns plain text. When you run kitphishr -feed phishtank, it fetches the latest data, extracts URLs, and pipes them into the scanning engine. This automatic feed integration means your threat hunting always starts with fresh targets.
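
To make the per-feed parsing concrete, here is a minimal sketch of two parsers: one for a plain-text feed with one URL per line, and one for a JSON feed. This is an illustration rather than Kitphishr's actual code. The function names (parseTextFeed, parseJSONFeed), the "#" comment convention, and the "url" JSON key are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseTextFeed extracts one URL per line from a plain-text feed,
// skipping blank lines and lines starting with "#" (assumed comment style).
func parseTextFeed(data string) []string {
	var urls []string
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

// feedEntry models only the field this sketch needs. The "url" key is an
// assumption about the feed schema, not a documented guarantee.
type feedEntry struct {
	URL string `json:"url"`
}

// parseJSONFeed decodes a JSON array of entries and collects non-empty URLs.
func parseJSONFeed(data []byte) ([]string, error) {
	var entries []feedEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(entries))
	for _, e := range entries {
		if e.URL != "" {
			urls = append(urls, e.URL)
		}
	}
	return urls, nil
}

func main() {
	fromText := parseTextFeed("http://a.example/login\n\nhttp://b.example/x\n")
	fromJSON, err := parseJSONFeed([]byte(`[{"url":"http://c.example/"}]`))
	if err != nil {
		fmt.Println("json feed error:", err)
		return
	}
	// Both parsers produce the same shape, so the scanner never needs
	// to know which feed a URL came from.
	fmt.Println(append(fromText, fromJSON...))
}
```

Normalizing every feed to a plain slice of URL strings keeps the scanning engine independent of any one feed's format.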

The HTTP traversal engine is where the concurrency shines. Kitphishr accepts URLs through stdin, then spawns a configurable number of worker goroutines (default 50, adjustable via -c flag) that process the queue. For each URL, it doesn't just fetch the homepage—it fuzzes common phishing kit paths. The tool maintains an internal wordlist of predictable directories where attackers commonly leave kits exposed: /admin/, /assets/, /files/, /upload/, /public/, and dozens more. Here's a simplified version of the core scanning logic:

// Simplified sketch. Needs the io, net/http, strings, and time imports;
// Finding, isOpenDirectory, and extractZipLinks are defined elsewhere.
func scanURL(targetURL string, paths []string, results chan<- Finding) {
    client := &http.Client{
        Timeout: 10 * time.Second,
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse // Don't follow redirects
        },
    }

    for _, path := range paths {
        fullURL := strings.TrimRight(targetURL, "/") + path
        resp, err := client.Get(fullURL)
        if err != nil {
            continue
        }
        if resp.StatusCode != http.StatusOK {
            resp.Body.Close()
            continue
        }
        // Read and close the body inside the loop. A defer here would keep
        // every response open until scanURL returns.
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            continue
        }
        // Check for directory listing indicators
        if isOpenDirectory(body) {
            results <- Finding{URL: fullURL, Type: "open_dir"}
        }
        // Hunt for zip files in HTML
        for _, link := range extractZipLinks(body) {
            results <- Finding{URL: link, Type: "kit_zip"}
        }
    }
}

The tool intentionally doesn't follow redirects. Phishing sites often redirect victims to legitimate sites after collecting credentials, but the kit files remain accessible under the original URL structure. This design choice keeps the scanner from wasting time following redirect chains to legitimate banking sites.

The artifact detection logic looks for two primary signals: open directory listings and zip file references. Open directories are identified by parsing HTML for common directory listing signatures—Apache's "Index of" title, nginx's directory listing format, or even custom file managers. When found, the tool extracts all .zip file links, which are prime candidates for phishing kits. The detection regex patterns are tuned specifically for phishing kit naming conventions: login.zip, paypal-scam.zip, office365.zip, and similar.
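
The two signals described above can be sketched as follows. This is a hedged illustration, not the tool's real detection code: the helper names (looksLikeOpenDir, zipLinks), the "Index of" title check, and the href regex are assumptions chosen for the example, and a real scanner would likely need more signatures than this.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// zipHref captures href values ending in .zip, case-insensitively.
// Illustrative pattern only; it ignores links with query strings.
var zipHref = regexp.MustCompile(`(?i)href\s*=\s*["']([^"'#?]+\.zip)["']`)

// looksLikeOpenDir reports whether the HTML carries the "Index of" title
// that Apache and nginx auto-generated listings typically use.
func looksLikeOpenDir(body string) bool {
	return strings.Contains(strings.ToLower(body), "<title>index of")
}

// zipLinks returns de-duplicated absolute URLs for every .zip link in body,
// resolving relative hrefs against the page they were found on.
func zipLinks(pageURL, body string) []string {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil
	}
	seen := make(map[string]bool)
	var links []string
	for _, m := range zipHref.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue
		}
		abs := base.ResolveReference(ref).String()
		if !seen[abs] {
			seen[abs] = true
			links = append(links, abs)
		}
	}
	return links
}

func main() {
	page := `<html><head><title>Index of /files</title></head><body>
<a href="login.zip">login.zip</a> <a href="readme.txt">readme.txt</a>
</body></html>`
	fmt.Println(looksLikeOpenDir(page))
	fmt.Println(zipLinks("http://host.example/files/", page))
}
```

Resolving each href against the page URL matters because directory listings almost always use relative links, which are useless to a downloader until they are made absolute.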

The optional download feature (-d flag with a target directory) uses a separate goroutine pool to fetch suspected kits without blocking the main scanner. Downloads are streamed directly to disk with basic size limits to prevent filling storage with decoy files. The tool preserves original filenames but sanitizes paths to prevent directory traversal attacks in the kit filenames themselves—a nice touch showing the author's security awareness.
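
A minimal sketch of the two safeguards mentioned here, a size cap and filename sanitization, might look like the following. The names safeKitName, copyCapped, and errTooLarge, the fallback name "kit.zip", and the 16-byte demo limit are all hypothetical; the tool's actual limits and naming rules are not stated in this article.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/url"
	"path"
	"strings"
)

var errTooLarge = errors.New("download exceeds size limit")

// safeKitName reduces a kit URL to a bare filename so that a hostile path
// such as "../../etc/x.zip" cannot escape the output directory.
func safeKitName(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "kit.zip"
	}
	// Treat backslashes as separators too, then keep only the last element.
	name := path.Base(strings.ReplaceAll(u.Path, `\`, "/"))
	if name == "." || name == ".." || name == "/" {
		return "kit.zip"
	}
	return name
}

// copyCapped streams src into dst but reads at most max+1 bytes, returning
// errTooLarge when the source is bigger than max. The caller is expected to
// discard whatever was written in that case.
func copyCapped(dst io.Writer, src io.Reader, max int64) (int64, error) {
	n, err := io.Copy(dst, io.LimitReader(src, max+1))
	if err != nil {
		return n, err
	}
	if n > max {
		return n, errTooLarge
	}
	return n, nil
}

func main() {
	fmt.Println(safeKitName("http://host.example/up/..%2F..%2Fetc%2Fkit.zip"))

	var buf bytes.Buffer
	_, err := copyCapped(&buf, strings.NewReader(strings.Repeat("A", 64)), 16)
	fmt.Println(errors.Is(err, errTooLarge))
}
```

Reading one byte past the limit is what lets the copy distinguish a file that is exactly at the cap from one that exceeds it.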

Concurrency control is handled through a semaphore pattern using buffered channels. The -c flag controls how many goroutines can make HTTP requests simultaneously. Set it too high (250+) and you'll hammer servers and trigger rate limiting; too low and you'll waste time on slow servers blocking faster ones. The default of 50 is a sweet spot for most networks, but security operations with dedicated infrastructure can push it higher.
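
The semaphore pattern described above can be sketched with a buffered channel whose capacity plays the role of the -c value. The function runBounded and its peak-tracking bookkeeping are illustrative additions for this example, not code from the tool.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded calls work once per item with at most limit calls in flight,
// and returns the highest concurrency it observed.
func runBounded(items []string, limit int, work func(string)) int {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for _, item := range items {
		sem <- struct{}{} // blocks while limit workers already hold a slot
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work(item)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(item)
	}
	wg.Wait()
	return peak
}

func main() {
	urls := []string{"a", "b", "c", "d", "e", "f"}
	peak := runBounded(urls, 2, func(u string) {
		time.Sleep(10 * time.Millisecond) // stand-in for an HTTP request
	})
	fmt.Println("peak concurrent workers:", peak)
}
```

Because the send on the channel happens before the goroutine starts, the loop itself stalls once the limit is reached, so the number of live goroutines stays bounded rather than only the number of active requests.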

Gotcha

Kitphishr's effectiveness depends entirely on attacker sloppiness, and modern phishing operations are getting more sophisticated. The tool excels at finding amateur kits on compromised servers where attackers leave directories open and files exposed, but it completely misses phishing infrastructure that uses proper access controls, non-standard paths, or authentication. If attackers use randomly generated directory names instead of predictable paths like /admin/, the fuzzing approach returns nothing. You're essentially playing a probability game based on common patterns, which works surprisingly often but will miss careful operators.

The lack of built-in safety features is concerning for less experienced users. Kitphishr will happily download hundreds of zip files containing live malware, credential stealers, and backdoored PHP scripts directly to your filesystem. There's no sandboxing, no malware scanning, and no validation that what you're downloading is actually a phishing kit versus something more dangerous. You need your own isolation strategy: dedicated VMs, containerized environments, or at minimum a separate analysis directory with no execution permissions.

The tool also generates significant HTTP traffic at scale. Running it against thousands of URLs at high concurrency can look like an attack from the perspective of hosting providers, potentially getting your IP blocked or triggering abuse complaints. There's no built-in rate limiting per domain, so you could inadvertently hammer the same compromised hosting provider with hundreds of requests if multiple phishing sites share infrastructure.
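
One way to work around the missing per-domain limit is to pre-process the input yourself. The sketch below groups URLs by hostname so a wrapper script could pace or cap requests per host before handing them to the scanner. This is not a Kitphishr feature; groupByHost is a hypothetical helper, and grouping by hostname only approximates shared infrastructure, since different hostnames can still resolve to the same server.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// groupByHost buckets URLs by lowercase hostname (port stripped).
// Inputs that fail to parse or have no host are skipped.
func groupByHost(rawURLs []string) map[string][]string {
	groups := make(map[string][]string)
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil || u.Hostname() == "" {
			continue
		}
		host := strings.ToLower(u.Hostname())
		groups[host] = append(groups[host], raw)
	}
	return groups
}

func main() {
	groups := groupByHost([]string{
		"http://shop.example/login/",
		"http://shop.example:8080/admin/",
		"https://bank.example/secure/",
	})
	for host, urls := range groups {
		fmt.Println(host, len(urls))
	}
}
```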

Verdict

Use if: You're conducting legitimate threat intelligence research or blue team operations and need to scale phishing kit collection beyond manual discovery. The feed integration and concurrent architecture save dozens of hours compared to manual URL processing, and the exposed kits provide valuable IOCs (attacker emails, infrastructure patterns, and victim data locations) that enhance incident response. It's particularly valuable for security operations teams tracking specific phishing campaigns who need to quickly identify if newly reported URLs match known kit patterns.

Skip if: You lack proper isolation for handling potentially malicious downloads, don't have a defined legal and ethical framework for this type of research, or expect the tool to find kits on professionally operated phishing infrastructure. Also skip if your use case is general web scraping; tools like meg or hakrawler offer more flexibility without the phishing-specific assumptions. This is a specialized threat hunting instrument, not a pentesting toy, and requires the operational maturity to handle both the technical risks of malicious content and the ethical responsibilities of threat intelligence work.
