Hakrawler: The Unix Philosophy Applied to Web Reconnaissance
Hook
While modern crawlers race to add headless browsers and JavaScript rendering, hakrawler proved that 5,000+ security researchers just needed a tool that reads from stdin and writes to stdout.
Context
Web reconnaissance in bug bounty and penetration testing traditionally meant opening Burp Suite, configuring a spider, waiting hours for results, and then manually extracting URLs. The workflow broke down when you needed to crawl hundreds of domains—enterprise tools weren’t designed for mass automation, and writing custom scrapers meant dealing with HTTP clients, HTML parsing, concurrency, and all the edge cases that come with web crawling.
The Go security tooling ecosystem solved this by embracing the Unix philosophy: small, composable tools that do one thing well. Tools like subfinder find subdomains, httpx probes for live hosts, and nuclei tests for vulnerabilities, all reading from stdin and writing to stdout. Hakrawler emerged to fill the crawling gap in this pipeline, built as a thin wrapper around the Gocolly library with a stdin/stdout interface and exactly the features reconnaissance workflows need: depth control, subdomain handling, and proxy support. It's not trying to be a comprehensive crawler; it's designed to be the middle piece in a chain of specialized tools.
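A typical chain might look like this (a sketch assuming subfinder, httpx, and hakrawler are installed; example.com stands in for a target you're authorized to test):

```shell
# Enumerate subdomains, keep the live ones, crawl each,
# and deduplicate the discovered endpoints.
subfinder -d example.com -silent \
  | httpx -silent \
  | hakrawler -subs -u \
  | sort -u > endpoints.txt
```

Each stage reads newline-delimited input from the previous one, which is what makes the pipeline composable.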
Technical Insight
Hakrawler’s design centers on accepting URLs via stdin, crawling them with configurable options, and outputting discovered endpoints to stdout. The README describes it as “a simple implementation of the awesome Gocolly library,” wrapping that framework with command-line flags for security testing workflows.
The tool accepts URLs line-by-line from stdin and crawls each one:
# Single URL
echo https://google.com | hakrawler
# Multiple URLs
cat urls.txt | hakrawler
The key design decision is the -subs flag, which controls whether subdomains are included in the crawl scope. This solves a common reconnaissance scenario: you start with example.com, but it redirects to www.example.com. Without -subs, the crawler considers www.example.com out of scope and returns nothing. With -subs, it follows the redirect and crawls the subdomain:
# Returns nothing if example.com redirects to www.example.com
echo https://example.com | hakrawler
# Follows subdomains and crawls properly
echo https://example.com | hakrawler -subs
The README explicitly documents this redirect issue: “a common issue is that the tool returns no URLs. This usually happens when a domain is specified (https://example.com), but it redirects to a subdomain (https://www.example.com). The subdomain is not included in the scope, so no URLs are printed.”
Concurrency is controlled via the -t flag for number of threads (default 8). When piping multiple URLs through stdin, each gets crawled with the specified thread count, allowing parallel processing of many targets.
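A sketch; since 8 is the documented default, -t only needs setting when you want more (or fewer) concurrent requests:

```shell
# Crawl each target with 20 concurrent threads instead of the default 8
cat urls.txt | hakrawler -t 20
```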
The timeout mechanism uses the -timeout flag to set a per-URL time limit. If one domain hangs, hakrawler moves to the next stdin line after the timeout expires:
# Each domain gets 5 seconds max before moving to next
cat urls.txt | hakrawler -timeout 5
Proxy support integrates with intercepting proxies like Burp Suite:
cat urls.txt | hakrawler -proxy http://localhost:8080
The -s flag shows the source of each URL (href, form, script, etc.), helping identify where endpoints were discovered. The -json flag outputs results as JSON for programmatic processing. The -u flag shows only unique URLs, reducing duplicate output.
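These flags combine naturally; a sketch of typical invocations:

```shell
# Show the discovery source (href, script, form, ...) alongside each URL
echo https://example.com | hakrawler -subs -s

# Deduplicated output as JSON for programmatic processing
echo https://example.com | hakrawler -subs -u -json
```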
Depth control via -d (default 2) determines how many link layers deep the crawler follows. The -i flag restricts crawling to only inside the specified path. Custom headers can be passed with -h, and TLS verification can be disabled with -insecure.
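A sketch combining these flags; the double-semicolon header separator follows the README's -h description, and the host and cookie values are placeholders:

```shell
# Crawl only inside /app, three levels deep, with a session cookie,
# skipping TLS verification (useful for staging certificates)
echo https://staging.example.com/app \
  | hakrawler -i -d 3 -insecure \
      -h "Cookie: session=abc123;;X-Requested-With: XMLHttpRequest"
```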
The tool appears designed for speed and composability rather than deep analysis—it crawls HTML responses and extracts links, but doesn’t render JavaScript or execute client-side code. This makes it suitable for the first pass in reconnaissance pipelines where you’re chaining multiple specialized tools together.
Gotcha
The biggest limitation is JavaScript-heavy applications. Since hakrawler is described as crawling for “URLs and JavaScript file locations” by parsing responses, it presumably won’t execute JavaScript to discover dynamically rendered content. Modern single-page applications that render content client-side will return minimal results.
The subdomain redirect issue catches users frequently: you run hakrawler against a domain, get zero results, and assume the tool is broken, when the real problem is that the domain redirected to a subdomain that’s out of scope. The fix is always the same: add -subs or specify the final redirect destination.
The Kali Linux apt package installation is explicitly flagged in the README with a warning: “Note: This will install an older version of hakrawler without all the features, and it may be buggy. I recommend using one of the other methods.” This is unusual—the maintainer recommends against using the package manager version in favor of Go installation or Docker.
Depth control with the -d flag can be confusing. The default depth of 2 means the crawler follows links up to two levels away from the starting URL. For large sites, increasing depth can explode into many thousands of requests. Start shallow and increase depth only when needed.
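One way to follow that advice is to measure before committing, comparing output counts across depths (a sketch; the target URL is a placeholder):

```shell
# Gauge crawl cost: count discovered URLs at increasing depths
for d in 1 2 3; do
  count=$(echo https://example.com | hakrawler -d "$d" -u | wc -l)
  echo "depth $d: $count URLs"
done
```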
The tool provides thread control via -t but doesn’t appear to have built-in rate limiting beyond that. Pointing it at targets with high thread counts could trigger rate limits or protection mechanisms. The tool assumes you have permission to scan your targets.
The -size flag caps page size (default -1, meaning unlimited). With the default, hakrawler reads responses in full, so sites serving very large pages can cause memory pressure unless a limit is set.
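Setting a cap is cheap insurance when crawling unknown targets. A sketch (the KB unit here is my reading of the flag's help text; confirm with hakrawler's own usage output):

```shell
# Cap each response at roughly 2 MB (assuming -size takes KB)
cat urls.txt | hakrawler -size 2048
```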
Verdict
Use hakrawler if you’re building bug bounty or penetration testing workflows that need fast, automated URL discovery across many domains. It excels in Unix pipelines where tools are chained together (for example: echo google.com | haktrails subdomains | httpx | hakrawler) and each component needs to be fast and reliable. Choose it when you value speed and composability over JavaScript analysis, and when your targets serve HTML responses with links to discover. Skip hakrawler if you’re targeting modern single-page applications that render content client-side: as a simple Gocolly wrapper, it will likely return minimal results there. Skip it, too, if you need anything beyond URL discovery; the tool finds endpoints, it does not perform comprehensive application testing. For those cases, consider full-featured crawlers with headless browser support or authentication capabilities. Hakrawler’s Unix-philosophy design makes it excellent as one stage in a reconnaissance pipeline, and that same focused approach makes it unsuitable for deep application analysis.