Hakrawler: The Unix Philosophy Applied to Web Reconnaissance

Hook

While most crawlers try to do everything, hakrawler does one thing exceptionally well: it treats URLs like text streams, making it the grep of web reconnaissance.

Context

Bug bounty hunters and penetration testers face a recurring problem: discovering all accessible endpoints in a web application before competitors do. Traditional crawlers like Burp Suite's spider are powerful but slow and GUI-dependent. Wget and curl offer scripting capabilities but lack web-specific intelligence—they don't understand JavaScript file references, form actions, or modern HTML5 attributes. The reconnaissance phase of security testing became a bottleneck, with researchers either waiting hours for comprehensive crawls or missing critical attack surface with lightweight tools.

Hakrawler emerged from this gap in 2019, built by bug bounty hunter Luke Stephens (hakluke) who needed something fast enough for initial reconnaissance but smart enough to catch the endpoints that matter. Rather than competing with feature-rich crawlers, it embraced the Unix philosophy: do one thing well and compose with other tools. By reading URLs from stdin and writing discoveries to stdout, hakrawler slots perfectly into security pipelines between subdomain enumeration tools and content probers, turning what used to be manual workflows into automated attack surface mapping.

Technical Insight

Hakrawler's architecture is deceptively simple: it's a thin wrapper around Gocolly that strips away complexity in favor of pipeline composability. The tool spawns concurrent goroutines to process URLs in parallel, with each goroutine running an independent Colly collector configured with user-specified depth limits, scope boundaries, and authentication headers. Here's the canonical usage pattern that demonstrates its stdin/stdout design:

echo https://example.com | hakrawler -depth 3 -plain | httpx -silent

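This one-liner crawls example.com to a depth of three links, prints plain URLs (no JSON wrapper), and immediately probes each discovered endpoint for a live response. Flag names have changed between hakrawler releases (newer versions spell the depth flag -d, for example), so confirm them against the help output of the version you have installed. The same pattern extends naturally to more complex workflows. A typical bug bounty reconnaissance chain might look like this: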

cat subdomains.txt | \
  httpx -silent -follow-redirects | \
  hakrawler -subs -depth 2 -proxy http://127.0.0.1:8080 | \
  grep -E '\.js$' | \
  anti-burl | \
  tee js-files.txt

This pipeline validates live subdomains, crawls each while allowing subdomain traversal, routes traffic through Burp Suite for passive inspection (assuming its proxy listener is on 127.0.0.1:8080), filters for JavaScript files, passes them through anti-burl (which keeps only the URLs that return 200 OK), and saves the results, all streamed in real time without intermediate files.
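To make the stdin/stdout contract concrete, here is a minimal sketch of one pipeline stage written in Go. It is not part of hakrawler; the function names filterJS and filterJSString are hypothetical. The stage reads URLs line by line, keeps those whose path ends in .js, drops exact duplicates, and writes each match as soon as it is seen.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/url"
	"os"
	"strings"
)

// filterJS copies URLs from r to w, one per line, keeping only those whose
// path ends in ".js" and dropping exact duplicates. Each match is written
// as soon as it is read, so the stage streams instead of buffering.
func filterJS(r io.Reader, w io.Writer) error {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		u, err := url.Parse(line)
		if err != nil {
			continue // skip lines that are not parseable URLs
		}
		// Check the path only, so "app.js?v=2" still counts as JavaScript.
		if !strings.HasSuffix(strings.ToLower(u.Path), ".js") {
			continue
		}
		if seen[line] {
			continue
		}
		seen[line] = true
		if _, err := fmt.Fprintln(w, line); err != nil {
			return err
		}
	}
	return sc.Err()
}

// filterJSString is a convenience wrapper for trying the filter on a string.
func filterJSString(input string) string {
	var sb strings.Builder
	if err := filterJS(strings.NewReader(input), &sb); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	if err := filterJS(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "filter error:", err)
		os.Exit(1)
	}
}
```

Because it only touches stdin and stdout, a stage like this can sit anywhere in the chain. Unlike grep with a '\.js$' pattern, it matches on the parsed path, so URLs with query strings are not lost.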

Under the hood, hakrawler leverages Gocolly's callback system to extract URLs from multiple sources during each page visit. It doesn't just parse anchor tags; it examines script tags for src attributes, form actions, iframe sources, and even inline JavaScript for string literals matching URL patterns. The scope control is particularly clever—when you specify -subs, it extracts the root domain from the seed URL and uses Go's standard library URL parsing to validate whether discovered links share the same parent domain:

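// Simplified from hakrawler's scope logic
// Requires: log, net/url, strings, and the colly package
seedURL, err := url.Parse(startURL)
if err != nil {
  log.Fatalf("invalid seed URL %q: %v", startURL, err)
}
seedDomain := seedURL.Hostname()

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
  // Resolve relative links against the page that contained them
  absoluteURL := e.Request.AbsoluteURL(e.Attr("href"))
  parsedURL, err := url.Parse(absoluteURL)
  if err != nil || absoluteURL == "" {
    return // skip links that cannot be resolved or parsed
  }
  host := parsedURL.Hostname()

  // An exact match is always in scope. With -subs, subdomains are allowed
  // too, matched on a dot boundary so that "evilexample.com" is not
  // treated as a subdomain of "example.com".
  inScope := host == seedDomain ||
    (subsIncluded && strings.HasSuffix(host, "."+seedDomain))
  if inScope {
    e.Request.Visit(absoluteURL)
  }
})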
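Scope checks of this kind are easy to get subtly wrong: a bare suffix comparison against example.com would also accept evilexample.com. The following standalone sketch uses only the standard library and is not taken from hakrawler. The function name inScope and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// inScope reports whether rawURL's host belongs to seedHost.
// With includeSubs it also accepts subdomains, but only on a label
// boundary, so "evilexample.com" does not match "example.com".
func inScope(rawURL, seedHost string, includeSubs bool) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	seed := strings.ToLower(seedHost)
	if host == "" || seed == "" {
		return false // e.g. mailto: links have no host
	}
	if host == seed {
		return true
	}
	return includeSubs && strings.HasSuffix(host, "."+seed)
}

func main() {
	links := []string{
		"https://example.com/login",
		"https://api.example.com/v1",
		"https://evilexample.com/",
		"mailto:someone@example.com",
	}
	for _, link := range links {
		fmt.Printf("%-32s subs=%-5v exact=%v\n",
			link, inScope(link, "example.com", true), inScope(link, "example.com", false))
	}
}
```

Prepending the dot before the suffix comparison is the whole trick: it forces the match to start at a DNS label boundary.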

The threading model uses Gocolly's async capabilities with a configurable parallelism limit. Setting -t 5 does not create five operating-system threads; goroutines are far cheaper than that. The value caps the number of HTTP requests in flight at once, which prevents server overload while keeping the crawl fast. Each goroutine maintains its own cookie jar and request context, which matters when crawling applications with session-based content.
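The idea of capping in-flight requests rather than counting threads can be sketched with a buffered channel used as a semaphore. This is a generic Go pattern, not hakrawler's or Colly's implementation. The name runLimited is hypothetical, and the sleep stands in for an HTTP request.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited calls work once per item with at most limit calls in flight.
// It returns the highest number of calls observed running at the same time.
func runLimited(items []string, limit int, work func(string)) int {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for _, item := range items {
		sem <- struct{}{} // blocks while limit calls are already running
		wg.Add(1)
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot last

			n := atomic.AddInt64(&inFlight, 1)
			// Record the peak without a lock.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			work(it)
			atomic.AddInt64(&inFlight, -1)
		}(item)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&peak))
}

func main() {
	urls := []string{"/a", "/b", "/c", "/d", "/e", "/f"}
	peak := runLimited(urls, 2, func(u string) {
		time.Sleep(10 * time.Millisecond) // stand-in for an HTTP request
	})
	fmt.Println("peak concurrency:", peak)
}
```

One goroutine per URL is cheap. The semaphore, not the goroutine count, is what protects the target server.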

One underappreciated feature is the -headers flag for authenticated crawling. Many modern applications hide functionality behind authentication, and hakrawler handles this elegantly:

echo https://app.example.com/dashboard | \
  hakrawler -headers "Cookie: session=abc123; XSRF-TOKEN=xyz789" -depth 3

The custom headers are attached to every request in the crawl, letting you map authenticated attack surface. This is where hakrawler shines compared to unauthenticated crawlers—you can export your browser's authenticated session, paste it into the headers flag, and discover admin panels, API endpoints, and privileged functionality that public crawlers would never see.
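A header flag like this has to be split into individual headers before it can be attached to requests. The sketch below shows one way to do that in Go. It assumes multiple headers are separated by ";;" so that a single ";" can still appear inside a Cookie value. Treat that separator as an assumption to verify against your version's help output. The function name parseHeaders is likewise illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeaders splits a raw flag value into individual headers.
// Assumption: headers are separated by ";;", which leaves a single ";"
// free to appear inside a value such as a Cookie header.
func parseHeaders(raw string) http.Header {
	h := http.Header{}
	for _, part := range strings.Split(raw, ";;") {
		kv := strings.SplitN(part, ":", 2)
		if len(kv) != 2 {
			continue // no colon, so not a header
		}
		name := strings.TrimSpace(kv[0])
		if name == "" {
			continue
		}
		h.Add(name, strings.TrimSpace(kv[1]))
	}
	return h
}

func main() {
	raw := "Cookie: session=abc123; XSRF-TOKEN=xyz789;;Referer: https://app.example.com/"
	req, err := http.NewRequest(http.MethodGet, "https://app.example.com/dashboard", nil)
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	// Attach every parsed header to the outgoing request.
	for name, values := range parseHeaders(raw) {
		for _, v := range values {
			req.Header.Add(name, v)
		}
	}
	fmt.Println("Cookie: ", req.Header.Get("Cookie"))
	fmt.Println("Referer:", req.Header.Get("Referer"))
}
```

Splitting on the first colon only keeps values that contain colons, such as URLs in a Referer header, intact.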

The output format offers two modes: plain text URLs (one per line) or JSON with metadata. The JSON mode includes the URL source, whether it's from JavaScript, and the discovery path, which helps prioritize testing. Endpoints found in JavaScript files often indicate API routes or AJAX calls that may have weaker input validation than user-facing pages.
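As a rough sketch of how the JSON mode could feed prioritization, the following Go program reads one JSON object per line and moves script-sourced findings to the front. The field names (source, url) and the source values ("script", "href", "form") are assumptions chosen for illustration. Check the actual output of your hakrawler version before relying on them.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// finding mirrors one JSON object per output line.
// Field names are assumptions for illustration, not a documented schema.
type finding struct {
	Source string `json:"source"`
	URL    string `json:"url"`
}

// fromScript reports whether a finding was discovered via a script reference.
func fromScript(f finding) bool {
	return strings.Contains(strings.ToLower(f.Source), "script")
}

// prioritizedURLs parses JSON lines and returns the URLs with
// script-sourced findings first, otherwise preserving input order.
func prioritizedURLs(jsonLines string) []string {
	var findings []finding
	sc := bufio.NewScanner(strings.NewReader(jsonLines))
	for sc.Scan() {
		var f finding
		if err := json.Unmarshal(sc.Bytes(), &f); err != nil || f.URL == "" {
			continue // skip blank or malformed lines
		}
		findings = append(findings, f)
	}
	sort.SliceStable(findings, func(i, j int) bool {
		return fromScript(findings[i]) && !fromScript(findings[j])
	})
	urls := make([]string, 0, len(findings))
	for _, f := range findings {
		urls = append(urls, f.URL)
	}
	return urls
}

func main() {
	sample := `{"source":"href","url":"https://example.com/about"}
{"source":"script","url":"https://example.com/api/v1/users"}
{"source":"form","url":"https://example.com/login"}`
	for _, u := range prioritizedURLs(sample) {
		fmt.Println(u)
	}
}
```

A stable sort matters here: within each group the crawl order is preserved, which keeps the output easy to compare between runs.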

Gotcha

The most common failure mode is the silent subdomain redirect. If you crawl example.com and it redirects to www.example.com, hakrawler stops dead unless you passed the -subs flag. There is no warning and no error message, just zero output, which has led many first-time users to assume the tool is broken. The cause is scoping rather than a crash: the redirect target is a different hostname, so it falls outside the exact-domain scope and is never crawled. Either seed the crawl with the final URL in the redirect chain or use -subs, unless you specifically need exact-domain-only crawling.
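One way to avoid the empty-output surprise is to resolve the redirect chain before crawling and seed the crawl with the final host. The sketch below is a hedged illustration using Go's net/http, not something hakrawler does for you. To run offline it uses a hypothetical fakeTransport that simulates example.com redirecting to www.example.com. In real use you would pass http.DefaultClient instead of newFakeClient().

```go
package main

import (
	"fmt"
	"net/http"
)

// fakeTransport stands in for the network so the sketch runs offline.
// It redirects example.com to www.example.com and answers 200 elsewhere.
type fakeTransport struct{}

func (fakeTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp := &http.Response{
		StatusCode: http.StatusOK,
		Header:     http.Header{},
		Body:       http.NoBody,
		Request:    req,
	}
	if req.URL.Hostname() == "example.com" {
		resp.StatusCode = http.StatusMovedPermanently
		resp.Header.Set("Location", "https://www.example.com/")
	}
	return resp, nil
}

// newFakeClient returns a client wired to the simulated network above.
func newFakeClient() *http.Client {
	return &http.Client{Transport: fakeTransport{}}
}

// finalHost follows redirects from rawURL and returns the hostname of the
// last URL in the chain.
func finalHost(client *http.Client, rawURL string) (string, error) {
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	// resp.Request is the request that produced the final response.
	return resp.Request.URL.Hostname(), nil
}

func main() {
	host, err := finalHost(newFakeClient(), "https://example.com/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("seed the crawl with host:", host)
}
```

If the returned host differs from the one you started with, either crawl the final URL directly or widen the scope to subdomains.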

The JavaScript limitation is more fundamental. Hakrawler parses JavaScript files as text, using regex to find URL-like strings. It doesn't execute JavaScript, which means single-page applications built with React, Vue, or Angular will yield incomplete results. Dynamic routes constructed at runtime, URLs fetched from API calls, and client-side routing won't be discovered. A heavily client-rendered application might expose only /app.js and index.html to hakrawler while the actual application has dozens of routes. For these cases, you need headless browser crawlers like Katana or Burp Suite with Chromium rendering.

The depth setting can also mislead. Depth 2 means two link hops from the starting URL, not two directory levels in the path structure, which catches users who expect deep filesystem-style traversal.
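The text-only approach and its blind spot are easy to demonstrate. The sketch below uses an illustrative regular expression, not hakrawler's actual pattern, to pull quoted URL-like strings out of JavaScript source. The names urlLike and extractURLs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlLike matches quoted string literals that look like absolute URLs or
// root-relative paths. The pattern is illustrative only.
var urlLike = regexp.MustCompile("[\"'`]((?:https?://|/)[^\"'`\\s]+)[\"'`]")

// extractURLs returns the unique URL-like literals in js, in order of
// first appearance. It never executes the script.
func extractURLs(js string) []string {
	var out []string
	seen := make(map[string]bool)
	for _, m := range urlLike.FindAllStringSubmatch(js, -1) {
		if !seen[m[1]] {
			seen[m[1]] = true
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	js := `
fetch("/api/v1/users");
const cdn = 'https://cdn.example.com/lib.js';
const id = 42;
fetch("/api/v1/orders/" + id);   // only the static prefix is visible
router.push(buildPath(section)); // built at runtime: nothing to match
`
	for _, u := range extractURLs(js) {
		fmt.Println(u)
	}
}
```

The last two calls show the limitation: a concatenated route surfaces only as its static prefix, and a route assembled by a function call does not surface at all.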

Verdict

Use hakrawler if you're building automated reconnaissance pipelines for bug bounties or pentests, especially when you need fast endpoint discovery across many targets. It excels when composed with other tools in bash scripts, when crawling server-rendered applications with traditional HTML navigation, or when you need to proxy traffic through Burp Suite for passive analysis while mapping attack surface. The stdin/stdout design makes it perfect for horizontal scaling across hundreds of domains. Skip it if you're crawling JavaScript-heavy SPAs where most content renders client-side—you need headless browser execution instead. Also skip it for comprehensive site mapping or SEO analysis where you need complete coverage; hakrawler optimizes for speed over completeness. If you're working on a single target interactively rather than automating across many targets, a GUI crawler with session recording will serve you better.
