unfurl: The Bug Bounty Hunter's Secret Weapon for URL Parsing
Hook
While most developers reach for Python's urllib or regex hacks to parse URLs, security researchers processing millions of URLs during reconnaissance know there's a faster way: a 500-line Go tool that does one thing perfectly.
Context
If you've ever worked in bug bounty hunting, security reconnaissance, or large-scale web scraping, you know the pattern: you've just enumerated thousands of URLs from a target domain using tools like waybackurls, gau, or hakrawler. Now you need to answer questions like "Which subdomains appear most frequently?" or "What query parameters are being used?" or "Which paths contain the word 'admin'?"
Traditionally, developers solve this with brittle awk one-liners, complex sed expressions that break on edge cases, or Python scripts that require virtual environments and dependency management. The problem isn't just complexity—it's speed and reliability. When you're processing millions of URLs in a CI/CD pipeline or during time-sensitive security assessments, every second counts, and failures aren't acceptable. Tom Hudson (tomnomnom) built unfurl to solve this exact problem: a zero-dependency, single-binary tool that treats URLs as structured data and lets you extract, transform, and deduplicate components with Unix-pipeline elegance.
Technical Insight
unfurl's architecture is deceptively simple: it reads URLs line-by-line from stdin, parses them using Go's standard net/url package, and outputs specified components. But the magic lies in its format string system and domain decomposition logic.
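To make the read, parse, emit loop concrete, here is a minimal Python sketch of the same shape. It is not unfurl's code: unfurl is written in Go and uses net/url, while this stand-in uses Python's urllib.parse, and the function name extract_domains is hypothetical.

```python
from urllib.parse import urlsplit


def extract_domains(lines):
    """Yield the hostname of each line that parses as a URL with a host."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            host = urlsplit(line).hostname
        except ValueError:
            # Skip unparseable input quietly (assumed to mirror the default
            # skip-on-error behavior described in this article).
            continue
        if host:
            yield host


sample = ["https://api.example.com/v1/users?id=123\n", "http://[broken\n"]
for domain in extract_domains(sample):
    print(domain)  # prints api.example.com
```

Because extract_domains is a generator, passing it sys.stdin instead of the sample list would process piped input one line at a time without buffering the whole file.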
At its core, unfurl supports two kinds of mode: predefined modes and custom format strings. The predefined modes (domains, paths, keys, values) handle common use cases; keys and values here refer to query-string parameter names and their values. For example, extracting all unique domains from a URL list:
cat urls.txt | unfurl -u domains
# Input: https://api.example.com/v1/users?id=123
# Output: api.example.com
The -u flag is crucial here: it deduplicates output without a separate sort | uniq stage, so each new value is printed as soon as it is first seen. sort, by contrast, must read its entire input before it can emit anything, which matters on gigabyte-sized URL lists. The cost is that -u has to remember every unique value it has already printed.
Where unfurl truly shines is custom format strings. These use percent-prefixed directives similar to printf, allowing arbitrary component extraction and recombination:
cat urls.txt | unfurl format "%s://%d%p"
# Input: https://api.example.com/v1/users?id=123&token=abc
# Output: https://api.example.com/v1/users
cat urls.txt | unfurl format "%d %p"
# Output: api.example.com /v1/users
cat urls.txt | unfurl -u format "%s"
# Extracts unique schemes (http, https, ftp, etc.)
The format directives include %s (scheme), %u (user info), %d (domain), %S (subdomain), %r (root domain), %t (TLD), %P (port), %p (path), %q (query string), %f (fragment), and %a (authority). %@ is a conditional directive that inserts an @ only when user info is present. This composability removes the need for complex regex or multiple pipeline stages in most cases.
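As a rough illustration of how printf-style expansion like this can work, the Python sketch below handles a small subset of the directives (%s, %d, %P, %p, %q, %f). The function name format_url and the choice to leave unknown directives untouched are assumptions of this sketch, not a description of unfurl's implementation.

```python
import re
from urllib.parse import urlsplit


def format_url(url, fmt):
    """Expand a small subset of unfurl-style directives against one URL."""
    parts = urlsplit(url)
    fields = {
        "s": parts.scheme,
        "d": parts.hostname or "",
        "P": str(parts.port) if parts.port is not None else "",
        "p": parts.path,
        "q": parts.query,
        "f": parts.fragment,
    }
    # Replace each %X with its field; unknown directives are left untouched.
    return re.sub(r"%(.)", lambda m: fields.get(m.group(1), m.group(0)), fmt)


url = "https://api.example.com:8443/v1/users?id=123&token=abc#top"
print(format_url(url, "%s://%d%p"))  # https://api.example.com/v1/users
print(format_url(url, "%d %P"))      # api.example.com 8443
```

A lookup table keyed by directive letter keeps the expander to one regex pass, and supporting another directive only means adding one more entry to the table.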
The domain decomposition logic is particularly interesting. unfurl splits domains into subdomain, root, and TLD components using a positional heuristic rather than the Public Suffix List. Looking at the source code, it splits the hostname on dots and takes the last label as the TLD, the second-to-last as the root domain, and everything before that as the subdomain. This works for hostnames under single-label TLDs such as .com or .org:
echo "https://admin.api.example.com/path" | unfurl format "Sub: %S, Root: %r, TLD: %t"
# Output: Sub: admin.api, Root: example, TLD: com
However, this simplistic approach fails with multi-level TLDs like .co.uk or .com.br: only the final label (uk or br) is treated as the TLD, so co or com ends up as the root domain and the registered name is pushed into the subdomain. For security reconnaissance, where speed and pattern recognition matter more than perfect accuracy, this tradeoff is often acceptable.
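The positional rule is easy to reproduce, which also makes the failure mode easy to see. The Python sketch below implements the rule as this article describes it (last label is the TLD, second-to-last is the root). It is an illustration of that description, not code taken from unfurl, and naive_split is a hypothetical name.

```python
def naive_split(host):
    """Split a hostname into (subdomain, root, tld) by position alone."""
    labels = host.split(".")
    if len(labels) < 2:
        # A bare label such as "localhost" has no TLD to split off.
        return ("", host, "")
    tld = labels[-1]
    root = labels[-2]
    subdomain = ".".join(labels[:-2])
    return (subdomain, root, tld)


print(naive_split("admin.api.example.com"))  # ('admin.api', 'example', 'com')
print(naive_split("api.example.co.uk"))      # ('api.example', 'co', 'uk')
```

The second call shows the problem: the registered name example lands in the subdomain slot. A Public Suffix List lookup would instead recognize co.uk as the suffix.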
The JSON mode (unfurl json) turns each URL into a structured JSON object, enabling powerful combinations with jq. The field names used below (.path, .host) are assumptions, so run a single URL through unfurl json first to confirm what your version emits:
cat urls.txt | unfurl json | jq -r 'select(.path | contains("api")) | .host'
# Filters URLs with 'api' in path, outputs only hosts
cat urls.txt | unfurl -u keys
# Extracts all unique query parameter keys; URLs without a query string contribute nothing
Implementation-wise, unfurl processes URLs synchronously in a single goroutine. While this seems like a missed opportunity for concurrency, the workload is typically I/O-bound (reading from stdin) rather than CPU-bound, since parsing a single URL is cheap. The simplicity also avoids goroutine scheduling overhead and guarantees deterministic output order, which is critical when deduplication with -u depends on first-seen ordering.
One clever detail: unfurl uses a map-based deduplication strategy when -u is enabled, storing seen values and only outputting new ones. This trades memory for speed: it needs O(n) memory for n unique values, but each lookup is O(1) on average, whereas the traditional sort | uniq approach needs O(n log n) comparisons and cannot emit anything until it has read all of its input.
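The first-seen strategy described above can be sketched in a few lines of Python. A set plays the role of the Go map, and the function name unique is hypothetical. This shows the general technique, not unfurl's source.

```python
def unique(values):
    """Yield each value the first time it appears, preserving input order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


hosts = ["api.example.com", "www.example.com", "api.example.com"]
print(list(unique(hosts)))  # ['api.example.com', 'www.example.com']
```

Output order follows input order, and the seen set only ever grows, which is the source of the memory cost discussed in the next section.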
Gotcha
unfurl's simplicity comes with real limitations that will bite you in production scenarios. The domain parsing, while fast, uses naive dot-splitting that categorizes api.example.co.uk incorrectly: it treats uk as the TLD and co as the root domain, leaving example in the subdomain. If you're working with international domains or need accurate registrable-domain boundaries, you'll need to post-process with a Public Suffix List library.
Error handling is minimal and silent by default. Malformed URLs that Go's net/url package cannot parse are skipped without warning unless you enable verbose mode with -v. This creates a silent failure mode where you might not realize you're missing data: inputs with, for example, an invalid percent-escape or a space in the host are rejected by net/url and simply vanish from the output. Always run a sample of your input through with -v to validate parsing behavior.
The tool also has no concept of URL normalization. https://example.com/path and https://example.com/path/ are treated as different paths, as are https://example.com?a=1&b=2 and https://example.com?b=2&a=1. If you need canonicalized URLs for deduplication, you'll need to normalize before piping to unfurl. Memory usage with -u can also surprise you: the deduplication map grows with the number of unique values, so a 10GB URL list made up mostly of distinct values can consume gigabytes of RAM and trigger OOM kills in containerized environments.
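For the normalization step, a pre-processing pass might look like the Python sketch below. The two rules it applies (strip trailing slashes, sort query parameters) are assumptions chosen to match the examples above. Whether they are safe depends on the target server, and normalize is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form: no trailing slash, query parameters sorted."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    # Sort (key, value) pairs so that parameter order no longer matters.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(pairs), parts.fragment)
    )


print(normalize("https://example.com/path/"))    # https://example.com/path
print(normalize("https://example.com?b=2&a=1"))  # https://example.com?a=1&b=2
```

Note that round-tripping through parse_qsl and urlencode can change how values are percent-encoded, so treat the output as a deduplication key rather than a byte-exact copy of the original URL.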
Verdict
Use if: You're building security reconnaissance pipelines, processing large URL datasets in shell scripts, or need fast extraction of URL components without Python/Node.js dependencies. It's perfect for bug bounty workflows, analyzing web server logs, or any scenario where you're combining Unix tools to answer questions about URL patterns. The single binary with zero dependencies makes it ideal for Docker containers, CI/CD pipelines, and air-gapped environments.
Skip if: You need internationalized domain name (IDN) support, accurate multi-level TLD parsing (use a Public Suffix List library instead), or you're already working in Python/JavaScript where native URL parsing libraries provide better error handling and normalization. Also skip it if you need concurrent processing of truly massive datasets; in those cases, a custom Go program using goroutines or a streaming data tool like Apache Beam will perform better.