gau: Mining Four Internet Archives at Once for Security Reconnaissance
Hook
The average web application has been crawled and archived dozens of times over its lifetime, creating a permanent record of endpoints, parameters, and API routes that may no longer be protected—or even remembered by current developers.
Context
Before gau, security researchers conducting reconnaissance had to manually query multiple archive services: the Internet Archive’s Wayback Machine for historical snapshots, Common Crawl’s petabyte-scale web index, AlienVault’s Open Threat Exchange for threat intelligence data, and URLScan for recent scans. Each service required different API calls, authentication methods, and result parsing logic. Tools like Tomnomnom’s waybackurls solved part of this problem by automating Wayback Machine queries, but left researchers writing custom scripts to aggregate data from other sources.
gau (getallurls) consolidates these four providers into a single CLI tool optimized for security workflows. It’s designed for the reconnaissance phase of penetration testing and bug bounty hunting, where discovering forgotten admin panels, deprecated API endpoints, or leaked parameter names can be the difference between a surface-level assessment and finding critical vulnerabilities. The tool embraces Unix philosophy: it accepts domain lists via stdin, processes them concurrently, and outputs clean URL lists to stdout for piping into vulnerability scanners, content discovery tools, or custom analysis scripts.
Technical Insight
At its core, gau implements a concurrent provider pattern where each archive service is abstracted as a provider interface. When you execute gau example.com, the tool spawns goroutine workers (configurable via --threads) that query all enabled providers in parallel. The default configuration hits all four sources simultaneously:
# Query all providers with 5 concurrent workers
cat domains.txt | gau --threads 5
# Use only specific providers
gau --providers wayback,commoncrawl example.com
# Filter and deduplicate results
gau --blacklist png,jpg,gif --fp --mc 200,500 example.com
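The fan-out gau performs internally can be pictured in plain shell: one background worker per source, outputs interleaved as they arrive, then merged downstream. The fetch_* functions below are hypothetical stand-ins for the real provider queries, not gau internals:

```shell
# Toy analogue of gau's provider fan-out: one background job per source,
# results interleave on stdout as each worker produces them.
# fetch_wayback etc. are hypothetical stand-ins, not real gau code.
fetch_all() {
  domain=$1
  for provider in wayback commoncrawl otx urlscan; do
    "fetch_$provider" "$domain" &   # one concurrent worker per provider
  done
  wait                              # join all workers before returning
}

# Usage (once the four fetch_* stand-ins are defined):
#   fetch_all example.com | sort -u
```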
The --fp flag (filter parameters) is particularly clever for reconnaissance work. Web applications often expose the same endpoint with different parameter combinations—/api/users?id=1, /api/users?id=2&debug=true, etc. Without parameter filtering, you’d get thousands of duplicate URLs that differ only in query string values. When --fp is enabled, gau normalizes these to /api/users?id=&debug=, dramatically reducing noise while preserving the parameter structure you need for fuzzing.
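The effect is easy to approximate in plain shell. A rough sketch that blanks query-string values with sed and then deduplicates (it assumes no unescaped ampersands or equals signs outside the query string, and illustrates the idea rather than gau's actual code path):

```shell
# Blank every query-string value, then collapse now-identical URLs.
# Illustrates the idea behind --fp; not gau's implementation.
printf '%s\n' \
  'https://example.com/api/users?id=1' \
  'https://example.com/api/users?id=2&debug=true' \
  'https://example.com/api/users?id=9' |
  sed -E 's/=[^&]*/=/g' |   # keep parameter names, drop their values
  sort -u
```

Three input URLs collapse to two: one with the `id` parameter alone, one with `id` plus `debug`, preserving the parameter structure for fuzzing.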
The tool’s filtering system operates on multiple dimensions simultaneously. You can combine status code matching (--mc 200), MIME type filtering (--mt text/html,application/json), extension blacklisting (--blacklist woff,ttf), and date ranges (--from 202101 --to 202312) to narrow results before they even reach stdout. This is critical when dealing with Common Crawl, which can return hundreds of thousands of URLs for popular domains:
# Find all HTML/JSON endpoints from 2023 that returned 200 or 500 status codes
gau --from 202301 --to 202312 --mc 200,500 --mt text/html,application/json example.com
gau supports TOML configuration files at $HOME/.gau.toml, allowing you to persist settings across engagements. This is invaluable when you have standard filtering rules for your workflow:
# .gau.toml
threads = 10
blacklist = ["css", "woff", "ttf", "svg", "png", "jpg"]
fc = [404, 301, 302]
fp = true
proxy = "http://127.0.0.1:8080"
The --json output flag wraps each result in a JSON object, emitted one per line rather than as a single array, which enables more structured post-processing:
gau --json example.com | jq -r 'select(.url | test("/api/")) | .url'
One architectural decision worth noting: gau streams results as they arrive rather than buffering everything in memory. When querying providers that return hundreds of thousands of URLs, this streaming approach keeps memory usage constant while allowing downstream tools to start processing results immediately. You can pipe gau directly into tools like httpx, nuclei, or ffuf without waiting for the entire dataset to download.
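The same property holds for any line-oriented shell pipeline; a trivial stand-in producer makes the point, with head consuming early results instead of waiting for the full set:

```shell
# Stand-in for a streaming producer: each URL is written as soon as it is
# generated, so a downstream consumer (here, head) acts on early results
# immediately - the property that lets gau feed httpx or ffuf while
# archive queries are still in flight.
stream_urls() {
  for i in 1 2 3 4 5; do
    printf 'https://example.com/page/%s\n' "$i"
  done
}
stream_urls | head -n 3   # consumer stops after three results
```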
The proxy support (--proxy socks5://127.0.0.1:9050) integrates cleanly with operational security workflows. During penetration tests where you need to route traffic through Burp Suite or maintain anonymity via Tor, gau respects standard proxy configurations. The --retries and --timeout flags provide additional resilience when working with rate-limited APIs or unstable network conditions.
Gotcha
gau’s power comes entirely from its data sources, which means you inherit all their limitations. The Wayback Machine doesn’t archive sites that block its crawler via robots.txt, Common Crawl’s coverage is uneven across different TLDs and languages, and URLScan only contains URLs that someone submitted for scanning. For newly deployed applications or sites with aggressive crawler blocking, gau might return zero results while the application has hundreds of accessible endpoints.
There’s no built-in rate limiting or API key management. If you run gau against a large list of domains, you’ll likely hit rate limits from the archive providers, leading to incomplete results or temporary IP blocks. The tool doesn’t implement exponential backoff or respect HTTP 429 responses intelligently—it relies on simple retry logic that may not be sophisticated enough for production reconnaissance workflows at scale. You’ll need to implement your own throttling by processing domain lists in batches or adding delays between runs.
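A minimal batching wrapper is enough for coarse throttling. The sketch below assumes nothing about gau beyond its stdin interface; the batch size and delay are arbitrary starting values, and GAU_CMD is an illustrative override hook (handy for testing), not a gau feature:

```shell
# Feed a large domain list to gau in fixed-size batches with a pause
# between each, to stay under provider rate limits.
gau_batched() {
  domains_file=$1
  batch_size=${2:-20}   # domains per batch - arbitrary, tune per provider
  delay=${3:-30}        # seconds between batches - arbitrary
  total=$(wc -l < "$domains_file")
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + batch_size - 1))
    # One batch on stdin; results stream straight to stdout.
    sed -n "${start},${end}p" "$domains_file" | ${GAU_CMD:-gau --threads 5}
    start=$((end + 1))
    if [ "$start" -le "$total" ]; then
      sleep "$delay"
    fi
  done
}

# Usage:
#   gau_batched domains.txt 20 30 > urls.txt
```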
The oh-my-zsh conflict is a genuine annoyance for security professionals who commonly use zsh. The git plugin aliases gau to git add --update, meaning you’ll need to either disable that specific alias, use the full path /usr/local/bin/gau, or create a shell function wrapper. This isn’t a showstopper, but it’s friction in environments where zsh is standard.
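Either fix is a one-liner in ~/.zshrc, placed after oh-my-zsh loads its plugins. The /usr/local/bin path is an assumed install location; point it wherever your binary actually lives:

```shell
# ~/.zshrc, after oh-my-zsh initialization:
unalias gau 2>/dev/null || true              # drop the git-add alias if set
gau() { command /usr/local/bin/gau "$@"; }   # always invoke the real binary
```

The unalias must come first: zsh refuses to define a function whose name is currently an active alias.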
Output deduplication happens across all providers, but gau doesn’t persist a cache between runs. If you’re doing iterative reconnaissance where you query the same domains multiple times with different filters, you’ll re-download the same archive data repeatedly. There’s no way to build a local cache of provider responses for offline filtering or faster subsequent queries.
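A crude cache is straightforward to bolt on yourself. The sketch below stores one result file per domain and replays it on later runs; the cache directory layout and the GAU_CMD override are illustrative choices rather than gau features, and nothing ever expires entries:

```shell
# Naive per-domain cache: query the archives once, filter offline afterwards.
# Assumes domain names are safe as filenames (no slashes).
gau_cached() {
  domain=$1
  cache_dir="${GAU_CACHE_DIR:-$HOME/.cache/gau-results}"
  mkdir -p "$cache_dir"
  cache_file="$cache_dir/$domain.txt"
  if [ ! -s "$cache_file" ]; then
    # Cache miss: fetch once and persist the raw URL list.
    printf '%s\n' "$domain" | ${GAU_CMD:-gau} > "$cache_file"
  fi
  cat "$cache_file"
}

# Usage: the second run reads from disk instead of re-querying providers.
#   gau_cached example.com | grep -i admin
```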
Verdict
Use if you’re conducting security assessments, bug bounty reconnaissance, or attack surface enumeration where historical URL data reveals forgotten endpoints, old API versions, or leaked administrative interfaces. gau excels in passive reconnaissance phases where you want comprehensive URL discovery without actively crawling target infrastructure. It’s particularly valuable when combined with other tools in a pipeline: gau for URL discovery, httpx for validation, nuclei for vulnerability scanning. The multi-provider aggregation saves significant time compared to querying each archive service manually.

Skip if you need real-time URL discovery from live crawling (use hakrawler or gospider instead), require sophisticated API key rotation and rate limit management for large-scale operations, or work primarily with modern applications deployed in the last few months where archive coverage is minimal. Also reconsider if you’re not in a security context—passive URL enumeration from archives is specifically useful for finding security-relevant historical data, not for general web development or SEO analysis where active crawling provides better results.