gau: Mining Four Internet Archives at Once for Security Reconnaissance
Hook
The average web application has been crawled and archived dozens of times over its lifetime, creating a permanent record of endpoints, parameters, and API routes that may no longer be protected—or even remembered by current developers.
Context
Before gau, security researchers conducting reconnaissance had to manually query multiple archive services: the Internet Archive’s Wayback Machine for historical snapshots, Common Crawl’s petabyte-scale web index, AlienVault’s Open Threat Exchange for threat intelligence data, and URLScan for recent scans. Each service required different API calls, authentication methods, and result parsing logic. Tools like Tomnomnom’s waybackurls solved part of this problem by automating Wayback Machine queries, but left researchers writing custom scripts to aggregate data from other sources.
gau (getallurls) consolidates these four providers into a single CLI tool optimized for security workflows. It’s designed for the reconnaissance phase of penetration testing and bug bounty hunting, where discovering forgotten admin panels, deprecated API endpoints, or leaked parameter names can be the difference between a surface-level assessment and finding critical vulnerabilities. The tool embraces Unix philosophy: it accepts domain lists via stdin, processes them concurrently, and outputs clean URL lists to stdout for piping into vulnerability scanners, content discovery tools, or custom analysis scripts.
Technical Insight
At its core, gau implements a concurrent provider pattern where each archive service is abstracted as a provider interface. When you execute gau example.com, the tool spawns goroutine workers (configurable via --threads) that query all enabled providers in parallel. The default configuration hits all four sources simultaneously:
# Query all providers with 5 concurrent workers
cat domains.txt | gau --threads 5
# Use only specific providers
gau --providers wayback,commoncrawl example.com
# Filter and deduplicate results
gau --blacklist png,jpg,gif --fp --mc 200,500 example.com
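The fan-out gau performs internally can be pictured in plain shell: one background worker per source, outputs interleaved as they arrive, then merged downstream. The fetch_* functions below are hypothetical stand-ins for the real provider queries, not gau internals:

```shell
# Toy analogue of gau's provider fan-out: one background job per source,
# results interleave on stdout as each worker produces them.
# fetch_wayback etc. are hypothetical stand-ins, not real gau code.
fetch_all() {
  domain=$1
  for provider in wayback commoncrawl otx urlscan; do
    "fetch_$provider" "$domain" &   # one concurrent worker per provider
  done
  wait                              # join all workers before returning
}

# Usage (once the four fetch_* stand-ins are defined):
#   fetch_all example.com | sort -u
```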
The --fp flag (filter parameters) is particularly clever for reconnaissance work. Web applications often expose the same endpoint with different parameter combinations—/api/users?id=1, /api/users?id=2&debug=true, etc. Without parameter filtering, you’d get thousands of duplicate URLs that differ only in query string values. When --fp is enabled, gau normalizes these to /api/users?id=&debug=, dramatically reducing noise while preserving the parameter structure you need for fuzzing.
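The effect is easy to approximate in plain shell. A rough sketch that blanks query-string values with sed and then deduplicates (it assumes no unescaped ampersands or equals signs outside the query string, and illustrates the idea rather than gau's actual code path):

```shell
# Blank every query-string value, then collapse now-identical URLs.
# Illustrates the idea behind --fp; not gau's implementation.
printf '%s\n' \
  'https://example.com/api/users?id=1' \
  'https://example.com/api/users?id=2&debug=true' \
  'https://example.com/api/users?id=9' |
  sed -E 's/=[^&]*/=/g' |   # keep parameter names, drop their values
  sort -u
```

Three input URLs collapse to two: one with the `id` parameter alone, one with `id` plus `debug`, preserving the parameter structure for fuzzing.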
The tool’s filtering system operates on multiple dimensions simultaneously. You can combine status code matching (--mc 200), MIME type filtering (--mt text/html,application/json), extension blacklisting (--blacklist woff,ttf), and date ranges (--from 202101 --to 202312) to narrow results before they even reach stdout. This is critical when dealing with Common Crawl, which can return hundreds of thousands of URLs for popular domains:
# Find all HTML/JSON endpoints from 2023 that returned 200 or 500 status codes
gau --from 202301 --to 202312 --mc 200,500 --mt text/html,application/json example.com
gau supports TOML configuration files at $HOME/.gau.toml, allowing you to persist settings across engagements. This is invaluable when you have standard filtering rules for your workflow:
# .gau.toml
threads = 10
blacklist = ["css", "woff", "ttf", "svg", "png", "jpg"]
fc = [404, 301, 302]
fp = true
proxy = "http://127.0.0.1:8080"
The --json output flag wraps each result in a JSON object, emitted one per line rather than as a single array, which enables more structured post-processing:
gau --json example.com | jq -r 'select(.url | test("/api/")) | .url'
One architectural decision worth noting: gau streams results as they arrive rather than buffering everything in memory. When querying providers that return hundreds of thousands of URLs, this streaming approach keeps memory usage constant while allowing downstream tools to start processing results immediately. You can pipe gau directly into tools like httpx, nuclei, or ffuf without waiting for the entire dataset to download.
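The same property holds for any line-oriented shell pipeline; a trivial stand-in producer makes the point, with head consuming early results instead of waiting for the full set:

```shell
# Stand-in for a streaming producer: each URL is written as soon as it is
# generated, so a downstream consumer (here, head) acts on early results
# immediately - the property that lets gau feed httpx or ffuf while
# archive queries are still in flight.
stream_urls() {
  for i in 1 2 3 4 5; do
    printf 'https://example.com/page/%s\n' "$i"
  done
}
stream_urls | head -n 3   # consumer stops after three results
```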
The proxy support (--proxy socks5://127.0.0.1:9050) integrates cleanly with operational security workflows. During penetration tests where you need to route traffic through Burp Suite or maintain anonymity via Tor, gau respects standard proxy configurations. The --retries and --timeout flags provide additional resilience when working with rate-limited APIs or unstable network conditions.
Gotcha
gau’s power comes entirely from its data sources, which means you inherit all their limitations. The Wayback Machine doesn’t archive sites that block its crawler via robots.txt, Common Crawl’s coverage is uneven across different TLDs and languages, and URLScan only contains URLs that someone submitted for scanning. For newly deployed applications or sites with aggressive crawler blocking, gau might return zero results while the application has hundreds of accessible endpoints.
There’s no built-in rate limiting or API key management. If you run gau against a large list of domains, you’ll likely hit rate limits from the archive providers, leading to incomplete results or temporary IP blocks. The tool doesn’t implement exponential backoff or respect HTTP 429 responses intelligently—it relies on simple retry logic that may not be sophisticated enough for production reconnaissance workflows at scale. You’ll need to implement your own throttling by processing domain lists in batches or adding delays between runs.
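A minimal batching wrapper is enough for coarse throttling. The sketch below assumes nothing about gau beyond its stdin interface; the batch size and delay are arbitrary starting values, and GAU_CMD is an illustrative override hook (handy for testing), not a gau feature:

```shell
# Feed a large domain list to gau in fixed-size batches with a pause
# between each, to stay under provider rate limits.
gau_batched() {
  domains_file=$1
  batch_size=${2:-20}   # domains per batch - arbitrary, tune per provider
  delay=${3:-30}        # seconds between batches - arbitrary
  total=$(wc -l < "$domains_file")
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + batch_size - 1))
    # One batch on stdin; results stream straight to stdout.
    sed -n "${start},${end}p" "$domains_file" | ${GAU_CMD:-gau --threads 5}
    start=$((end + 1))
    if [ "$start" -le "$total" ]; then
      sleep "$delay"
    fi
  done
}

# Usage:
#   gau_batched domains.txt 20 30 > urls.txt
```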
The oh-my-zsh conflict is a genuine annoyance for security professionals who commonly use zsh. The git plugin aliases gau to git add --update, meaning you’ll need to either disable that specific alias, use the full path /usr/local/bin/gau, or create a shell function wrapper. This isn’t a showstopper, but it’s friction in environments where zsh is standard.
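Either fix is a one-liner in ~/.zshrc, placed after oh-my-zsh loads its plugins. The /usr/local/bin path is an assumed install location; point it wherever your binary actually lives:

```shell
# ~/.zshrc, after oh-my-zsh initialization:
unalias gau 2>/dev/null || true              # drop the git-add alias if set
gau() { command /usr/local/bin/gau "$@"; }   # always invoke the real binary
```

The unalias must come first: zsh refuses to define a function whose name is currently an active alias.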
Output deduplication happens across all providers, but gau doesn’t persist a cache between runs. If you’re doing iterative reconnaissance where you query the same domains multiple times with different filters, you’ll re-download the same archive data repeatedly. There’s no way to build a local cache of provider responses for offline filtering or faster subsequent queries.
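A crude cache is straightforward to bolt on yourself. The sketch below stores one result file per domain and replays it on later runs; the cache directory layout and the GAU_CMD override are illustrative choices rather than gau features, and nothing ever expires entries:

```shell
# Naive per-domain cache: query the archives once, filter offline afterwards.
# Assumes domain names are safe as filenames (no slashes).
gau_cached() {
  domain=$1
  cache_dir="${GAU_CACHE_DIR:-$HOME/.cache/gau-results}"
  mkdir -p "$cache_dir"
  cache_file="$cache_dir/$domain.txt"
  if [ ! -s "$cache_file" ]; then
    # Cache miss: fetch once and persist the raw URL list.
    printf '%s\n' "$domain" | ${GAU_CMD:-gau} > "$cache_file"
  fi
  cat "$cache_file"
}

# Usage: the second run reads from disk instead of re-querying providers.
#   gau_cached example.com | grep -i admin
```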
Verdict
Use if you’re conducting security assessments, bug bounty reconnaissance, or attack surface enumeration where historical URL data reveals forgotten endpoints, old API versions, or leaked administrative interfaces. gau excels in passive reconnaissance phases where you want comprehensive URL discovery without actively crawling target infrastructure. It’s particularly valuable when combined with other tools in a pipeline: gau for URL discovery, httpx for validation, nuclei for vulnerability scanning. The multi-provider aggregation saves significant time compared to querying each archive service manually.

Skip if you need real-time URL discovery from live crawling (use hakrawler or gospider instead), require sophisticated API key rotation and rate limit management for large-scale operations, or work primarily with modern applications deployed in the last few months where archive coverage is minimal. Also reconsider if you’re not in a security context—passive URL enumeration from archives is specifically useful for finding security-relevant historical data, not for general web development or SEO analysis where active crawling provides better results.