Building a Web Archive with page-fetch: Headless Chrome for Security Research

Hook

Most web scraping tools see only the HTML skeleton of modern web apps—missing the JavaScript files, API calls, and dynamically loaded content that contain the real attack surface.

Context

Security researchers and bug bounty hunters face a fundamental problem: modern web applications don't reveal themselves through simple HTTP requests. A typical single-page application might load dozens of JavaScript bundles, make multiple API calls, and dynamically generate content that never appears in the initial HTML response. Traditional tools like curl or wget capture only that first HTML document—essentially a blank canvas before the browser paints the real application.

This gap matters immensely for security work. To find vulnerabilities, you need to see what the browser sees: every loaded script, every XHR request, every resource pulled from CDNs or third-party domains. You need to execute JavaScript in the page context to trigger authentication flows, extract CSRF tokens, or probe how the application handles edge cases. The Detectify team built page-fetch to solve exactly this problem—wrapping headless Chrome in a concurrent, filesystem-friendly tool that captures the complete browser experience of a web page, not just its initial payload.

Technical Insight

At its core, page-fetch is a Go wrapper around chromedp that uses the Chrome DevTools Protocol to archive every resource a page loads. The architecture is simple: read URLs from stdin, spawn headless Chrome instances (with configurable concurrency), navigate to each page, intercept all network requests through Chrome's Network domain, and write the responses to disk in a directory hierarchy that mirrors each URL's host and path.

The interesting engineering decision is how it handles resource interception. Rather than scraping the DOM or parsing links, page-fetch hooks into Chrome's request lifecycle before resources even load. This means it captures everything the browser fetches—including resources you might not know exist, like dynamically imported JavaScript modules or assets loaded by third-party scripts. Here's a basic usage pattern:

# Fetch a page and save all resources
echo "https://example.com" | page-fetch -output ./archive

# The output directory mirrors the URL structure:
# archive/example.com/index.html
# archive/example.com/assets/app.js
# archive/example.com/assets/app.js.meta  # HTTP headers and metadata
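
The mapping from URL to file path can be approximated in a few lines of shell. This is a minimal sketch based only on the listing above: url_to_path is a hypothetical helper, and the rules it applies (dropping the query string and fragment, using index.html for directory-style URLs) are assumptions rather than page-fetch's documented behavior.

```shell
# Hypothetical helper: approximates the directory layout shown above.
# page-fetch's real naming rules may differ.
url_to_path() {
  local rest="${1#*://}"   # drop the scheme (https://, http://)
  rest="${rest%%#*}"       # drop any fragment
  rest="${rest%%\?*}"      # drop any query string
  case "$rest" in
    */)  rest="${rest}index.html" ;;    # directory-style URL
    */*) ;;                             # already names a file under the host
    *)   rest="${rest}/index.html" ;;   # bare host with no path
  esac
  printf '%s\n' "$rest"
}
```

For example, url_to_path 'https://example.com/assets/app.js?v=2' prints example.com/assets/app.js, which helps predict where a capture should land under the output directory.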

The metadata files are particularly clever for forensic work. Each fetched resource gets a companion .meta file containing the original URL, response headers, and status code—crucial for understanding CDN behaviors, caching policies, or tracking how resources change over time.
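
Because each capture has a sidecar file, ordinary text tools can answer questions such as "which responses set a given header?". The following is a minimal sketch, assuming the .meta files are plain text in which header names appear verbatim (the exact layout is not described here); meta_with_header is a hypothetical helper, not part of page-fetch.

```shell
# List the .meta files under a directory that mention a header name.
# Assumption: .meta files are plain text; matching is case-insensitive
# and treats the header name as a fixed string, not a regex.
meta_with_header() {
  local dir="$1" header="$2"
  find "$dir" -type f -name '*.meta' \
    -exec grep -liF -- "$header" {} + | sort
}
```

Running meta_with_header ./archive content-security-policy would then list the captures whose metadata mentions that header, one path per line.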

Where page-fetch truly shines is JavaScript execution. You can inject arbitrary code into the page context and extract return values, enabling sophisticated data extraction from JavaScript-heavy applications:

# Execute JavaScript and capture the result
echo "https://example.com" | page-fetch \
  -output ./archive \
  -javascript "document.querySelector('.csrf-token').value"

# Or extract complex objects
echo "https://app.example.com" | page-fetch \
  -javascript "JSON.stringify(window.__INITIAL_STATE__)"

This JavaScript execution happens after the page fully loads (respecting the configurable timeout), so you're working with the complete rendered application state. For security research, this means you can extract authentication tokens, probe exposed API endpoints in JavaScript globals, or trigger specific application states before archiving.

The concurrency model uses goroutines to manage multiple Chrome instances simultaneously. Each URL is processed independently, which suits bulk analysis of many domains but requires careful resource management: Chrome instances are memory-hungry, so the default concurrency is deliberately conservative. The implementation follows a worker pool pattern in which each worker maintains its own Chrome context. You can set the concurrency explicitly on the command line:

# Process 100 URLs with controlled concurrency
cat urls.txt | page-fetch -concurrency 3 -output ./crawl
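
The worker pool idea has a familiar shell analogue: a fixed number of workers pulling lines from a shared input. The sketch below uses xargs -P to illustrate bounded parallelism. It is an analogy only, not how page-fetch is implemented; run_pool is a hypothetical helper, and -P is a GNU/BSD extension rather than POSIX.

```shell
# Run a command once per stdin line, with at most $1 processes at a time.
# The input line is appended as the command's final argument.
run_pool() {
  local workers="$1"
  shift
  xargs -P "$workers" -I '{}' "$@" '{}'
}
```

For instance, printf 'a\nb\nc\n' | run_pool 2 echo fetched prints three lines in no guaranteed order, with at most two echo processes alive at once. The same trade-off applies to page-fetch: more workers finish sooner but hold more memory at the same time.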

For security researchers integrating page-fetch into larger workflows, the proxy support is essential. Pointing traffic through Burp Suite or ZAP allows you to analyze requests, modify responses, or identify potential injection points:

# Route all Chrome traffic through Burp Suite
echo "https://target.com" | page-fetch \
  -proxy http://127.0.0.1:8080 \
  -output ./analysis

The content-type filtering is another practical feature for focused research. If you're analyzing JavaScript files for vulnerable libraries, you can exclude images and stylesheets to reduce noise:

# Only capture JavaScript and HTML
echo "https://app.com" | page-fetch \
  -include-content-type "text/html,application/javascript" \
  -output ./js-analysis
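
Content-Type headers often carry parameters such as a charset, so an exact string comparison against an allow list would miss "text/html; charset=utf-8". Below is a minimal sketch of the matching logic, assuming a comma-separated allow list like the one in the example above; content_type_allowed is a hypothetical helper and does not reflect how page-fetch itself matches types.

```shell
# Succeed if the media type in $1 appears in the comma-separated list $2.
# Parameters (e.g. "; charset=utf-8"), whitespace and case are ignored.
content_type_allowed() {
  local media
  media=$(printf '%s' "${1%%;*}" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')
  case ",$2," in
    *",$media,"*) return 0 ;;
    *)            return 1 ;;
  esac
}
```

With this helper, content_type_allowed 'text/html; charset=utf-8' 'text/html,application/javascript' succeeds, while the same list rejects image/png.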

Gotcha

The biggest operational challenge with page-fetch is managing the Chrome dependency. Unlike pure Go tools that compile to a single static binary, page-fetch needs Chrome or Chromium installed on every system where it runs. This creates deployment friction: Docker images need Chrome packages, CI/CD pipelines require additional setup steps, and a mismatch between the Chrome version page-fetch expects and the one actually installed can cause cryptic failures. The tool doesn't bundle Chrome or provide version checking, so you're responsible for ensuring compatibility.

File naming conflicts reveal another rough edge. When multiple resources map to the same filesystem path (common with query parameters or URL fragments), page-fetch appends numeric suffixes: resource.js, resource.js.1, resource.js.2. This prevents data loss but makes it difficult to correlate resources with their original URLs without parsing the .meta files. For large-scale archival, you'll likely need post-processing scripts to organize captures. Rate limiting is also minimal: there's a basic delay flag, but no adaptive backoff and no awareness of robots.txt, so aggressive crawling of production sites can get your IP blocked.
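
The numeric suffix scheme described above can be sketched as a small shell function. It illustrates the naming pattern only and is not taken from page-fetch's source; unique_path is a hypothetical helper.

```shell
# Print the first path among base, base.1, base.2, ... that does not exist.
unique_path() {
  local base="$1" candidate="$1" n=0
  while [ -e "$candidate" ]; do
    n=$((n + 1))
    candidate="$base.$n"
  done
  printf '%s\n' "$candidate"
}
```

A post-processing script can use the same rule in reverse: any capture whose name ends in a numeric suffix shares its base path with at least one other resource, so its .meta file is the place to look for the original URL.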

Verdict

Use if: You're doing security research that requires seeing the complete browser-rendered state of web applications, need to capture all resources including dynamically loaded JavaScript for vulnerability analysis, want to execute JavaScript in page context to extract data or tokens, or are building automated workflows that need to archive modern SPAs with all their dependencies. The Go binary and stdin interface make it perfect for Unix pipelines and integration with other security tools.

Skip if: You only need simple HTML scraping (curl is faster and simpler), can't accommodate Chrome as a system dependency in your deployment environment, need sophisticated crawling logic like link following or form submission (use Puppeteer/Playwright instead), or require fine-grained rate limiting and session management for large-scale crawling operations.
