
page-fetch: Building a JavaScript Supply Chain Scanner with Headless Chrome


Hook

While curl fetches only what the server sends, modern web pages dynamically pull in dozens of JavaScript files after the initial HTML arrives. Traditional HTTP clients miss the complete picture of what a browser actually loads.

Context

Traditional HTTP clients like curl and wget operate at the protocol level—they fetch exactly what the server responds with and nothing more. But modern web applications are orchestration layers: the initial HTML is just a bootstrap that triggers cascading requests for JavaScript bundles, stylesheets, fonts, tracking pixels, and third-party integrations. If you’re researching how a site actually behaves, analyzing its JavaScript dependencies, or testing for client-side vulnerabilities, you need to see what a real browser sees.

page-fetch emerged from the security research team at Detectify to solve this exact problem. Security researchers needed a way to systematically capture and analyze JavaScript loaded across thousands of domains—not just for pentesting individual sites, but for understanding patterns in third-party integrations, supply chain dependencies, and client-side attack vectors at scale. The tool bridges the gap between browser DevTools (powerful but manual) and command-line HTTP clients (scriptable but blind to dynamic content). It wraps the Chrome DevTools Protocol in a Unix-friendly CLI that accepts URLs on stdin and orchestrates headless Chrome instances to capture everything a page loads, storing it in an analyzable directory structure.

Technical Insight

[System architecture diagram (auto-generated): URLs arrive on stdin and enter a URL queue; a Chrome pool manager spawns headless Chrome instances 1 through N, driven by a chromedp controller; a network interceptor captures headers, status, and body content; a resource fetcher and response handler hand results to a disk writer, which produces resource files and .meta files, while a JS executor emits serialized results.]

At its core, page-fetch is a Go wrapper around chromedp, a high-level library for driving Chrome via the DevTools Protocol. The architecture is elegantly simple: it reads URLs from stdin, spawns a pool of headless Chrome instances based on the concurrency flag (defaulting to 2), and intercepts every network request the browser makes. Each fetched resource gets written to disk in a directory structure that mirrors the URL path, alongside a .meta file containing the full request/response metadata including headers, status codes, and the original URL with query parameters.
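That layout makes the output trivially scriptable. The sketch below pairs each saved resource with the first line of its metadata; note the sibling naming convention (resource plus a file.meta neighbor) is an assumption here, and a mock tree stands in for a real crawl so nothing needs Chrome:

```shell
# Build a mock out/ tree mimicking page-fetch's URL-mirroring layout
# (the "<file>.meta" sibling naming is assumed, not taken from the docs).
mkdir -p out/example.com/js
printf 'console.log("app")' > out/example.com/js/app.js
printf 'GET https://example.com/js/app.js 200 application/javascript' \
  > out/example.com/js/app.js.meta

# Pair every resource with the first line of its metadata file.
find out -type f ! -name '*.meta' | while read -r f; do
  printf '%s\t%s\n' "$f" "$(head -n1 "${f}.meta" 2>/dev/null)"
done
```

The same walk generalizes to any post-processing: swap the printf for a grep over headers, a hash of the body, or a diff against a previous crawl.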

What makes this powerful is the interception layer. Unlike running Chrome manually and inspecting the Network tab, page-fetch programmatically captures all resources:

echo https://detectify.com | page-fetch
# Output:
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8

The directory structure makes post-processing trivial. Want to analyze all JavaScript files across 100 domains? Run page-fetch, then use standard Unix tools:

cat urls.txt | page-fetch --include application/javascript
find out/ -name '*.js' -type f | xargs grep -l 'eval('

The JavaScript execution capability transforms page-fetch from a passive observer into an active interrogator. The -j flag runs arbitrary JavaScript in the page context and serializes the return value. Want to extract all third-party script sources from 1000 pages?

cat domains.txt | page-fetch --javascript '[...document.querySelectorAll("script[src]")].map(s => s.src).filter(u => !u.includes(document.domain))' | grep ^JS

This returns structured data you can pipe into further analysis. The JavaScript runs after the page fully loads, so you’re querying the live DOM, not static HTML. You can access localStorage, extract JWT tokens, examine shadow DOM elements, or trigger client-side functions to see how they behave.
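As an example of that further analysis, the serialized results can feed a quick frequency count of third-party script hosts. The line shape used below ("JS (<url>): <json-array>") is an assumption inferred from the grep ^JS filter above, not a documented format, and two mock output lines stand in for a real run:

```shell
# Two mock lines of page-fetch --javascript output; the "JS (<url>): <json>"
# line shape is an assumption inferred from the ^JS prefix, not documented.
results='JS (https://a.example): ["https://cdn.one.net/x.js","https://cdn.two.net/y.js"]
JS (https://b.example): ["https://cdn.one.net/x.js"]'

# Strip the prefix, crudely split the JSON array of strings, reduce each
# URL to its hostname, and rank hosts by how many pages pulled them in.
hosts=$(printf '%s\n' "$results" |
  grep '^JS' |
  sed -E 's/^JS \([^)]*\): //' |
  tr -d '[]"' | tr ',' '\n' |
  awk -F/ '{print $3}' |
  sort | uniq -c | sort -rn)
printf '%s\n' "$hosts"
```

A proper JSON parser like jq is the sturdier choice for real data; the tr-based splitter only holds for flat arrays of plain URLs.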

The filtering options (--include, --exclude, --no-third-party) let you scope your captures. Security researchers often run page-fetch with --include application/javascript to build a corpus of client-side code, then grep for patterns like API keys, outdated libraries, or insecure implementations. The --proxy flag pipes everything through Burp Suite or similar tools, letting you inspect or modify traffic:

echo https://example.com | page-fetch --proxy http://localhost:8080

One underappreciated design choice: chromedp automatically searches for Chrome executables across multiple naming conventions (chromium, google-chrome, headless-shell, etc.). This reduces friction across different environments—your research script that works on macOS with google-chrome-stable will also work on Ubuntu with chromium-browser without modification. The tool respects the Unix philosophy: do one thing well, accept input on stdin, produce parseable output, and compose with other tools. You can pipe URLs from SQL queries, filter with grep, parallelize with xargs, and store results however you want.
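That composability is easy to demonstrate with ordinary pipes. Below, a mocked URL feed (standing in for a database query or another tool's output) is deduplicated and filtered before it would reach page-fetch's stdin:

```shell
# Dedupe and filter a URL feed before handing it to page-fetch.
# The feed is mocked here; in practice it might come from a SQL query,
# a subdomain enumerator, or a previous crawl's .meta files.
printf '%s\n' \
  'https://a.example/' 'https://a.example/' 'https://b.example/login' |
  sort -u |            # drop duplicate targets
  grep -v '/login' |   # skip pages we can't reach unauthenticated
  tee urls.txt         # save the final feed (and show it)
```

The saved urls.txt can then be piped straight into page-fetch, exactly as in the earlier examples.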

Gotcha

The biggest limitation is resource consumption. Each concurrent instance is a full Chromium browser process with associated memory and CPU overhead for rendering and JavaScript execution. The default concurrency of 2 is conservative for good reason. Running at high concurrency means substantial RAM usage. If you’re analyzing thousands of URLs, you’ll need patience or serious hardware. Unlike lightweight HTTP clients that can handle hundreds of concurrent connections, page-fetch trades throughput for completeness.
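One pragmatic mitigation is to recycle the browser pool between fixed-size batches, so Chrome's memory is reclaimed after each chunk. This is a workflow sketch, not a page-fetch feature: the batch size is arbitrary, and echo stands in for the real invocation so the loop can run anywhere:

```shell
# Generate a mock URL list so the sketch runs standalone.
seq 1 120 | sed 's|^|https://site|; s|$|.example/|' > urls.txt

# 120 URLs at 50 per batch -> batch_aa, batch_ab, batch_ac.
split -l 50 urls.txt batch_

for f in batch_*; do
  # One short-lived page-fetch run per batch; swap echo for the real command.
  echo "page-fetch --concurrency 2 < $f"
done
```

Each run tears down its Chrome processes on exit, which caps peak RAM at one batch's worth rather than the whole list's.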

Authentication and session management require manual work. There’s no built-in cookie jar, no login-flow helpers, no session persistence across URLs. If you need to scan authenticated pages, you’ll have to handle cookies yourself, either by modifying the chromedp setup in the source or by using the proxy option to inject session tokens through an intercepting proxy. The tool is a research instrument built by engineers for engineers; expect to read the source code for advanced use cases or edge-case behavior.
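One workable pattern, sketched below, is to inject the session cookie at the proxy layer and point page-fetch at it. The commands are printed rather than executed here, since they need mitmproxy installed and a real token; the --modify-headers syntax is mitmproxy's, ~q scopes the rewrite to requests, and the cookie value is a placeholder:

```shell
# Print the two-step recipe: a header-rewriting proxy plus page-fetch --proxy.
# The session value is a placeholder; substitute a captured token.
SESSION='session=PLACEHOLDER'
recipe=$(cat <<EOF
mitmdump --listen-port 8080 --modify-headers '/~q/Cookie/$SESSION' &
echo https://example.com | page-fetch --proxy http://localhost:8080
EOF
)
printf '%s\n' "$recipe"
```

Because the rewrite happens in the proxy, every request Chrome makes (including third-party subresources, which you may want to filter) carries the injected header without touching page-fetch itself.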

Verdict

Use page-fetch if you’re doing security research on client-side code, analyzing JavaScript supply chains across multiple domains, or need to capture the complete resource graph of modern web applications. It’s perfect for building datasets of third-party integrations, extracting data via DOM queries at scale, or automating browser-based reconnaissance where traditional HTTP clients fall short. The JavaScript execution feature makes it invaluable for systematic analysis that would otherwise require manual clicking through DevTools on hundreds of sites.

Skip it if you’re just scraping static content (use curl or wget instead), need high-throughput crawling of thousands of pages (the resource overhead won’t scale), require sophisticated authentication flows without building custom solutions, or want a tool with extensive built-in session management. This is a power tool for researchers who understand the trade-offs and are comfortable working at the command line.
