ZBrowse: Chrome DevTools Protocol for Internet-Scale Website Archaeology

Hook

While most developers use headless browsers to automate clicks and scrape content, security researchers need to understand how the entire web is built—one million sites at a time.

Context

The ZMap Project made headlines in 2013 by scanning the entire IPv4 address space in under 45 minutes. But understanding what's listening on port 443 is only half the battle—modern web security research requires understanding what those servers actually deliver, how websites load their resources, and what dependencies lurk in the DOM. Traditional browser automation tools like Selenium and Puppeteer excel at simulating user interactions, but they're optimized for testing individual applications, not for extracting structured data about web architecture patterns across thousands or millions of sites.

ZBrowse fills this gap by treating headless Chrome as a measurement instrument rather than an automation target. Instead of focusing on clicking buttons or filling forms, it captures the complete loading behavior of web pages: every network request, every DOM mutation, every dependency relationship. The output isn't screenshots or scraped text—it's structured JSON documents that describe how websites are architecturally assembled. This makes ZBrowse invaluable for researchers studying trends in web technologies, tracking malware distribution networks, or analyzing how privacy-invasive scripts propagate across the internet.

Technical Insight

[Figure: system architecture diagram, auto-generated]

ZBrowse is fundamentally a Node.js wrapper around the Chrome DevTools Protocol (CDP), but its architecture reveals careful design decisions for measurement rigor. Unlike Puppeteer, which provides high-level abstractions for common automation tasks, ZBrowse stays closer to CDP's raw capabilities, instrumenting specific protocol domains that matter for architectural analysis.

The core instrumentation happens through CDP's Network, Page, and Runtime domains. Here's a simplified example of how ZBrowse captures network request chains:

const CDP = require('chrome-remote-interface');

async function captureLoadingBehavior(url) {
  const client = await CDP();
  const {Network, Page, Performance} = client;

  const requests = [];
  const dependencies = new Map();

  // Track every network request with initiator information
  Network.requestWillBeSent((params) => {
    requests.push({
      requestId: params.requestId,
      url: params.request.url,
      initiator: params.initiator,
      timestamp: params.timestamp,
      type: params.type
    });

    // Build dependency tree from initiator chains
    if (params.initiator.type === 'parser' || params.initiator.type === 'script') {
      // initiator.url is optional, so fall back to the requesting document
      const parent = params.initiator.url || params.documentURL;
      if (!dependencies.has(parent)) {
        dependencies.set(parent, []);
      }
      dependencies.get(parent).push(params.request.url);
    }
  });

  try {
    await Network.enable();
    await Page.enable();
    await Performance.enable(); // required before Performance.getMetrics()

    // Subscribe to the load event before navigating so it cannot be missed
    const loaded = Page.loadEventFired();
    await Page.navigate({url});
    await loaded;

    // Fixed 5-second settle window for late requests (not true network-idle detection)
    await new Promise(resolve => setTimeout(resolve, 5000));

    return {
      url,
      requests,
      dependencyTree: Object.fromEntries(dependencies),
      metrics: (await Performance.getMetrics()).metrics
    };
  } finally {
    // Always release the DevTools connection, even if navigation fails
    await client.close();
  }
}

This approach captures not just what resources loaded, but why they loaded—which script requested which tracker, which stylesheet pulled in which font. The initiator chains become a dependency graph revealing architectural patterns invisible to traditional web crawlers.
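To make the "why" concrete, here is a minimal sketch (not ZBrowse's own code) of walking a captured dependency tree back to the page that started a chain. It assumes the `dependencyTree` shape returned by the example above, a parent URL mapped to an array of child URLs. The helper name `initiatorChain` and the sample URLs are made up for illustration.

```javascript
// Hypothetical helper: trace a resource back to the root document using the
// parent -> children map produced by the capture example above.
function initiatorChain(dependencyTree, targetUrl) {
  // Invert the map so each child points at the first parent that requested it.
  const parentOf = new Map();
  for (const [parent, children] of Object.entries(dependencyTree)) {
    for (const child of children) {
      if (!parentOf.has(child)) parentOf.set(child, parent);
    }
  }

  const chain = [targetUrl];
  const seen = new Set(chain); // guards against cycles in malformed data
  let current = targetUrl;
  while (parentOf.has(current)) {
    current = parentOf.get(current);
    if (seen.has(current)) break;
    seen.add(current);
    chain.unshift(current);
  }
  return chain; // root first, target last
}

// Sample data, invented for illustration only.
const exampleTree = {
  'https://site.example/': ['https://cdn.example/app.js'],
  'https://cdn.example/app.js': ['https://ads.example/tag.js'],
  'https://ads.example/tag.js': ['https://tracker.example/pixel.gif']
};

console.log(initiatorChain(exampleTree, 'https://tracker.example/pixel.gif').join(' -> '));
```

Keeping only the first parent per child is a simplification: a resource requested by several scripts would need a full graph rather than a single chain.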

ZBrowse's real power emerges in batch processing scenarios. The tool is designed to spawn multiple Chromium instances in parallel, each isolated in its own process with fresh browser state. This matters enormously for measurement accuracy—cookies, cache, and localStorage from one site can't leak into measurements of another. The JSON output format is deliberately machine-readable, designed to pipe into analysis frameworks that aggregate patterns across thousands of captures.
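The batch pattern described above can be sketched as a small concurrency-limited pool. This is an illustration under stated assumptions, not ZBrowse's actual scheduler: `runBatch` and `fakeCapture` are hypothetical names, and the worker here is a stand-in for whatever function launches an isolated browser and captures one URL.

```javascript
// Run `worker` over every URL with at most `concurrency` tasks in flight.
// `worker` is a hypothetical async function; in a real pipeline it would start
// a fresh Chrome process (for example with its own --user-data-dir) per URL.
async function runBatch(urls, worker, concurrency = 4) {
  const results = new Array(urls.length);
  let next = 0;

  async function lane() {
    while (next < urls.length) {
      const index = next++; // no await between check and increment, so no race
      try {
        results[index] = {url: urls[index], ok: true, data: await worker(urls[index])};
      } catch (err) {
        // Record the failure instead of aborting the whole batch.
        results[index] = {url: urls[index], ok: false, error: String(err.message || err)};
      }
    }
  }

  const lanes = Array.from({length: Math.min(concurrency, urls.length)}, () => lane());
  await Promise.all(lanes);
  return results;
}

// Usage with a stand-in worker; emits one JSON document per line.
async function demo() {
  const fakeCapture = async (url) => {
    if (url.includes('broken')) throw new Error('navigation failed');
    return {requestCount: 3};
  };
  const out = await runBatch(['https://a.example/', 'https://broken.example/'], fakeCapture, 2);
  for (const record of out) console.log(JSON.stringify(record));
  return out;
}

demo();
```

Recording failures as ordinary records keeps the output aligned with the input list, which matters when a large share of sites in a big crawl time out or refuse connections.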

The tool also instruments CDP's Runtime domain to capture JavaScript execution contexts and evaluate custom measurement scripts within the page context. This enables researchers to extract specific DOM properties or execute custom fingerprinting logic:

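// Evaluate code in page context to extract architectural details
const result = await Runtime.evaluate({
  expression: `(function() {
    return {
      frameCount: window.frames.length,
      thirdPartyScripts: Array.from(document.scripts)
        // Inline scripts have an empty src, and new URL('') would throw
        .filter(s => s.src && new URL(s.src).origin !== location.origin)
        .map(s => s.src),
      // Presence of a <canvas> element is only a coarse hint, not proof of fingerprinting
      canvasFingerprinting: !!document.querySelector('canvas')
    };
  })()`,
  returnByValue: true
});
// With returnByValue, the extracted object is available as result.result.value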
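A `Runtime.evaluate` response wraps its payload: the value sits under `result.value`, and a script that throws in the page reports the failure through `exceptionDetails` rather than rejecting the call. The sketch below shows one way to unwrap it. `unwrapEvaluation` is a hypothetical helper, and the sample responses are hand-written to follow the CDP schema rather than captured from a real browser.

```javascript
// Hypothetical helper: turn a Runtime.evaluate response into a plain value,
// or throw if the page-side script raised an exception.
function unwrapEvaluation(response) {
  if (response.exceptionDetails) {
    const details = response.exceptionDetails;
    const message = (details.exception && details.exception.description) || details.text;
    throw new Error(`page script failed: ${message}`);
  }
  return response.result.value;
}

// Hand-written sample responses that follow the CDP response shape.
const okResponse = {
  result: {type: 'object', value: {frameCount: 2, thirdPartyScripts: []}}
};
const badResponse = {
  result: {type: 'object', subtype: 'error'},
  exceptionDetails: {text: 'Uncaught', exception: {description: 'TypeError: Invalid URL'}}
};

console.log(unwrapEvaluation(okResponse).frameCount);
try {
  unwrapEvaluation(badResponse);
} catch (err) {
  console.log(err.message);
}
```

Checking `exceptionDetails` explicitly matters at scale, because a measurement script that fails silently on some pages would otherwise look like missing data.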

The architecture assumes you're running on infrastructure capable of spawning hundreds of Chrome instances—this isn't designed for your laptop. The ZMap Project ethos shows through: instrument everything, capture raw data, analyze offline. ZBrowse doesn't try to interpret what it finds; it just builds a comprehensive record of what happened when Chrome loaded that URL.
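As an illustration of the "analyze offline" step, the following sketch counts how many captured sites contacted each third-party host. It assumes each capture is a JSON document shaped like the capture example earlier in this section (`url` plus a `requests` array of objects with a `url` field). That shape is an assumption here, not a documented ZBrowse schema, and `thirdPartyPrevalence` is a hypothetical name.

```javascript
// Count, for each third-party hostname, how many captured sites requested it.
// Assumed capture shape: {url, requests: [{url}, ...]}.
// Hostname comparison is deliberately crude: subdomains of the first party
// are counted as third parties in this sketch.
function thirdPartyPrevalence(captures) {
  const sitesPerHost = new Map();
  for (const capture of captures) {
    const firstParty = new URL(capture.url).hostname;
    const hosts = new Set(); // count each host once per site
    for (const request of capture.requests) {
      let host;
      try {
        host = new URL(request.url).hostname;
      } catch (err) {
        continue; // skip malformed URLs
      }
      // data: and blob: URLs have an empty hostname and are ignored
      if (host && host !== firstParty) hosts.add(host);
    }
    for (const host of hosts) {
      sitesPerHost.set(host, (sitesPerHost.get(host) || 0) + 1);
    }
  }
  // Most widespread hosts first.
  return [...sitesPerHost.entries()].sort((a, b) => b[1] - a[1]);
}

// Sample captures, invented for illustration.
const sampleCaptures = [
  {url: 'https://a.example/', requests: [
    {url: 'https://a.example/app.js'},
    {url: 'https://cdn.example/lib.js'},
    {url: 'https://tracker.example/t.js'}
  ]},
  {url: 'https://b.example/', requests: [
    {url: 'https://tracker.example/t.js'},
    {url: 'data:image/gif;base64,AAAA'}
  ]}
];

console.log(thirdPartyPrevalence(sampleCaptures));
```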

Gotcha

ZBrowse's documentation effectively assumes you already understand both the Chrome DevTools Protocol and the specific needs of internet measurement research. There's no schema documentation for the JSON output, no explanation of what fields mean, and no guidance on interpreting the dependency trees it captures. If you're not already familiar with CDP's Network.requestWillBeSent event structure or Runtime evaluation contexts, you'll spend significant time reverse-engineering the output format by reading the source code.

The "active development" warning in the repository is no joke. Headless Chromium breaks compatibility regularly as the Chrome team iterates on CDP domains and deprecates features. ZBrowse has minimal abstraction over these changes, meaning an upstream Chromium update can suddenly break your measurement pipeline. There's no versioning strategy for locking to specific Chromium releases, and the small community (62 stars) means you're largely on your own for troubleshooting. This is research-grade software where you're expected to understand the entire stack when things break. Production use cases requiring reliability should look elsewhere—this tool prioritizes measurement accuracy over stability guarantees.
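One way to soften the upstream-breakage problem is to fail fast when the browser is not the version a pipeline was validated against. The guard below is a sketch of that idea and not something ZBrowse provides. `checkPinnedBrowser` is a hypothetical helper, and it assumes the version object has a `Browser` field such as "HeadlessChrome/120.0.6099.109", which is what Chrome's `/json/version` endpoint reports.

```javascript
// Hypothetical guard: refuse to run if the browser's major version differs
// from the one the measurement pipeline was validated against.
function checkPinnedBrowser(versionInfo, pinnedMajor) {
  // The Browser field is assumed to look like "HeadlessChrome/120.0.6099.109".
  const match = /\/(\d+)\./.exec(versionInfo.Browser || '');
  if (!match) {
    throw new Error(`unrecognized Browser string: ${versionInfo.Browser}`);
  }
  const major = Number(match[1]);
  if (major !== pinnedMajor) {
    throw new Error(`expected Chromium ${pinnedMajor}, found ${major}`);
  }
  return major;
}

// Hand-written sample of the version object.
const sampleVersion = {Browser: 'HeadlessChrome/120.0.6099.109', 'Protocol-Version': '1.3'};

console.log(checkPinnedBrowser(sampleVersion, 120));
try {
  checkPinnedBrowser(sampleVersion, 119);
} catch (err) {
  console.log(err.message);
}
```

With chrome-remote-interface, the version object can be fetched with `CDP.Version()` before any capture starts, so a mismatched browser stops the run instead of silently changing the data.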

Verdict

Use if you're conducting academic research or security analysis requiring detailed architectural fingerprinting of websites at scale, especially if you need dependency graphs showing how scripts, trackers, and resources chain together. ZBrowse excels when you're measuring thousands of sites and need structured, reproducible data about web architecture patterns that can feed into statistical analysis or longitudinal studies. It's particularly valuable if you're already working with ZMap Project tools and need browser-based measurement to complement network scanning.

Skip if you need general web automation, testing, or scraping: Puppeteer and Playwright provide far better documentation, stability, and developer experience for those use cases. Also skip if you're building production systems requiring reliability guarantees, or if you're not prepared to maintain compatibility with upstream Chromium changes yourself. This is a specialized measurement instrument, not a general-purpose automation framework.
