Hounds: A Chromium-Based Web Crawler That Thinks Like a Bug Bounty Hunter

Hook

Most web crawlers fail the moment they encounter a single-page React application or bot detection middleware. Hounds sidesteps both problems by driving a real Chromium browser, so its traffic looks much more like a real user's than an HTTP client's.

Context

Traditional web crawlers built with HTTP client libraries like requests or axios struggle with modern web applications. They can't execute JavaScript, trigger dynamic content loading, or bypass bot detection mechanisms that check for browser fingerprints. This creates a blind spot for security researchers and bug bounty hunters who need to enumerate attack surface on JavaScript-heavy targets.

The problem becomes particularly acute during reconnaissance phases of penetration testing. You need to discover every endpoint, parameter, and form submission within your defined scope—but CDN resources, third-party analytics, and advertising networks create noise that obscures the actual attack surface. Tools like Burp Suite's spider work well but are resource-intensive and require manual configuration. Lightweight crawlers like hakrawler are fast but can't render JavaScript. Hounds fills this gap by combining Puppeteer's full browser automation with intelligent scope filtering and form interaction specifically designed for security reconnaissance workflows.

Technical Insight

Hounds is architecturally interesting because it treats web crawling as a browser automation problem rather than an HTTP client problem. Built on Node.js and Puppeteer, it launches a headless (or headed, if you prefer visibility) Chromium instance and intercepts every network request at the browser level using Puppeteer's request interception API.

The core crawling logic implements breadth-first search, which suits security work well. Instead of diving deep into the first link it finds (depth-first), Hounds explores all links at the current depth before moving deeper. This gives broader coverage early, which matters when you're working under time constraints or rate limits. On a site with a deep but narrow documentation section and a shallow but wide API reference, breadth-first search reaches the API endpoints sooner.
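To make the traversal order concrete, here is a minimal breadth-first sketch. It is not taken from the Hounds source: `crawlBreadthFirst`, `getLinks`, and `maxDepth` are hypothetical names, and `getLinks` stands in for whatever loads a page and returns the URLs found on it.

```javascript
// Breadth-first crawl sketch. `getLinks` is a hypothetical callback that
// loads a URL and resolves to an array of URLs found on that page.
async function crawlBreadthFirst(startUrl, getLinks, maxDepth = 2) {
  const visited = new Set([startUrl]);
  const order = [];
  let currentLevel = [startUrl];

  for (let depth = 0; depth <= maxDepth && currentLevel.length > 0; depth++) {
    const nextLevel = [];
    for (const url of currentLevel) {
      order.push(url);
      if (depth === maxDepth) continue; // do not expand the deepest level
      for (const link of await getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // mark on enqueue so each URL is queued once
          nextLevel.push(link);
        }
      }
    }
    currentLevel = nextLevel; // finish the whole level before going deeper
  }
  return order;
}
```

In a Puppeteer-based crawler, `getLinks` would typically navigate to the URL and collect links from the rendered page; it is left abstract here so the ordering logic stays visible.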

Here's how Hounds intercepts and filters requests:

await page.setRequestInterception(true);
page.on('request', (request) => {
  const url = request.url();
  const hostname = new URL(url).hostname;
  
  // In scope if the hostname ends with any scope suffix.
  // Note: a bare suffix such as "example.com" also matches "notexample.com".
  const inScope = scopeList.some(scope =>
    hostname.endsWith(scope)
  );
  
  if (inScope && !visited.has(url)) {
    discoveredUrls.add(url);
  }
  
  request.continue();
});

This interception happens before the browser makes the actual network call, allowing Hounds to capture every HTTP request the page issues (AJAX requests, dynamically loaded images, API calls triggered by JavaScript) without parsing HTML or making assumptions about document structure.

The form interaction feature is where Hounds gets particularly clever. Instead of blindly submitting every form it encounters, it generates a hash of the form's DOM structure (input names, types, and arrangement). This hash serves as a deduplication key, ensuring that a login form appearing in the header of every page only gets submitted once:

const formElements = await page.$$('form');

for (const form of formElements) {
  const formData = await form.evaluate(el => {
    const inputs = Array.from(el.querySelectorAll('input, select, textarea'));
    return inputs.map(input => ({
      name: input.name,
      type: input.type,
      tag: input.tagName
    }));
  });
  
  const formHash = generateHash(JSON.stringify(formData));
  
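  if (!submittedForms.has(formHash)) {
    // Record the hash first: submit() navigates the page, which can
    // invalidate the remaining element handles from this loop.
    submittedForms.add(formHash);
    await form.evaluate(f => f.submit());
  }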
}

This approach discovers hidden endpoints that only appear after form submission—password reset flows, multi-step wizards, or admin panels that check for POST parameters before rendering.
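The snippet above calls a `generateHash` helper without showing it. A minimal sketch, assuming SHA-256 via Node's built-in crypto module (the actual Hounds implementation may differ, and `formFingerprint` is a hypothetical wrapper added for illustration):

```javascript
const { createHash } = require('node:crypto');

// Hypothetical stand-in for the `generateHash` helper used above.
function generateHash(text) {
  return createHash('sha256').update(text).digest('hex');
}

// Fingerprint a form from its field descriptors. Only name, type, and tag
// are kept, so user-entered values never affect the hash. Field order is
// preserved, so rearranged fields produce a different fingerprint.
function formFingerprint(fields) {
  const normalized = fields.map(({ name, type, tag }) => ({ name, type, tag }));
  return generateHash(JSON.stringify(normalized));
}
```

Two forms with the same fields in the same order hash to the same key, which is what lets a header login form be submitted only once across the whole crawl.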

The proxy integration is straightforward but essential for security workflows. By configuring Puppeteer to route all traffic through a local proxy, you can pipe Hounds' discoveries directly into Burp Suite or OWASP ZAP:

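const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({
  headless: true,
  args: [
    // Send all browser traffic through the local intercepting proxy
    '--proxy-server=127.0.0.1:8080',
    // Accept the proxy's self-signed TLS certificate
    '--ignore-certificate-errors'
  ]
});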

The output format supports both simple URL lists (for quick enumeration) and JSON objects containing full HTTP request details including headers, methods, and POST data. This dual output makes Hounds flexible for both automated pipelines and manual analysis.
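As an illustration of the two output modes, the sketch below turns an intercepted request into either a bare URL or a JSON line. The field names and the `toRecord` and `formatOutput` helpers are assumptions for this example, not Hounds' actual schema; the `url()`, `method()`, `headers()`, and `postData()` calls are the accessors Puppeteer exposes on its request objects.

```javascript
// Build a plain object from a Puppeteer-style request.
// Field names here are illustrative, not Hounds' real output schema.
function toRecord(request) {
  return {
    url: request.url(),
    method: request.method(),
    headers: request.headers(),
    postData: request.postData() ?? null, // undefined for requests without a body
  };
}

// Emit either a bare URL (list mode) or one JSON object per line.
function formatOutput(request, asJson) {
  return asJson ? JSON.stringify(toRecord(request)) : request.url();
}
```

One JSON object per line keeps the output easy to stream into tools such as `jq`, while the bare URL mode can feed directly into other recon tools.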

One architectural decision worth noting: Hounds maintains separate sets for visited URLs versus discovered URLs. A URL is "discovered" when the browser makes a request to it (even if it's a 404 or redirect), but it's only "visited" after Hounds actually loads it in a page context and crawls its links. This distinction prevents infinite loops while ensuring comprehensive network-level discovery.
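The two-set bookkeeping can be sketched in a few lines. `CrawlState` and its method names are hypothetical and only illustrate the distinction described above; they are not taken from the Hounds source.

```javascript
// Track network-level discovery separately from page-level crawling.
class CrawlState {
  constructor() {
    this.discovered = new Set(); // every in-scope URL the browser requested
    this.visited = new Set();    // URLs actually loaded as pages and crawled
  }

  // Called from the request handler for each in-scope request.
  recordRequest(url) {
    this.discovered.add(url);
  }

  // Called once a URL has been loaded in a page and its links collected.
  markVisited(url) {
    this.discovered.add(url);
    this.visited.add(url);
  }

  // Discovered URLs that have not been crawled yet.
  pending() {
    return [...this.discovered].filter((url) => !this.visited.has(url));
  }
}
```

Because `pending()` never returns a visited URL, a page that links back to itself or to an ancestor cannot be queued twice, yet every request still ends up in the discovery output.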

Gotcha

The scope filtering mechanism is Hounds' most significant limitation. It uses simple hostname suffix matching, which means you can't express complex scope definitions like "all of example.com except /admin" or "*.api.example.com but not legacy.api.example.com". You'll end up with either false positives (out-of-scope content) or false negatives (missing in-scope URLs) on programs with nuanced scope requirements. There's no support for path-based filtering, regex patterns, or exclusion rules beyond the hostname level.
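If you fork the tool, a stricter scope check is not much code. The sketch below is one possible design, not something Hounds provides: `isInScope` and the `include`, `excludeHosts`, and `excludePathPrefixes` options are hypothetical names. It matches on a dot boundary, so `example.com` no longer matches `notexample.com`, and supports host and path exclusions.

```javascript
// Scope check sketch with dot-boundary host matching and exclusions.
function isInScope(rawUrl, scope) {
  const { include, excludeHosts = [], excludePathPrefixes = [] } = scope;

  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }

  const host = url.hostname.toLowerCase();
  // Match the domain itself or any subdomain of it, never a lookalike.
  const matches = (domain) => host === domain || host.endsWith('.' + domain);

  if (excludeHosts.some(matches)) return false;
  // Plain prefix match, so '/admin' also excludes '/administrator'.
  if (excludePathPrefixes.some((p) => url.pathname.startsWith(p))) return false;
  return include.some(matches);
}
```

Exclusions are checked before inclusions, so a rule like "`api.example.com` but not `legacy.api.example.com`" behaves as expected.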

Form submission behavior is another pain point. Hounds fills all form fields with empty values by default, which means it won't discover content behind authentication walls or parameter-dependent logic. If a search form requires a non-empty query string to return results, or a configuration page checks for valid input before revealing additional options, Hounds will miss that content entirely. You'd need to fork the code and implement custom form-filling logic for your specific target.

The repository also appears unmaintained (the last commit looks to be years old), so expect compatibility issues with recent Puppeteer versions and Chromium builds.

Rendering every page in a real browser also makes Hounds significantly slower and more resource-intensive than HTTP-client-based alternatives. As a rough guide, expect 2-10 seconds per page compared to milliseconds per request for tools like hakrawler.
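Returning to the empty-value form submissions described above: custom form-filling logic in a fork could start from something as small as a per-type placeholder table. This is a sketch under assumptions, not part of Hounds; `placeholderFor` is a hypothetical helper and the values are arbitrary. It takes the same kind of field descriptor (`type`, `tag`) that the form-hashing snippet collects.

```javascript
// Field types that should keep whatever value the page already set.
const UNTOUCHED_TYPES = [
  'hidden', 'submit', 'button', 'reset', 'image', 'file', 'checkbox', 'radio',
];

// Pick a placeholder for a field descriptor, or null to leave it alone.
// The values are arbitrary illustrations, not defaults from Hounds.
function placeholderFor({ type = 'text', tag = 'INPUT' } = {}) {
  if (tag === 'SELECT' || UNTOUCHED_TYPES.includes(type)) return null;
  switch (type) {
    case 'email': return 'test@example.com';
    case 'url': return 'https://example.com/';
    case 'number':
    case 'range': return '1';
    case 'date': return '2024-01-01';
    default: return 'test'; // text, search, password, tel, textarea, ...
  }
}
```

Hidden fields are deliberately left untouched because they often carry CSRF tokens that the server expects back unchanged. Realistic credentials for authenticated crawling would still have to be supplied per target.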

Verdict

Use Hounds if you're crawling modern JavaScript-heavy web applications during reconnaissance for bug bounty or penetration testing work, especially when target sites employ bot detection that blocks traditional crawlers. The breadth-first approach and automatic form discovery make it valuable for comprehensive attack surface mapping, and the proxy integration fits naturally into existing security workflows. The tool shines on single-page applications, React/Vue/Angular frontends, and sites with heavy AJAX usage where HTTP-client-based crawlers fall flat.

Skip Hounds if you're working with simple static sites where speed matters more than JavaScript execution, if you need authenticated crawling with realistic form data, or if your scope requirements involve complex path-based rules. The apparent lack of maintenance means you'll likely need to update dependencies yourself. For production security pipelines, consider more actively maintained alternatives like Katana from ProjectDiscovery, which offers similar functionality with better scope control and ongoing support.
