Katana: Building a Modern Web Crawler That Actually Understands JavaScript
Hook
Most web crawlers treat JavaScript like a black box. Katana doesn’t just execute it—it parses, extracts endpoints from it, and gives you control over whether you want speed or completeness.
Context
Web crawling for security research has always forced an uncomfortable choice: go fast with simple HTTP clients that miss JavaScript-rendered content, or go slow with headless browsers that burn CPU and memory. Modern web applications don’t play by those rules. Single-page applications hide critical endpoints in JavaScript bundles. API calls happen via XHR after page load. Forms auto-populate based on client-side logic. ProjectDiscovery built Katana as a next-generation crawling and spidering framework—designed for offensive security workflows like bug bounty hunting, penetration testing, and attack surface mapping—where you need to discover every possible endpoint, parameter, and form field. The name references the Japanese sword: fast, sharp, and purpose-built. Unlike general web scraping frameworks, Katana appears optimized for discovering attack surface rather than extracting data, based on its rich filtering capabilities and multiple output formats designed to feed other tools in reconnaissance pipelines.
Technical Insight
Katana’s architecture centers on mode switching between standard HTTP crawling and headless Chrome automation. The standard mode uses native Go HTTP clients—fast, lightweight, and perfect for static content. The headless mode (-hl or -headless flag) enables hybrid crawling with a browser, executing JavaScript and capturing dynamic requests. Here’s where it gets interesting: you can enable JavaScript parsing (-jc) without headless mode. Katana will fetch JS files over HTTP, then statically analyze them to extract endpoint patterns. For more aggressive JavaScript parsing, the -jsl (jsluice) flag is available, though the README explicitly warns this is memory-intensive. This hybrid approach gives you flexibility between resource consumption and coverage.
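The idea behind static JavaScript parsing can be sketched with plain unix tools. This is emphatically not Katana's implementation (its parser is far more sophisticated), and the sample bundle below is invented for illustration:

```shell
# Hedged sketch: pull quoted path-like strings out of a JS bundle,
# roughly what static endpoint extraction boils down to.
# (NOT katana's actual parser; bundle.js is fabricated.)
cat > bundle.js <<'EOF'
fetch("/api/v1/users");
const login = "/api/v1/auth/login";
EOF
grep -oE '"/[A-Za-z0-9/_.-]+"' bundle.js | tr -d '"' | sort -u
```

The payoff is that endpoints referenced only in client-side code surface without ever spinning up a browser.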
The crawling strategy itself is configurable via -s depth-first or -s breadth-first. Depth-first hammers one path until it hits your -depth limit (default 3 levels), useful when you want to thoroughly map one workflow before moving to the next. Breadth-first fans out horizontally, discovering all top-level sections before drilling down—better for quick reconnaissance across a large application. Here’s a practical example for a recon workflow:
# Quick breadth-first recon with JS parsing
katana -u https://target.com \
-d 4 \
-s breadth-first \
-jc \
-kf all \
-fx \
-jsonl \
-o endpoints.jsonl
This crawls 4 levels deep, enables JavaScript endpoint extraction, crawls known files (robots.txt, sitemap.xml via -kf all), extracts form structures, and outputs structured JSON. The -fx (form-extraction) flag serializes every form, input field, textarea, and select element it finds. For authenticated testing, you'd add headers:
katana -u https://app.target.com/dashboard \
-H "Authorization: Bearer eyJ..." \
-H "Cookie: session=abc123" \
-headless \
-aff \
-d 5
The -aff (automatic-form-fill) flag is marked experimental in the README. When enabled, Katana attempts to populate form fields based on configurable rules. In headless mode, it can submit these forms and follow resulting requests, potentially discovering authenticated endpoints.
Scope control is extensive. The -iqp (ignore-query-params) flag treats URLs with different query parameters as identical (/users?id=1 and /users?id=2 become one). The -fsu (filter-similar) flag with -fst threshold (default 10) uses heuristics to detect parameterized paths—after seeing the threshold number of distinct values in a path position, it treats that segment as a parameter and stops crawling similar URLs. This prevents crawling thousands of pagination links or user profile variations. You can also exclude hosts:
katana -u https://target.com -exclude cdn,private-ips
The README documents these filters: cdn, private-ips, CIDR blocks, IP addresses, and regex patterns.
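What -iqp deduplication amounts to can be sketched with standard tools. This shows only the idea, not Katana's code, and the URLs are made up:

```shell
# Hedged sketch of -iqp-style deduplication (not katana's implementation):
# strip query strings, then keep unique URLs.
printf '%s\n' \
  'https://target.com/users?id=1' \
  'https://target.com/users?id=2' \
  'https://target.com/about' |
  sed 's/?.*$//' | sort -u
```

Three crawl candidates collapse to two, which is exactly the saving you want on pagination-heavy sites.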
For integration into automated pipelines, Katana accepts input from stdin, a single URL, or a list file, and writes results to stdout, making it pipe-friendly:
cat subdomains.txt | katana -jc -d 3 | grep -E "\.(js|json)$"
The -jsonl flag provides structured output; the README also documents plain stdout and file output options.
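Structured output is easy to post-process without extra dependencies. The sample lines below mimic the jsonl schema as I read it from the README (a request object with method and endpoint fields); they are fabricated, not real crawl output:

```shell
# Hedged: sample lines imitate katana's jsonl shape (request.endpoint);
# fabricated for illustration, not real output.
cat > endpoints.jsonl <<'EOF'
{"request":{"method":"GET","endpoint":"https://target.com/api/users"}}
{"request":{"method":"POST","endpoint":"https://target.com/api/login"}}
EOF
# pull out just the endpoints using python's json module (no jq needed)
python3 - <<'EOF'
import json
for line in open("endpoints.jsonl"):
    print(json.loads(line)["request"]["endpoint"])
EOF
```

The same pattern feeds discovered endpoints straight into fuzzers or HTTP probes downstream.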
Gotcha
Headless mode is resource-intensive. The installation requirements hint at this: CGO_ENABLED=1 is required for the Go build, which breaks the typical Go promise of a static binary. You need a C compiler and Chrome installed (or the -system-chrome flag to use a locally installed Chrome browser), which complicates deployment. ProjectDiscovery provides Docker images that handle this complexity, with specific instructions for running in headless mode:
docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless
The -aff automatic form filling is marked experimental in the README for good reason. Simple forms work, but anything with CAPTCHA, complex CSRF token requirements, or multi-step wizards may fail. The form configuration file (-fc) lets you customize the field-filling logic, but complex interaction flows require careful configuration. The README also mentions captcha solver integration via the -csp (captcha-solver-provider) and -csk (captcha-solver-key) flags.
JavaScript parsing with -jsl (jsluice mode) is explicitly called out in the README as memory-intensive. If you're crawling targets with large JS bundles, expect significant memory usage; monitor resources and consider running targets in batches rather than in parallel.
Similar URL filtering (-fsu) uses heuristics based on counting distinct values per path segment. The default threshold (-fst) is 10 distinct values. If you have legitimate routes that happen to look parameterized (/products/shoes, /products/shirts, /products/hats…), Katana will stop crawling after the threshold, potentially missing real endpoints. You need to understand your target’s URL structure before enabling this filter.
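The counting heuristic is easy to picture with standard tools. This sketch only illustrates the distinct-values-per-segment idea, not Katana's internal logic, and the paths are invented:

```shell
# Hedged sketch of the distinct-value counting behind -fsu/-fst
# (illustrative only; katana's heuristic is internal to the tool).
cat > paths.txt <<'EOF'
/products/shoes
/products/shirts
/products/hats
/users/1
/users/2
EOF
# count distinct second-segment values under each first segment
cut -d/ -f2,3 paths.txt | sort -u | cut -d/ -f1 | uniq -c
```

Under the default -fst of 10, neither /products (3 distinct children) nor /users (2) would trip the filter here, but once a segment accumulates ten distinct values it gets treated as parameterized and further variants are skipped.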
The README also notes a minimum depth of 3 is required for the -kf (known-files) flag to ensure all known files are properly crawled. Running with insufficient depth may result in incomplete coverage of robots.txt and sitemap.xml discoveries.
Verdict
Use Katana if you're doing offensive security work and need comprehensive attack surface mapping. The dual-mode architecture lets you choose between speed (standard mode with JS parsing) and completeness (headless mode), and the well-documented filtering and scope controls keep large-scale recon across multiple targets from drowning in noise. Stdin/stdout support and structured JSON output make it easy to feed discovered endpoints to other tools in an automation pipeline. Skip it if you're working in resource-constrained environments where you can't afford the headless overhead or meet the CGO_ENABLED=1 requirement. Also skip it if you need production-grade form automation with complex auth flows: the README clearly marks automatic form filling as experimental. For simple link extraction from static sites, simpler tools may be faster. Katana is a specialist tool for security reconnaissance with rich configuration options, and its feature set appears well-suited to that specific job.