Katana: Why ProjectDiscovery Built a Crawler That Speaks Both HTTP and Chrome

Hook

Most web crawlers force you to choose between speed and JavaScript coverage. Katana refuses to pick a side—and that architectural decision reveals something fascinating about how modern security tools handle the SPA epidemic.

Context

Traditional web crawlers live in two incompatible camps. Tools like wget and curl derivatives excel at speed, parsing HTML and following links at hundreds of requests per second. But they're blind to JavaScript-rendered content, missing the API endpoints, dynamic routes, and hidden forms that comprise most modern web applications. On the other side, headless browser solutions like Puppeteer see everything but crawl at a glacial pace, consuming gigabytes of RAM to render pages that may contain nothing new.

For security researchers and bug bounty hunters, this divide is painful. Reconnaissance—the critical first phase of security testing—demands comprehensive coverage. Miss a single endpoint and you miss a potential vulnerability. Yet time and resources aren't infinite. You can't afford to headlessly crawl a 10,000-page application, but you also can't afford to skip the single-page admin panel that only renders via React. ProjectDiscovery built Katana to solve this dilemma for their own security workflows, creating a crawler that switches modes based on what it discovers rather than forcing you to choose upfront.

Technical Insight

Katana's architecture centers on a mode-switching engine that maintains two parallel crawling pipelines. The standard mode uses Go's native HTTP client with custom HTML parsing to extract links, forms, and static JavaScript references. When it encounters indicators of heavy JavaScript usage—specific meta tags, empty HTML bodies with script tags, or user-defined patterns—it can route those URLs to a headless Chrome instance managed via the Chrome DevTools Protocol.

This dual-mode approach appears in the configuration options. You can invoke standard crawling with basic flags:

katana -u https://api.example.com \
  -d 5 \
  -jc \
  -kf robotstxt,sitemapxml \
  -ef css,png,jpg \
  -fs rdn \
  -o discovered_endpoints.txt

Here, -jc enables endpoint parsing in static .js files, -kf enables crawling of known files (robots.txt and sitemap.xml), -ef filters the listed asset extensions out of the output (-em is the opposite: it keeps only the listed extensions), and -fs rdn scopes the crawl to the root domain name and its subdomains. This mode processes hundreds of URLs per minute with minimal resource overhead.
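To make the standard-mode extraction step concrete, here is a minimal sketch of pulling links, form actions, and script references out of raw HTML. It uses Python's standard library rather than Katana's Go parser, and the class and function names (LinkExtractor, extract) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects link, form, and script targets from one HTML document."""

    # (tag, attribute) pairs worth following; an illustrative subset.
    TARGETS = {("a", "href"), ("form", "action"), ("script", "src")}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.found: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.TARGETS:
                # Resolve relative references against the page URL.
                self.found.append((tag, urljoin(self.base_url, value)))


def extract(base_url: str, html: str) -> list[tuple[str, str]]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.found


page = """
<html><body>
  <a href="/docs">Docs</a>
  <form action="login" method="post"></form>
  <script src="https://static.example.com/app.js"></script>
  <div id="root"></div>
</body></html>
"""

for tag, url in extract("https://example.com/app/", page):
    print(tag, url)
```

Note what the sketch cannot see: whatever a script later renders into the empty root div. That blind spot is the gap headless mode exists to close.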

When you need headless coverage, the architecture shifts entirely:

katana -u https://app.example.com \
  -headless \
  -xhr-extraction \
  -form-extraction \
  -automatic-form-fill \
  -strategy depth-first \
  -d 3 \
  -jsonl \
  -o spa_endpoints.txt

The -headless flag (short form -hl, so passing both is redundant) turns on headless hybrid crawling, which Katana itself labels experimental. -xhr-extraction records the URL and method of XHR requests issued by the page's JavaScript, and -form-extraction captures form, input, textarea, and select elements; both write their results to the JSONL output, hence -jsonl. The -automatic-form-fill feature deserves particular attention: it populates form fields with test data to trigger additional application states, and ProjectDiscovery marks it experimental for good reason.

What makes Katana architecturally interesting is how it handles scope management in an automation context. Security tools often suffer from scope creep, following links to CDNs, analytics providers, and third-party services. Katana builds scope control into its core with filter chains:

katana -u https://example.com \
  -f qurl \
  -fs dn \
  -mr '(api|admin|dashboard)' \
  -crawl-scope '.*\.example\.com.*' \
  -exclude cdn,private-ips \
  -o filtered_results.txt

The -f qurl flag prints only URLs that carry query parameters, -fs dn scopes the crawl to the domain name keyword (here, example), and -mr keeps only output URLs that match the regex. -crawl-scope restricts which URLs the crawler follows; -filter-regex does the reverse, dropping matching URLs from the output, so it is the wrong tool for pinning a crawl to one domain. The -exclude flag takes built-in filters: cdn skips hosts on known CDNs, while private-ips skips private IP ranges.
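The order of those checks matters more than any single flag. A rough sketch of the same filter chain (private-address exclusion, CDN exclusion, domain scope, then regex match) might look like this. The CDN suffix list and the helper names are hypothetical, and Katana's own lists and matching rules will differ.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical values for illustration; Katana ships its own lists.
CDN_SUFFIXES = (".cloudfront.net", ".akamaized.net", ".fastly.net")
MATCH = re.compile(r"(api|admin|dashboard)")


def is_private_host(host: str) -> bool:
    """True for localhost and for literal private or loopback IPs."""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not an IP literal
    return ip.is_private or ip.is_loopback


def keep(url: str, root_domain: str) -> bool:
    """Apply exclusion, scope, and match checks in order."""
    host = urlsplit(url).hostname or ""
    if is_private_host(host):
        return False
    if host.endswith(CDN_SUFFIXES):
        return False
    in_scope = host == root_domain or host.endswith("." + root_domain)
    return in_scope and MATCH.search(url) is not None


candidates = [
    "https://app.example.com/admin/users",
    "https://app.example.com/about",
    "https://d1.cloudfront.net/api/v1",
    "http://10.0.0.5/admin",
    "https://notexample.com/admin",
]
for url in candidates:
    print(keep(url, "example.com"), url)
```

Only the first candidate survives. The last one is the classic trap: a naive substring check on "example.com" would have let notexample.com through.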

Under the hood, Katana uses a similarity hashing algorithm to prevent redundant crawling of near-duplicate pages. Many web applications generate thousands of pagination URLs or calendar views that differ only in parameters. The tool computes SimHash fingerprints of page structure and content, grouping similar pages to avoid crawling every variation. You control the sensitivity threshold, balancing between duplicate reduction and missing legitimate unique content.
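For readers who have not met SimHash before, here is a minimal sketch of the general technique: hash each token, let every token vote on each bit, and compare fingerprints by Hamming distance. It is not Katana's implementation. The tokenizer, the 64-bit width, and the default threshold of 3 are all assumptions.

```python
import hashlib
import re

BITS = 64  # fingerprint width


def simhash(text: str) -> int:
    """Compute a SimHash fingerprint over the word tokens of text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    return hamming(simhash(a), simhash(b)) <= threshold


march = "Events calendar March 2024 view month next previous"
april = "Events calendar April 2024 view month next previous"
print(hamming(simhash(march), simhash(april)))
```

Pages that share most of their tokens tend to land within a small Hamming distance of each other, which is what makes a threshold usable as a sensitivity knob: raise it and more pages collapse into one group, lower it and fewer do.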

The JSLuice integration represents another architectural decision worth examining. Rather than regex-based JavaScript parsing (fragile and prone to false positives), Katana incorporates JSLuice—a JavaScript analysis library that understands syntax trees. It extracts API endpoints, route definitions, and configuration objects from minified production code:

katana -u https://example.com \
  -js-crawl \
  -jsl \
  -known-files all \
  -o api_surface.txt

The -jsl flag enables JSLuice parsing on top of -js-crawl (Katana's help text warns that it is memory intensive), while -known-files all adds robots.txt and sitemap.xml to the crawl. That flag accepts all, robotstxt, or sitemapxml; it does not target API definition files such as swagger.json or openapi.yaml. JavaScript parsing can surface endpoints that never appear in HTML or in network requests during normal browsing: routes defined in bundles for future features or admin functions.
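The fragility of the regex approach is easy to demonstrate. The sketch below is deliberately the naive technique the article contrasts JSLuice against, not JSLuice itself. The pattern and the sample bundle are made up for illustration.

```python
import re

# Naive pattern: any quoted string that starts with a slash.
PATH_LITERAL = re.compile(r"""["'](/[A-Za-z0-9_\-./]+)["']""")

bundle = """
// legacy: fetch('/api/v0/users') was removed in 2021
const BASE = '/api/v2';
fetch(BASE + '/orders');
const logo = '/static/logo.png';
"""

found = PATH_LITERAL.findall(bundle)
print(found)
```

The regex reports /api/v0/users even though it only exists in a comment, treats a static asset as an endpoint, and returns /api/v2 and /orders as two fragments instead of the /api/v2/orders route the code actually calls. A parser that works on the syntax tree is in a position to skip comments and to see the concatenation.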

Katana writes plain text by default and JSON Lines when you pass -jsonl, both designed for pipeline integration. The JSONL output includes metadata like HTTP status, content length, and extraction source, enabling downstream filtering. Combined with ProjectDiscovery's other tools (Nuclei for vulnerability scanning, httpx for probing), it forms an automation chain where Katana's discovered endpoints feed directly into security testing workflows.
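Downstream filtering of that kind of output can be as small as the sketch below. The sample records and their field names (request.method, request.endpoint, request.source, response.status_code, response.content_length) are assumptions modeled on line-delimited JSON crawler output; check them against what your Katana version actually emits before relying on them.

```python
import json

# Sample lines shaped like JSONL crawler output; field names are assumptions.
sample = """\
{"request": {"method": "GET", "endpoint": "https://example.com/api/users", "source": "https://example.com/app.js"}, "response": {"status_code": 200, "content_length": 512}}
{"request": {"method": "GET", "endpoint": "https://example.com/old", "source": "https://example.com/"}, "response": {"status_code": 404, "content_length": 0}}
{"request": {"method": "POST", "endpoint": "https://example.com/api/login", "source": "https://example.com/login"}, "response": {"status_code": 302, "content_length": 0}}
"""


def live_endpoints(lines, max_status=399):
    """Yield (method, endpoint) for records whose status is at most max_status."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        status = record.get("response", {}).get("status_code")
        if status is not None and status <= max_status:
            req = record["request"]
            yield req["method"], req["endpoint"]


for method, endpoint in live_endpoints(sample.splitlines()):
    print(method, endpoint)
```

The 404 record is dropped and the other two pass through, ready to hand to the next tool in the chain.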

Gotcha

The headless mode's resource consumption isn't trivial overhead—it's a fundamental constraint. Each Chrome instance requires 200-500MB of base memory before loading any pages, and complex SPAs can push that to multiple gigabytes. Running parallel headless crawls (enabled via the -c concurrency flag) multiplies this linearly. On a 16GB system, you're realistically limited to 5-10 concurrent headless crawlers before hitting memory pressure. This makes large-scale reconnaissance of multiple targets impractical without significant infrastructure, contrasting sharply with standard mode's ability to handle hundreds of concurrent crawls.
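The 5-10 figure falls out of simple division. In this back-of-the-envelope sketch, the 1 GB operating-system reserve and the per-instance sizes of 3 GB and 1.5 GB are assumptions picked to bracket the range above, not measurements.

```python
def max_headless_workers(total_gb: float, per_instance_gb: float,
                         reserved_gb: float = 1.0) -> int:
    """Whole Chrome instances that fit after reserving memory for the OS."""
    usable = total_gb - reserved_gb
    if usable <= 0 or per_instance_gb <= 0:
        return 0
    return int(usable // per_instance_gb)


# 16 GB host, 1 GB reserved; per-instance figures are assumptions.
print(max_headless_workers(16, 3.0))  # heavy SPA pages -> 5
print(max_headless_workers(16, 1.5))  # lighter pages   -> 10
```

Measure real per-instance usage on your own targets before picking a -c value; a single heavy SPA can move you from the top of that range to the bottom.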

The automatic form filling feature, while innovative, carries real risk. The experimental nature isn't just a disclaimer—it can submit contact forms, trigger account actions, or POST data to production endpoints. The form-filling logic uses heuristics to guess field types (email, password, text) and populates them with test values, but it doesn't understand business logic. On an e-commerce site, it might attempt to submit orders. On an admin panel, it might try to create users. There's no dry-run mode or simulation capability; when enabled, it acts on live systems. Security researchers need to understand their authorization level and potential impact before activating this flag, and it's completely inappropriate for bug bounty programs with strict testing scopes.
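To see why heuristics alone are dangerous, consider a sketch of field-type guessing. This is a hypothetical illustration of the general idea, not Katana's form-fill logic. The test values and keyword lists are invented.

```python
# Hypothetical test values keyed by guessed field kind.
TEST_VALUES = {
    "email": "test@example.com",
    "password": "Passw0rd!",
    "number": "1",
    "text": "test",
}


def guess_kind(field_type: str, field_name: str) -> str:
    """Guess a field's kind from its type attribute, then from its name."""
    field_type = field_type.lower()
    field_name = field_name.lower()
    if field_type in ("email", "password", "number"):
        return field_type
    if "mail" in field_name:
        return "email"
    if "pass" in field_name or "pwd" in field_name:
        return "password"
    if any(k in field_name for k in ("qty", "quantity", "amount")):
        return "number"
    return "text"


def fill(fields: list[tuple[str, str]]) -> dict[str, str]:
    """Map each (type, name) pair to a test value."""
    return {name: TEST_VALUES[guess_kind(ftype, name)] for ftype, name in fields}


order_form = [("text", "contact_email"), ("text", "quantity"), ("text", "coupon")]
print(fill(order_form))
```

The guesser produces a perfectly plausible payload for an order form: a contact email, a quantity of 1, a coupon string. Nothing in it knows that submitting that payload places an order.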

The CGO requirement introduces deployment friction in containerized environments. Unlike pure Go binaries that compile to truly static executables, Katana's dependencies require CGO_ENABLED=1, meaning you need a C compiler and standard library in your build environment. Cross-compilation becomes complex—building for Linux ARM64 from a Mac requires target-specific toolchains. In Docker contexts, this inflates image sizes and complicates multi-stage builds. While ProjectDiscovery provides pre-built binaries, teams wanting to integrate Katana into custom toolchains or modify the source face this compilation complexity.

Verdict

Use Katana if you're conducting security reconnaissance on modern web applications, especially in bug bounty or penetration testing contexts where you need comprehensive endpoint discovery within strict scopes. It excels when targets mix static content with JavaScript-heavy sections, letting you apply expensive headless crawling selectively. The ProjectDiscovery ecosystem integration makes it particularly valuable if you're already using Nuclei or httpx in automation pipelines. Use it when you need intelligent scope control: the regex filtering, CDN exclusion, and similarity deduplication are first-class features, not afterthoughts.

Skip it if you're doing general web scraping for data extraction (Scrapy's middleware ecosystem is more mature), need crawling as part of a larger browser automation workflow (Playwright gives you more control), or are resource-constrained without access to systems that can handle headless Chrome. Also skip it for crawling simple static sites where a basic HTTP crawler would suffice; you're adding complexity and dependencies without gaining capability. If you need production-grade reliability with comprehensive error handling and retry logic for business-critical scraping, more established frameworks remain the safer choice.
