galer: JavaScript-Aware URL Extraction for Security Reconnaissance

Hook

Most URL crawlers see only half the attack surface. When JavaScript renders client-side routes, builds URLs dynamically, or lazy-loads content, traditional HTML parsers are flying blind—and that's exactly where vulnerabilities hide.

Context

Security researchers and penetration testers face a fundamental problem: modern web applications don't put all their URLs in static HTML anymore. Single-page applications built with React, Vue, or Angular construct routes in JavaScript. API endpoints get assembled at runtime. Navigation menus populate after AJAX calls complete. Traditional crawlers that parse raw HTML responses miss all of this, leaving potentially vulnerable endpoints undiscovered.

The existing tooling landscape splits into two camps: lightweight tools that parse HTML quickly but miss JavaScript-generated content, and heavy browser automation frameworks like Selenium or Puppeteer that require substantial boilerplate just to extract URLs. galer occupies the middle ground, providing JavaScript-aware crawling specifically optimized for URL extraction workflows common in bug bounty hunting and security assessments. It leverages Chrome DevTools Protocol to render pages fully while maintaining the pipeline-friendly interface that security tools expect.

Technical Insight

At its core, galer wraps Chrome DevTools Protocol in a focused interface designed for one task: extracting every URL a browser would see. When you point galer at a page, it launches a headless Chromium instance, waits for JavaScript execution to complete, then evaluates the DOM to extract URLs from src, href, url, and action attributes across all elements. This approach catches dynamically inserted script tags, AJAX-loaded images, and client-side routing that would be invisible to curl | grep.
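To make the curl | grep limitation concrete, here is a minimal sketch of static attribute extraction over raw HTML. It is not galer's code: the regular expression, the staticURLs helper, and the sample page are all hypothetical, and a regex is only a crude stand-in for a real HTML parser. The point is that a link assembled by script never appears as an attribute in the unrendered source.

```go
package main

import (
	"fmt"
	"regexp"
)

// attrURL matches href/src/action attribute values in raw HTML.
// Illustration only: a regex is not a substitute for an HTML parser.
var attrURL = regexp.MustCompile(`(?i)\b(?:href|src|action)\s*=\s*["']([^"']+)["']`)

// staticURLs returns the attribute URLs visible in the unrendered source.
func staticURLs(html string) []string {
	var out []string
	for _, m := range attrURL.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

// page is a hypothetical response body: one static link and one link
// that only exists after the script has run in a browser.
const page = `<a href="/about">About</a>
<script>
  var a = document.createElement("a");
  a.setAttribute("href", "/api/" + "v2" + "/users");
  document.body.appendChild(a);
</script>`

func main() {
	// Only the static link is found; the script-built one is invisible here.
	fmt.Println(staticURLs(page))
}
```

A browser-backed extractor reads the attributes after the script has executed, so the second link would show up in the DOM even though no static scan of the source can see it.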

The tool's architecture reveals thoughtful design for security workflows. Here's basic usage extracting URLs from a single target:

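# Basic crawl with depth 2
galer -u https://example.com -d 2

# Pipeline integration: subdomain enumeration to URL discovery
# (-silent keeps banners out of the pipe)
subfinder -d example.com -silent | httpx -silent | galer -c 10

# Same-host URLs only, with common static files filtered out downstream.
# Note: galer's own -e flag selects extensions to show rather than exclude,
# and -s is silent mode; confirm with galer -h for your version.
galer -u https://example.com --same-host | grep -Ev '\.(js|css|png|jpg)(\?|$)'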

The concurrency model deserves attention. The -c flag controls how many Chrome instances run simultaneously. Each instance carries significant overhead, typically 100-200MB of memory, so this isn't like setting 100 concurrent HTTP requests in a traditional crawler. In practice, values between 5 and 15 work well, depending on your system resources. This is one area where galer's JavaScript-rendering advantage comes with real costs.

What makes galer particularly valuable is its template-based output system. Security workflows often need specific URL components for downstream processing. Instead of parsing URLs in each script, galer handles it:

# Extract just hostnames for a unique target list
# (-T is the template flag; -t sets the timeout)
galer -u https://example.com -T '{{host}}' | sort -u

# Build a custom output format for further processing
galer -u https://example.com -T '{{scheme}}://{{host}}{{path}}'

# Extract paths only, useful for endpoint analysis
galer -u https://example.com -T '{{path}}' | sort -u

The template uses {{variable}} placeholders with lowercase names such as scheme, host, port, path, and raw_query; the README lists the full set of supported variables. This transforms galer from a simple URL extractor into a flexible component for building reconnaissance pipelines.
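As a rough picture of how URL components map onto template variables, the sketch below expands placeholders with Go's net/url and strings.NewReplacer. It is an illustration under assumptions, not galer's implementation: formatURL is a hypothetical helper, and the placeholder names are illustrative and should be checked against the tool's documentation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// formatURL parses raw and substitutes its components into tmpl.
// Placeholder names are assumptions made for this example.
func formatURL(tmpl, raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	r := strings.NewReplacer(
		"{{scheme}}", u.Scheme,
		"{{host}}", u.Host, // includes the port when one is present
		"{{hostname}}", u.Hostname(), // host without the port
		"{{port}}", u.Port(),
		"{{path}}", u.Path,
		"{{raw_query}}", u.RawQuery,
	)
	return r.Replace(tmpl), nil
}

func main() {
	out, err := formatURL("{{scheme}}://{{host}}{{path}}", "https://example.com:8443/api/users?id=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // https://example.com:8443/api/users
}
```

Dropping the query string this way is a common step before deduplicating endpoints with sort -u.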

For integration into custom tooling, galer functions as a Go library. The API surface is minimal but effective:

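package main

import (
    "fmt"

    // Library package path as shown in the project README.
    "github.com/dwisiswant0/galer/pkg/galer"
)

func main() {
    // Config fields follow the README example. The API is pre-1.0,
    // so verify them against the version you pin.
    cfg := galer.New(&galer.Config{
        Timeout: 60, // seconds allowed per page
    })

    urls, err := cfg.Crawl("https://example.com")
    if err != nil {
        panic(err)
    }

    // Print every URL found on the rendered page.
    for _, u := range urls {
        fmt.Println(u)
    }
}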

The same-host and same-root filtering options address a common reconnaissance challenge. When crawling a target, you often want URLs that remain within scope. The --same-host flag restricts output to URLs sharing the same hostname as the seed URL, while --same-root widens that to the same registrable domain (eTLD+1), so subdomains stay in scope but third-party hosts do not. This prevents scope creep when crawling large sites with extensive external links.
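A same-host scope check is straightforward to sketch with net/url. The sameHost helper below is hypothetical and only illustrates the hostname comparison; it is not taken from galer. A registrable-domain comparison would additionally need the public suffix list, which the standard library does not ship.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameHost keeps only candidates whose hostname equals the seed's hostname.
// Hostnames are compared case-insensitively; unparsable candidates are dropped.
func sameHost(seed string, candidates []string) ([]string, error) {
	s, err := url.Parse(seed)
	if err != nil {
		return nil, err
	}
	var kept []string
	for _, c := range candidates {
		u, err := url.Parse(c)
		if err != nil {
			continue
		}
		if strings.EqualFold(u.Hostname(), s.Hostname()) {
			kept = append(kept, c)
		}
	}
	return kept, nil
}

func main() {
	found := []string{
		"https://example.com/login",
		"https://cdn.example.com/app.js",
		"https://thirdparty.test/widget.js",
	}
	inScope, err := sameHost("https://example.com", found)
	if err != nil {
		panic(err)
	}
	fmt.Println(inScope) // [https://example.com/login]
}
```

Note that the subdomain cdn.example.com is dropped by this strict check, which is exactly the case a same-root filter is meant to keep.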

Under the hood, galer uses the chromedp library for Chrome DevTools Protocol communication. This is leaner than Selenium—no WebDriver binary, no external process management beyond Chrome itself. The tool handles the lifecycle: launching Chrome with appropriate flags, navigating to targets, waiting for network idle, extracting URLs, and cleanup. For security researchers, this removes the tedious browser automation code that would otherwise bloat every tool.

Gotcha

The Chrome dependency cuts both ways. While it enables JavaScript rendering, it also means galer won't run in minimal Docker containers without adding Chromium and its dependencies—roughly 300MB of additional image size. Deployment to AWS Lambda or similar serverless environments requires custom runtimes with Chrome pre-installed. This isn't a showstopper, but it's substantially more complex than deploying a static Go binary that makes HTTP requests.

The pre-1.0 version status isn't just a formality. The README explicitly warns that the library API is not yet stable, and the commit history shows the interface has changed. If you're building production tools that import galer as a library, pin to a specific commit hash and expect to handle breaking changes. For CLI usage in scripts, this matters less: the flag interface has remained relatively stable.

Performance characteristics also require consideration. Because each crawl spawns Chrome instances, galer consumes orders of magnitude more resources than lightweight crawlers. Testing against a list of 1000 URLs with concurrency set to 10 can easily consume 2-3GB of RAM. For large-scale reconnaissance across thousands of targets, you'll need to batch carefully or accept longer run times with lower concurrency.
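Batching a large target list can be as simple as slicing it into fixed-size chunks and running one crawl per chunk. The batches helper below is a hypothetical sketch of that idea, and the batch size is something to tune against available memory rather than a recommended value.

```go
package main

import "fmt"

// batches splits items into consecutive chunks of at most size elements.
// The chunks share the backing array of items; they are not copies.
func batches(items []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	targets := []string{"t1", "t2", "t3", "t4", "t5"}
	for i, b := range batches(targets, 2) {
		// Each batch would be fed to one crawl run before the next starts.
		fmt.Printf("batch %d: %v\n", i+1, b)
	}
}
```

Running batches sequentially keeps peak memory bounded by one batch's worth of browser work, at the cost of a longer total run time.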

Verdict

Use if: You're performing security reconnaissance where JavaScript-rendered content matters: bug bounties, penetration testing, attack surface mapping. You need to discover client-side routes, dynamically loaded resources, or AJAX endpoints that static parsers miss. You're building security tool pipelines and want stdin/stdout composability. You have adequate system resources and can accommodate the Chrome dependency.

Skip if: You're working in resource-constrained environments like minimal containers or embedded systems. You need production-stable APIs for long-term library integration without maintenance overhead. Your targets are primarily static HTML sites where JavaScript rendering provides minimal value. You require features like custom headers or anti-detection measures that aren't yet implemented. For those cases, consider lighter alternatives like hakrawler, or evaluate a more mature crawler such as katana.