
galer: JavaScript-Aware URL Extraction Without the Regex Headaches


Hook

You’re parsing HTML with regex to find URLs. The page loads 47 endpoints via JavaScript after DOM ready. Your regex sees none of them. This is where headless crawlers earn their keep.

Context

Security reconnaissance workflows have a URL discovery problem. Traditional tools grep through HTML source looking for href and src attributes, but modern web applications render most of their navigation structure client-side. A React app might ship with three hardcoded links in the initial HTML, then fetch and render 200 API endpoints after JavaScript executes. Static parsers miss this entirely.

The typical workaround involves running full browser automation frameworks like Puppeteer or Selenium, which work but carry significant overhead. You’re scripting browser interactions, managing page lifecycle events, and writing DOM traversal code just to answer a simple question: what URLs does this page know about? galer was built to solve exactly this problem—extracting URLs from JavaScript-rendered pages without the ceremony of full browser automation frameworks. Inspired by a tweet from security researcher Omar Espino, it uses Chrome DevTools Protocol to evaluate pages and extract values from src, href, url, and action attributes, then exits. No page interaction scripting required.

Technical Insight

System architecture (auto-generated diagram, rendered here as text): input sources (stdin, a file, or a single URL) feed a URL queue consumed by a concurrent worker pool. Each worker drives headless Chrome over a CDP connection: it spawns the browser, waits for the page to render, executes JavaScript, and queries the DOM for src, href, url, and action attributes. Raw URLs then pass through a filter (extension whitelist, same-host/same-root domain scoping); new in-scope pages within the configured depth re-enter the queue, and the extracted URLs go to stdout or the library return value.

At its core, galer appears to spawn a headless Chrome instance and communicate via the DevTools Protocol to execute JavaScript and query the rendered DOM. Unlike static HTML parsers that work with raw response bodies, galer waits for page evaluation (configurable via the --wait flag, defaulting to 1 second) to let JavaScript populate the DOM, then extracts attribute values.

The tool supports both CLI and library usage. As a command-line tool, it’s designed for pipeline integration with other reconnaissance tools. Here’s a realistic security recon workflow:

# Discover subdomains, probe for live hosts, extract all URLs
subfinder -d example.com -silent | httpx -silent | galer --same-root -e js,json

This chain discovers subdomains, filters to live HTTP services, then crawls each to extract JavaScript and JSON file URLs while staying within the same root domain. The --same-root flag implements eTLD+1 scoping, so URLs from api.example.com and www.example.com are both collected, but links to external CDNs are filtered out.
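To make the scoping idea concrete, here is a minimal stdlib sketch of a same-root check. It approximates eTLD+1 by keeping the last two hostname labels, which is wrong for multi-label suffixes like co.uk; a faithful implementation would consult the public suffix list (e.g. golang.org/x/net/publicsuffix). This is an illustration of the concept, not galer's code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rootDomain approximates eTLD+1 by keeping the last two labels of a
// hostname. Real eTLD+1 scoping needs the public suffix list to handle
// suffixes such as "co.uk".
func rootDomain(host string) string {
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return host
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// sameRoot reports whether rawURL shares the target's root domain.
func sameRoot(target, rawURL string) bool {
	t, err1 := url.Parse(target)
	u, err2 := url.Parse(rawURL)
	if err1 != nil || err2 != nil {
		return false
	}
	return rootDomain(t.Hostname()) == rootDomain(u.Hostname())
}

func main() {
	fmt.Println(sameRoot("https://www.example.com", "https://api.example.com/v1")) // true
	fmt.Println(sameRoot("https://www.example.com", "https://cdn.jsdelivr.net/x.js")) // false
}
```

Under this check, api.example.com and www.example.com collapse to the same root while the external CDN is rejected, matching the behavior described above.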

Using galer as a library offers more control over the crawling process. The package exposes a Config struct with options for timeout, concurrency, and filtering:

package main

import (
    "fmt"
    "github.com/dwisiswant0/galer/pkg/galer"
)

func main() {
    // Build a crawler config; timeout is the only option set here,
    // and New finalizes the rest of the configuration.
    cfg := &galer.Config{
        Timeout: 60,
    }
    cfg = galer.New(cfg)

    // Crawl the target and collect every URL found on the rendered page.
    urls, err := cfg.Crawl("https://app.example.com")
    if err != nil {
        panic(err)
    }

    for _, url := range urls {
        fmt.Println(url)
    }
}

The depth parameter enables recursive crawling, following links to a specified level. With --depth 2, galer crawls the initial target, extracts URLs, visits those pages, and extracts their URLs. Combined with --concurrency 50 (the default), this enables fast parallel crawling of entire site sections.
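The depth-bounded worker pool pattern is worth seeing in miniature. The sketch below is not galer's implementation; it shows one way a bounded-concurrency, depth-limited crawl loop can be structured in Go, with a stub crawl function standing in for the real CDP-driven page visit.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlFn stands in for a single-page crawl (galer's real work happens
// over CDP); a stub lets the pool run without a browser.
type crawlFn func(url string) []string

// crawlToDepth visits target and every discovered link up to maxDepth,
// running at most `concurrency` crawls in parallel.
func crawlToDepth(target string, maxDepth, concurrency int, crawl crawlFn) []string {
	type job struct {
		url   string
		depth int
	}
	var (
		mu      sync.Mutex
		seen    = map[string]bool{target: true}
		results []string
		wg      sync.WaitGroup
		sem     = make(chan struct{}, concurrency) // bounds parallelism
	)
	var visit func(j job)
	visit = func(j job) {
		defer wg.Done()
		sem <- struct{}{} // acquire a worker slot
		links := crawl(j.url)
		<-sem // release the slot before bookkeeping
		mu.Lock()
		defer mu.Unlock()
		for _, l := range links {
			results = append(results, l)
			if j.depth < maxDepth && !seen[l] {
				seen[l] = true
				wg.Add(1)
				go visit(job{l, j.depth + 1})
			}
		}
	}
	wg.Add(1)
	go visit(job{target, 1})
	wg.Wait()
	return results
}

func main() {
	fake := func(u string) []string {
		return map[string][]string{
			"https://a.example":   {"https://a.example/x", "https://a.example/y"},
			"https://a.example/x": {"https://a.example/z"},
		}[u]
	}
	urls := crawlToDepth("https://a.example", 2, 50, fake)
	fmt.Println(len(urls)) // prints 3
}
```

The semaphore channel is what keeps a --concurrency-style limit honest: each in-flight page visit holds one slot, so no more than the configured number of browser sessions run at once.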

One particularly clever feature is the template-based output formatting. Instead of dumping raw URLs, you can extract specific components:

galer -u https://example.com -T "{{scheme}}://{{host}}{{path}}"

This strips query parameters and fragments, normalizing URLs for deduplication. The template system exposes variables like scheme, host, port, path, raw_query, and fragment—useful when you need to feed URLs into tools that expect specific formats or when building custom processing pipelines.
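The same idea can be reconstructed with the standard library. Note that Go's text/template requires the dot prefix ({{.scheme}}), whereas galer's documented syntax is {{scheme}}; this sketch is an approximation of the feature, not galer's code, and the variable names are taken from the list above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"text/template"
)

// formatURL renders a parsed URL through a user-supplied template,
// mimicking the idea behind galer's -T flag. Map keys mirror the
// template variables documented in the article.
func formatURL(rawURL, tmpl string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	t, err := template.New("fmt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	err = t.Execute(&b, map[string]string{
		"scheme":    u.Scheme,
		"host":      u.Hostname(),
		"port":      u.Port(),
		"path":      u.Path,
		"raw_query": u.RawQuery,
		"fragment":  u.Fragment,
	})
	return b.String(), err
}

func main() {
	out, _ := formatURL(
		"https://example.com/login?next=%2Fadmin#top",
		"{{.scheme}}://{{.host}}{{.path}}",
	)
	fmt.Println(out) // prints https://example.com/login
}
```

Dropping raw_query and fragment from the template is what makes deduplication work: two URLs that differ only in tracking parameters normalize to the same string.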

The extension filtering (--extension js,php) uses a whitelist approach, showing only URLs ending with specified extensions. This is particularly valuable in bug bounty workflows where you’re hunting for exposed JavaScript bundles that might contain API keys or internal endpoints. Combined with domain scoping, you can build very specific crawls: “show me all .js files from the same root domain, three links deep.”
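A whitelist filter like this has one subtlety worth noting: the extension has to be read from the URL path, not the raw string, or query parameters like ?v=3 defeat the match. A stdlib sketch of the idea (again, an illustration rather than galer's implementation):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// hasExtension reports whether a URL's path ends in one of the
// whitelisted extensions. Parsing first keeps query strings and
// fragments from confusing the extension check.
func hasExtension(rawURL string, exts []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	got := strings.TrimPrefix(path.Ext(u.Path), ".")
	for _, e := range exts {
		if strings.EqualFold(got, e) {
			return true
		}
	}
	return false
}

func main() {
	exts := []string{"js", "json"}
	fmt.Println(hasExtension("https://example.com/app.min.js?v=3", exts)) // true
	fmt.Println(hasExtension("https://example.com/about", exts))          // false
}
```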

Gotcha

The Chrome/Chromium dependency is galer’s biggest operational constraint. Unlike pure Go crawlers that compile to a single binary, galer requires a headless browser installation on the system. In containerized environments, this means including a chrome-headless installation, which adds significant size to container images. For AWS Lambda or similar environments with tight size limits, this may be prohibitive.

The project’s pre-1.0 status is clearly flagged in the README with a caution notice: the API is unstable and not recommended for production use. Several planned features remain unimplemented—the TODO list shows missing support for custom HTTP headers and User-Agent rotation, both important for crawling sites with bot detection. If you need to send authentication tokens or bypass basic fingerprinting, you’ll need to fork and implement it yourself.

Performance characteristics aren’t documented in the README. With a default concurrency of 50 and each crawl requiring a headless browser instance, resource consumption can become significant on large crawls. There’s no guidance on reasonable concurrency limits for different system specs, and no circuit breakers or rate limiting are mentioned. The --wait flag also introduces cumulative delay: at 1 second per page, crawling 1,000 pages accrues 1,000 seconds of aggregate wait time, which even at the default concurrency of 50 still means roughly 20 seconds of unavoidable wall-clock delay on top of the actual page loads.

Verdict

Use galer if you’re building security reconnaissance pipelines where JavaScript-rendered content matters, especially in bug bounty or penetration testing workflows. It excels when chained with tools like subfinder and httpx for automated subdomain enumeration and URL discovery. The library interface is perfect for custom recon tools that need URL extraction as one step in a larger workflow. Skip it if you need production stability (the pre-1.0 warning is explicit), if you’re deploying to size-constrained environments where a Chrome dependency is prohibitive, if you need authentication or custom headers (not yet implemented according to the TODO list), or if you’re crawling static sites where simpler parsing approaches would suffice. For JavaScript-heavy modern web apps in security reconnaissance contexts, galer provides exactly what’s needed without the overhead of full browser automation frameworks.
