webanalyze: Mass Technology Detection Without the Browser Overhead

Hook

While penetration testers manually click through browser extensions to identify technologies on target websites, webanalyze can fingerprint 10,000 hosts in the time it takes to brew coffee.

Context

Every web application runs on a stack—WordPress with specific plugins, React with particular analytics tools, nginx with certain modules. For security researchers and bug bounty hunters, knowing this technology stack before launching attacks is reconnaissance 101. Wappalyzer pioneered this space with a browser extension that identifies technologies through fingerprinting: matching JavaScript variables, HTTP headers, cookie names, and DOM patterns against a signature database.

But browser extensions don't scale. When you're assessing an organization's entire external attack surface—hundreds or thousands of domains and subdomains—you need automation. You need something that runs headless, outputs machine-readable data, and completes in minutes rather than days. That's the gap webanalyze fills. Robin Verton created it as a Go port of Wappalyzer's detection logic, stripping away the UI and optimizing for concurrent mass scanning. It's the difference between reconnaissance as a manual chore and reconnaissance as an automated pipeline step.

Technical Insight

[Figure: System architecture (auto-generated). Target URLs enter a work queue and are picked up by a worker pool of concurrent scanners. Each scanner sends an HTTP GET to the target website and passes the response (HTML, headers, cookies) to a pattern matcher backed by the technologies.json signature database. HTML content, HTTP header, and cookie matches flow into a result aggregator with confidence scoring, which emits the detected technologies as JSON, CSV, or stdout output.]

webanalyze's architecture centers on two core components: a signature database and a concurrent worker pool that matches HTTP responses against those signatures. The signature database is a JSON file containing fingerprint patterns for thousands of technologies. Each entry specifies detection criteria—regex patterns to match against HTML, HTTP headers to look for, cookies that indicate presence, and confidence weights.
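To make the signature format concrete, here is a minimal sketch of parsing one pattern string. It assumes the Wappalyzer convention in which optional tags follow the regex, separated by `\;` (for example `\;version:\1` or `\;confidence:50`). The Pattern type and ParsePattern function are hypothetical names for illustration, not part of webanalyze's API.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Pattern is a hypothetical, simplified view of one signature pattern.
type Pattern struct {
	Regex      *regexp.Regexp
	Version    string // template such as `\1`; empty if no version is captured
	Confidence int    // 0-100; defaults to 100 when the pattern carries no tag
}

// ParsePattern splits a Wappalyzer-style pattern such as
// `nginx(?:/([\d.]+))?\;version:\1\;confidence:50` into its parts.
func ParsePattern(raw string) (Pattern, error) {
	parts := strings.Split(raw, `\;`)
	re, err := regexp.Compile("(?i)" + parts[0])
	if err != nil {
		return Pattern{}, fmt.Errorf("compile %q: %w", parts[0], err)
	}
	p := Pattern{Regex: re, Confidence: 100}
	for _, tag := range parts[1:] {
		key, value, ok := strings.Cut(tag, ":")
		if !ok {
			continue // this sketch ignores malformed tags
		}
		switch key {
		case "version":
			p.Version = value
		case "confidence":
			n, err := strconv.Atoi(value)
			if err != nil {
				return Pattern{}, fmt.Errorf("confidence %q: %w", value, err)
			}
			p.Confidence = n
		}
	}
	return p, nil
}

func main() {
	p, err := ParsePattern(`nginx(?:/([\d.]+))?\;version:\1\;confidence:50`)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Regex.MatchString("nginx/1.25.3"), p.Version, p.Confidence)
}
```

Defaulting the confidence to 100 mirrors the idea that an untagged pattern is treated as a full-strength signal, while a tagged one only contributes part of the evidence.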

Here's how you'd use webanalyze as a library in your own Go tooling:

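package main

import (
    "fmt"
    "log"
    "os"

    "github.com/rverton/webanalyze"
)

func main() {
    // Open the signature database (fetch it first with `webanalyze -update`).
    appsFile, err := os.Open("technologies.json")
    if err != nil {
        log.Fatalf("open technologies.json: %v", err)
    }
    defer appsFile.Close()

    // NewWebAnalyzer reads the signatures from an io.Reader; nil means no
    // custom *http.Client is supplied. These signatures follow the project
    // README, so check them against the library version you import.
    wa, err := webanalyze.NewWebAnalyzer(appsFile, nil)
    if err != nil {
        log.Fatalf("initialization failed: %v", err)
    }

    // Describe the scan: URL, empty body (fetch it online), no extra headers,
    // no crawling, no subdomain search, no redirect following.
    job := webanalyze.NewOnlineJob("https://example.com", "", nil, 0, false, false)

    // Process fetches the page and returns the result plus discovered links.
    result, _ := wa.Process(job)
    if result.Error != nil {
        log.Fatalf("scan failed: %v", result.Error)
    }

    // Print each detected technology with its extracted version, if any.
    for _, m := range result.Matches {
        fmt.Printf("%s: %s (%s)\n", result.Host, m.AppName, m.Version)
    }
}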

The Process method performs an HTTP GET request, then iterates through the signature database looking for matches. Detection draws on several signal types. For headers, it might check whether X-Powered-By: Express is present. For HTML content, it applies regex patterns like <script[^>]+backbone\.js to identify Backbone.js. For cookies, it looks for names like __cfduid, which historically indicated Cloudflare (Cloudflare has since retired that cookie). Each match contributes to a confidence score.

The worker pool pattern is where webanalyze shines for mass scanning. Rather than processing URLs sequentially, it spawns multiple goroutines that pull from a job queue. You can configure the worker count with the -worker flag (defaults to 4). This concurrent design means throughput is limited by network bandwidth and target response times rather than CPU, which is critical when scanning thousands of endpoints.
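The pattern itself is standard Go. The sketch below shows a generic worker pool under the assumption that scanning one host can be modeled as a function from host to result. The scanAll helper and the stubbed scan function are hypothetical and stand in for the fetch-and-fingerprint step, so this is not webanalyze's internal code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// scanAll fans the hosts out to `workers` goroutines and collects one result
// per host. scan is a stand-in for the fetch-and-fingerprint step.
func scanAll(hosts []string, workers int, scan func(string) string) map[string]string {
	type result struct{ host, tech string }
	jobs := make(chan string)
	results := make(chan result)

	// Start the workers; each pulls hosts from the job queue until it closes.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := range jobs {
				results <- result{h, scan(h)}
			}
		}()
	}

	// Feed the queue, then signal that no more jobs are coming.
	go func() {
		for _, h := range hosts {
			jobs <- h
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make(map[string]string, len(hosts))
	for r := range results {
		out[r.host] = r.tech
	}
	return out
}

func main() {
	hosts := []string{"a.example.com", "b.example.com", "c.example.com"}
	found := scanAll(hosts, 4, func(string) string { return "nginx" })

	// Sort the hosts so the output order is stable.
	keys := make([]string, 0, len(found))
	for k := range found {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, found[k])
	}
}
```

Because each worker spends most of its time waiting on the network, raising the worker count usually helps until the targets or your own uplink become the bottleneck.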

Version detection deserves special attention because it's more nuanced than simple presence detection. webanalyze uses regex capture groups from signature patterns to extract version numbers from responses. For example, a pattern might capture the version from a meta tag like <meta name="generator" content="WordPress 6.2.1"> or from JavaScript paths like /wp-includes/js/jquery/jquery.js?ver=3.6.0. This granularity matters enormously in security contexts—knowing a site runs WordPress is useful, but knowing it runs WordPress 5.8.0 (which has known CVEs) is actionable intelligence.
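The capture-group idea can be shown with Go's standard regexp package. The two patterns below are illustrative ones written for the examples in this section, not entries copied from the signature database, and extractVersion is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns: the first capture group holds the version number.
var (
	generatorRe = regexp.MustCompile(`(?i)<meta name="generator" content="WordPress ([\d.]+)"`)
	jqueryRe    = regexp.MustCompile(`jquery(?:\.min)?\.js\?ver=([\d.]+)`)
)

// extractVersion returns the first capture group of re in body, or "" when
// the pattern does not match or has no capture group.
func extractVersion(re *regexp.Regexp, body string) string {
	m := re.FindStringSubmatch(body)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	body := `<meta name="generator" content="WordPress 6.2.1">
<script src="/wp-includes/js/jquery/jquery.js?ver=3.6.0"></script>`

	fmt.Println(extractVersion(generatorRe, body)) // 6.2.1
	fmt.Println(extractVersion(jqueryRe, body))    // 3.6.0
}
```

An empty return value is worth treating as "present, version unknown" rather than "not detected", since many sites strip the generator tag or the ver query parameter.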

The CLI tool supports several output formats that integrate into different workflows. JSON output (-output json) feeds into tools like jq or custom scripts. CSV format works for spreadsheet analysis or bulk processing. The default stdout is human-readable but still parseable. This flexibility makes webanalyze a natural fit in CI/CD pipelines, periodic asset inventory jobs, or interactive reconnaissance workflows.

webanalyze also includes a -crawl flag that follows a configurable number of links from the root page and analyzes the pages it discovers. For subdomain enumeration workflows, tools like subfinder or amass generate the host lists, and you can feed thousands of discovered hosts to webanalyze for technology profiling. This creates a reconnaissance chain: discover subdomains → identify technologies → prioritize targets based on vulnerable software versions.
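The last step of that chain, prioritizing by version, can be sketched as a simple filter. The Detection type, the helper names, and the minimum-version table are hypothetical. The thresholds are illustrative only and are not real advisory data, so a real pipeline would pull them from a vulnerability feed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted numeric versions such as "5.8.0" and "6.2".
// It returns -1, 0, or 1. Missing or non-numeric components count as 0.
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// Detection is a hypothetical record: one technology found on one host.
type Detection struct {
	Host, App, Version string
}

// flagOutdated returns the detections whose version is known and lower than
// the minimum listed for that technology.
func flagOutdated(ds []Detection, minimum map[string]string) []Detection {
	var out []Detection
	for _, d := range ds {
		want, ok := minimum[d.App]
		if ok && d.Version != "" && compareVersions(d.Version, want) < 0 {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	detections := []Detection{
		{Host: "blog.example.com", App: "WordPress", Version: "5.8.0"},
		{Host: "shop.example.com", App: "WordPress", Version: "6.2.1"},
		{Host: "api.example.com", App: "nginx", Version: ""},
	}
	// Illustrative threshold only, not real advisory data.
	minimum := map[string]string{"WordPress": "6.0"}

	for _, d := range flagOutdated(detections, minimum) {
		fmt.Printf("%s runs %s %s\n", d.Host, d.App, d.Version)
	}
}
```

Detections without a version are skipped here on purpose. Whether to treat "version unknown" as a lead worth chasing is a judgment call for the assessment at hand.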

Gotcha

The most critical limitation is signature staleness. webanalyze relies on the enthec/webappanalyzer fork of Wappalyzer's signature database because Wappalyzer's own signature repository is no longer publicly available. You need to run webanalyze -update manually to fetch the latest signatures, and there's no guarantee this fork stays current with new technologies or detection pattern improvements. If you're scanning for cutting-edge frameworks or recently released software, your detection quality depends entirely on whether someone has contributed those signatures upstream.

False positives and negatives are inherent to fingerprinting approaches. Site operators can hide signatures by removing identifying headers, customizing file paths, or disabling version exposure. Conversely, webanalyze might report a technology that appears in the page but isn't really part of the site's own stack, such as a third-party script. The confidence scores help but aren't foolproof. For high-stakes assessments, you'll want to verify detections manually or cross-reference with other tools like WhatWeb.

The crawling functionality is rudimentary compared to dedicated crawlers. It doesn't execute JavaScript, so single-page applications built with React or Vue that render content client-side will appear nearly empty. There's no sophisticated URL deduplication, robots.txt respect, or rate limiting beyond worker count. If you need to crawl JavaScript-heavy sites, you'll need a headless browser solution, which defeats the performance advantage that makes webanalyze attractive in the first place.

Verdict

Use if: You're conducting security assessments or bug bounties where you need to inventory technologies across dozens to thousands of domains, you're building automated reconnaissance pipelines that need machine-readable output, or you want technology detection as a Go library integrated into custom tooling. It's perfect for situations where speed and scale matter more than 100% detection accuracy.

Skip if: You're only analyzing a handful of sites where the browser extension works fine, you need guaranteed up-to-date signatures for the latest frameworks (Wappalyzer's commercial API is better), you're targeting JavaScript-heavy SPAs that require rendering (look at tools with headless Chrome integration), or you need detection as part of a broader vulnerability assessment (Nuclei or similar scanners include technology detection alongside actual security checks).