Commonspeak2: Mining Google BigQuery to Generate Wordlists from Real Internet Traffic

Hook

The most effective subdomain wordlist isn't curated by security experts. It's extracted from the millions of real websites crawled by HTTPArchive, and Commonspeak2 makes that dataset queryable.

Context

Security researchers and bug bounty hunters have relied on static wordlists for decades. Tools like dirb, dirbuster, and gobuster all depend on manually curated text files containing common file paths, subdomain names, and API endpoints. The problem? These lists become stale within months. A wordlist created in 2020 won't include "/.well-known/apple-app-site-association" or "/api/v1/graphql" if those patterns emerged later.

Commonspeak2 flips this model by treating wordlist generation as a data engineering problem. Instead of manually collecting paths from pentests and GitHub repos, it queries Google BigQuery's massive public datasets—GitHub's entire commit history, HTTPArchive's crawls of millions of websites, and HackerNews's link submissions. By extracting patterns from billions of real-world data points, it generates wordlists that reflect current technology usage. When a new JavaScript framework gains popularity and introduces new routing conventions, those patterns automatically appear in the next month's dataset. It's reproducible security research infrastructure.

Technical Insight

[Figure: System architecture (auto-generated). Processing pipeline: CLI Entry Point -> BigQuery Client (auth credentials, execute SQL query) -> Public Datasets (HTTPArchive requests, GitHub commits, HackerNews data) -> Query Results Stream (raw rows) -> Regex Pattern Extractor (matched strings) -> Normalization Logic (cleaned data) -> Deduplication Map (unique entries) -> Wordlist Output Files]

At its core, Commonspeak2 is a specialized BigQuery client that runs pattern-extraction SQL queries against public datasets and post-processes the results into wordlists. The architecture is straightforward: connect to BigQuery using service account credentials, execute predefined queries that scan terabytes of data, stream results through regex filters and normalization logic, then deduplicate and write to files.

The tool's real intelligence lies in its SQL queries and extraction patterns. For subdomain enumeration, it queries HTTPArchive's request table to extract all unique hostnames from billions of HTTP requests. For content discovery, it mines GitHub commit diffs to find file paths that actually existed in real projects. Here's the conceptual flow for subdomain extraction:

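// Simplified example of how Commonspeak2 processes BigQuery results.
// Assumes the imports "fmt", "log", "cloud.google.com/go/bigquery", and
// "google.golang.org/api/iterator", plus an initialized client, ctx, and outputFile.

// The table name must be wrapped in backticks, which a Go raw string literal
// cannot contain, so the query is built from interpreted string literals.
query := "SELECT DISTINCT NET.HOST(url) AS hostname\n" +
    "FROM `httparchive.requests.2024_01_01_*`\n" +
    "WHERE NET.HOST(url) IS NOT NULL\n" +
    "LIMIT 1000000"

// Stream results from BigQuery
it, err := client.Query(query).Read(ctx)
if err != nil {
    log.Fatalf("running query: %v", err)
}

// Set of unique subdomains seen so far
subdomains := make(map[string]bool)
for {
    var row struct {
        Hostname string
    }
    err := it.Next(&row)
    if err == iterator.Done {
        break
    }
    if err != nil {
        log.Fatalf("reading row: %v", err)
    }

    // Extract subdomain using regex
    subdomain := extractSubdomain(row.Hostname)
    if subdomain != "" && isValid(subdomain) {
        subdomains[subdomain] = true
    }
}

// Write deduplicated results (map iteration order is random, so output is unsorted)
for subdomain := range subdomains {
    fmt.Fprintln(outputFile, subdomain)
}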
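
The snippet above calls extractSubdomain and isValid without defining them. Here is a minimal sketch of what such helpers might look like. The function names come from the snippet, but the bodies are assumptions for illustration, not Commonspeak2's actual code. In particular, the sketch treats the last two labels as the registrable domain, which is wrong for suffixes like "co.uk"; a real implementation would likely consult a public suffix list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelPattern matches one DNS label: letters and digits, with hyphens
// allowed only in the middle, up to 63 characters.
var labelPattern = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$`)

// extractSubdomain returns everything left of the last two labels,
// e.g. "api.staging.example.com" -> "api.staging".
// Assumption: the registrable domain is exactly two labels.
func extractSubdomain(hostname string) string {
	host := strings.ToLower(strings.TrimSuffix(hostname, "."))
	labels := strings.Split(host, ".")
	if len(labels) < 3 {
		return "" // bare domain, nothing to extract
	}
	return strings.Join(labels[:len(labels)-2], ".")
}

// isValid reports whether every label of the subdomain is well-formed.
func isValid(subdomain string) bool {
	for _, label := range strings.Split(subdomain, ".") {
		if !labelPattern.MatchString(label) {
			return false
		}
	}
	return true
}

func main() {
	hosts := []string{"api.staging.example.com", "example.com", "bad_label.example.com"}
	for _, h := range hosts {
		sub := extractSubdomain(h)
		fmt.Printf("%q -> %q (kept: %v)\n", h, sub, sub != "" && isValid(sub))
	}
}
```

Rejecting malformed labels early keeps junk such as tracking hostnames with underscores out of the final wordlist.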

The "deleted files" feature demonstrates clever security-focused thinking. It queries GitHub's commit history specifically for file deletions, reasoning that paths developers intentionally removed might indicate sensitive files (backup files, configuration, test credentials) that could still exist on production servers if cleanup was incomplete. This turns GitHub into an inadvertent database of potentially vulnerable paths.

Framework route extraction goes further by understanding web framework routing syntax. For Rails applications, it parses routes.rb files from GitHub to extract patterns like "resources :users" or "get '/api/v1/posts/:id'", then converts them into actual HTTP paths with intelligent placeholder replacement. A route like "/:company/dashboard" becomes multiple concrete paths by substituting common values for ":company"—"acme", "demo", "test", "admin". This generates paths that generic wordlists would never include because they're application-specific.
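
To make the placeholder idea concrete, here is a small sketch of how a route template could be expanded into concrete paths. The expandRoute function and its behavior are hypothetical, written for this article rather than taken from the Commonspeak2 codebase, and the substitution values are just the examples mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// expandRoute replaces every ":name" path segment with each candidate value
// and returns all resulting concrete paths. Routes with several placeholders
// produce the full cross product.
func expandRoute(route string, values []string) []string {
	paths := []string{""}
	for _, seg := range strings.Split(strings.Trim(route, "/"), "/") {
		candidates := []string{seg}
		if strings.HasPrefix(seg, ":") {
			candidates = values // placeholder segment: try every value
		}
		next := make([]string, 0, len(paths)*len(candidates))
		for _, prefix := range paths {
			for _, c := range candidates {
				next = append(next, prefix+"/"+c)
			}
		}
		paths = next
	}
	return paths
}

func main() {
	values := []string{"acme", "demo", "test", "admin"}
	for _, p := range expandRoute("/:company/dashboard", values) {
		fmt.Println(p)
	}
}
```

Because the output grows multiplicatively with each placeholder, a short value list matters: four values across three placeholders already yields 64 paths per route.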

The streaming architecture is critical for handling BigQuery's massive result sets. Rather than loading millions of rows into memory, Commonspeak2 processes results incrementally, maintaining only a deduplication map in memory. This allows it to handle queries that return tens of millions of paths without consuming excessive RAM.

One subtle design choice: the tool separates query execution from post-processing. You can save raw BigQuery results to JSON, then experiment with different regex patterns and normalization rules without re-running expensive queries. This iteration speed is essential when refining extraction logic, since each BigQuery query can cost $5-30 depending on the dataset size scanned.

Gotcha

The biggest gotcha is cost. BigQuery charges based on data processed, not data returned. A query that scans HTTPArchive's entire request table (multiple terabytes) can easily cost $20-50, even if you only extract 100,000 rows. The framework route extraction queries are particularly expensive because they scan GitHub's full commit history. You'll want to set BigQuery spending limits and carefully test queries on smaller date ranges before running them at full scale. The repository's README warns about this, but it's easy to accidentally run a $100 query.
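
The arithmetic behind those figures is simple enough to sketch. BigQuery can report how many bytes a query would process through a dry run, without executing or billing it; the helper below turns such a byte count into a dollar estimate. The estimateCostUSD function is hypothetical, and the $5 per TiB rate is an assumption used only for illustration, so check current on-demand pricing before relying on it.

```go
package main

import "fmt"

const bytesPerTiB = 1 << 40

// estimateCostUSD converts bytes scanned into an on-demand cost estimate.
// pricePerTiB is supplied by the caller because the rate is an assumption.
func estimateCostUSD(bytesScanned int64, pricePerTiB float64) float64 {
	return float64(bytesScanned) / bytesPerTiB * pricePerTiB
}

func main() {
	const assumedPrice = 5.0 // hypothetical USD per TiB scanned
	for _, tib := range []int64{1, 4, 10} {
		cost := estimateCostUSD(tib*bytesPerTiB, assumedPrice)
		fmt.Printf("%2d TiB scanned -> $%.2f\n", tib, cost)
	}
}
```

Under that assumed rate, the $20-50 range above corresponds to scanning roughly 4 to 10 TiB, regardless of how few rows the query returns.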

Setup friction is significant. You need a Google Cloud account, a project with the BigQuery API enabled, service account credentials downloaded as JSON, and the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at that file. This isn't a "git clone && go run" tool; it's infrastructure that requires cloud platform familiarity. Additionally, the project still manages dependencies with Glide, which is deprecated, rather than Go modules, and that complicates building on modern Go toolchains.

Promised features remain unimplemented years after the initial release. The README mentions NodeJS and Tomcat route extraction, scheduled generation, and smart placeholder substitution, but none of these exist in the codebase. The framework extraction only works for Ruby on Rails. If you're testing a Node.js application and need Express.js route wordlists, this tool won't help.

Verdict

Use if: You're doing professional security research or bug bounty hunting where comprehensive, up-to-date wordlists justify the BigQuery costs, or you're building automated security tooling that needs monthly-refreshed wordlists generated from current internet patterns. The data-driven approach provides genuinely superior coverage for subdomain enumeration and content discovery compared to static lists.

Skip if: You're an individual researcher on a budget (just download the pre-generated wordlists from wordlists.assetnote.io monthly instead of running queries yourself), you need framework-specific routes for anything other than Rails, or you want a zero-configuration tool without Google Cloud dependencies. For most users, consuming the outputs is smarter than running the tool directly.
