Back to Articles

Mining BigQuery for Security Gold: How Commonspeak Generates Wordlists from Billions of Records

[ View on GitHub ]

Mining BigQuery for Security Gold: How Commonspeak Generates Wordlists from Billions of Records

Hook

While most security professionals still rely on manually curated wordlists from 2015, billions of Certificate Transparency logs and web crawl records are sitting in public BigQuery datasets, waiting to reveal exactly what subdomains and paths attackers should target in 2024.

Context

Content discovery is the foundation of security testing. Whether you're a penetration tester probing for hidden admin panels or a bug bounty hunter hunting for forgotten subdomains, your wordlist determines what you find. For years, the industry has relied on manually curated lists—collections compiled from personal experience, leaked databases, and community contributions. Tools like DirBuster and SecLists popularized this approach, but they share a fundamental weakness: they're backward-looking snapshots that quickly become stale.

The web evolves rapidly. New frameworks introduce new default paths. CDN providers create predictable subdomain patterns. Developers adopt new naming conventions. Yet our wordlists remain frozen in time, missing emerging patterns until someone manually adds them. Commonspeak takes a radically different approach: instead of relying on human curation, it queries massive public datasets in Google BigQuery—Certificate Transparency logs, HTTPArchive crawls, StackOverflow discussions, and HackerNews comments—to extract patterns from billions of real-world records. The result is wordlists that reflect current internet reality, not past assumptions.

Technical Insight

Shell Scripts

Google BigQuery

Run cert-transparency.sh

Create temp dataset

Extract subdomains/dirs

Extract terms

Extract patterns

Extract paths

JSON results

JSON results

JSON results

JSON results

Parse fields

Top N by frequency

User Executes Script

Certificate Transparency Logs

StackOverflow Dataset

HackerNews Dataset

HTTPArchive Dataset

Authentication & Setup

SQL Query Execution

jq JSON Processing

grep/sort/uniq Filtering

Ranked Wordlists TXT

System architecture — auto-generated

At its core, Commonspeak is a collection of shell scripts that orchestrate BigQuery queries and process the results into ranked wordlists. The architecture is deliberately simple: each data source gets its own script that handles authentication, query execution, result processing, and output generation. Let's examine how it extracts subdomains from Certificate Transparency logs, one of its most powerful capabilities.

The Certificate Transparency system requires Certificate Authorities to log every TLS certificate they issue to public append-only logs. Google aggregates these into a BigQuery dataset that contains hundreds of millions of certificate records. Commonspeak's cert-transparency.sh script queries this dataset to extract subdomain patterns:

#!/bin/bash

# Create a temporary dataset for processing
bq mk --dataset --default_table_expiration 3600 temp_commonspeak

# Query Certificate Transparency logs for subdomains
bq query --use_legacy_sql=false --format=json \
  --max_rows=1000000 \
  "SELECT 
    REGEXP_EXTRACT(dns_names, r'([^.]+)\\.[^.]+\\.[^.]+$') as subdomain,
    COUNT(*) as frequency
   FROM 
     \`bigquery-public-data.certificate_transparency.all_cert_names\`,
     UNNEST(SPLIT(dns_names, ',')) as dns_names
   WHERE 
     dns_names NOT LIKE '*.%'
     AND dns_names NOT LIKE '%*%'
   GROUP BY subdomain
   ORDER BY frequency DESC" \
  | jq -r '.[] | .subdomain' \
  | grep -v '^null$' \
  | sort | uniq > subdomains.txt

# Take top 10,000 entries
head -n 10000 subdomains.txt > subdomains-top10k.txt

This approach reveals several architectural decisions worth understanding. First, the query uses UNNEST(SPLIT()) to handle certificates that cover multiple domains—a single wildcard certificate might list dozens of subdomains, and the script treats each as a separate signal. Second, it explicitly filters out wildcard patterns (*.example.com) because they don't represent actual subdomains. Third, the REGEXP_EXTRACT isolates just the subdomain portion, normalizing entries across different domains.

The frequency-based ranking is crucial. Rather than treating all observed subdomains equally, Commonspeak prioritizes those that appear most often across the entire certificate database. If api appears in 500,000 certificates while internal-dev-staging-v2 appears in three, the former gets higher priority. This creates wordlists optimized for coverage—you test the most likely paths first.

The HTTPArchive integration follows similar principles but queries a different BigQuery dataset that contains HAR (HTTP Archive) files from millions of web page crawls. The httparchive.sh script extracts directory paths and filenames from actual HTTP requests:

SELECT 
  REGEXP_EXTRACT(url, r'https?://[^/]+/([^/?#]+)') as path_segment,
  COUNT(*) as frequency
FROM 
  `httparchive.summary_requests.2024_01_01_*`
WHERE 
  url LIKE '%/%.%'
GROUP BY path_segment
HAVING frequency > 100
ORDER BY frequency DESC
LIMIT 50000

This query's HAVING clause filters out rare paths (fewer than 100 occurrences) to reduce noise from site-specific content. The date-based table pattern (2024_01_01_*) lets you query specific crawl periods, which is valuable for tracking how web patterns evolve over time. You could generate quarterly wordlists and diff them to identify emerging technologies.

The shell-based architecture makes Commonspeak remarkably extensible. Want to extract API endpoints mentioned in StackOverflow answers? Add a new script that queries bigquery-public-data.stackoverflow.posts_answers and searches for URL patterns. Need to filter results by TLD to target only .gov subdomains? Modify the WHERE clause. The entire data pipeline is transparent and hackable—there's no black box compilation or complex ORM abstraction hiding the logic.

One underappreciated feature is the jq-based post-processing pipeline. Raw BigQuery results come back as JSON, and the scripts use jq to extract fields, filter results, and transform data before writing final wordlists. This enables sophisticated filtering without modifying SQL queries. For example, you might pipe results through jq 'select(.frequency > 1000)' to create an ultra-high-confidence wordlist, or use jq -r '.subdomain | ascii_downcase' to normalize casing.

Gotcha

The biggest gotcha is cost and quota management. While Google provides a generous BigQuery free tier (1TB of queries per month), Commonspeak's queries can easily exceed this when processing datasets like Certificate Transparency logs. A full scan of the all_cert_names table processes hundreds of gigabytes, and at $5 per TB after the free tier, costs accumulate quickly if you're not careful. The scripts don't include cost estimation or optimization—you're responsible for understanding BigQuery pricing and potentially adding LIMIT clauses or WHERE filters to reduce data processed.

Query execution time is another practical limitation. Certificate Transparency queries can take 10-15 minutes to complete, HTTPArchive queries might take 5-8 minutes, and you're completely dependent on Google's infrastructure. There's no progress indicator, no partial results, and if your connection drops mid-query, you start over. This makes Commonspeak unsuitable for iterative workflows where you need quick feedback. You can't casually experiment with different query parameters during a penetration test—wordlist generation is a deliberate, advance-planning activity.

The repository also contains no pre-generated wordlists, which creates a significant barrier to entry. Unlike SecLists where you can clone and immediately start testing, Commonspeak requires GCP account setup, authentication configuration, BigQuery API enablement, and script execution before you get any value. For many security professionals, this friction is enough to abandon the tool entirely. The philosophy seems to be "teach a person to fish" rather than "provide fish," which is intellectually honest but pragmatically frustrating when you just need a wordlist for tomorrow's assessment.

Verdict

Use if: You're a security researcher who needs cutting-edge wordlists that reflect current web technologies and patterns, you're willing to invest time in GCP setup and query optimization, you need customizable wordlists for specific target profiles (e.g., only SaaS companies, only certain TLDs), or you're building automated tooling that regenerates wordlists on a schedule to capture emerging trends. Commonspeak's data-driven approach offers genuinely better coverage than static wordlists for modern web applications. Skip if: You need wordlists immediately for an active engagement, you want to avoid cloud infrastructure costs and complexity, you're testing older systems where historical wordlists are more relevant, or you're a beginner who needs curated lists with explanatory context. In those cases, grab SecLists or Assetnote's pre-generated wordlists and start testing. Commonspeak is a power tool for power users—respect the learning curve or you'll waste more time on setup than you'll save on improved coverage.

// ADD TO YOUR README
[![Featured on Starlog](https://starlog.is/api/badge/developer-tools/pentester-io-commonspeak.svg)](https://starlog.is/api/badge-click/developer-tools/pentester-io-commonspeak)