Automating Google Dorking: How pagodo Weaponizes GHDB for Reconnaissance
Hook
Google indexes billions of pages—including database backups, exposed admin panels, and leaked credentials. pagodo systematically searches for all of them, transforming manual reconnaissance into automated intelligence gathering.
Context
Security researchers have long used Google dorks—specialized search queries that uncover vulnerable systems and sensitive data indexed by Google’s crawlers. The Google Hacking Database (GHDB), maintained by Offensive Security, catalogs these queries across 14 categories, from exposed login portals to files containing passwords. The problem? Manually executing dorks through Google’s web interface is tedious, inconsistent, and doesn’t scale for bug bounty hunters or penetration testers working under time constraints.
pagodo emerged to bridge this gap, automating both the collection of fresh GHDB dorks and their systematic execution against target domains. Before tools like pagodo, security professionals either copied dorks manually from Exploit-DB or wrote brittle scraping scripts that broke with every Google interface change. The tool’s name—Passive Google Dork—reflects its focus on reconnaissance using publicly indexed data rather than active scanning techniques that touch target infrastructure directly.
Technical Insight
pagodo implements a two-stage architecture that separates dork acquisition from dork execution, giving users flexibility in how they maintain and deploy their reconnaissance workflows. The first component, ghdb_scraper.py, scrapes Exploit-DB’s GHDB to extract current dorks organized into 14 categories—from ‘Footholds’ to ‘Advisories and Vulnerabilities.’ This separation means you can refresh your dork database independently without running searches, keeping your intelligence current.
The scraper can be used as both a standalone script and an importable module. When used programmatically, it returns a structured dictionary:
import ghdb_scraper

dorks = ghdb_scraper.retrieve_google_dorks(save_all_dorks_to_file=True)
print(dorks["total_dorks"])                        # Number of dorks extracted
print(dorks["category_dict"][9]["category_name"])  # 'Files Containing Passwords'
The second component, pagodo.py, executes these dorks against Google Search. Under the hood, it uses the yagooglesearch library rather than the older googlesearch package—a crucial architectural decision that enables native proxy support without external wrappers like proxychains4. This matters because large-scale dorking inevitably triggers Google’s anti-bot measures, making proxy rotation essential for serious reconnaissance work.
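As a rough sketch of what round-robin rotation over a comma-separated proxy list looks like (pagodo handles this internally through yagooglesearch; the endpoints below are placeholders, not real servers):

```python
from itertools import cycle

# Placeholder proxy endpoints in the comma-separated format pagodo accepts;
# HTTP, HTTPS, and SOCKS5 schemes can be mixed freely.
proxy_arg = "http://127.0.0.1:8080,socks5h://127.0.0.1:9050,https://127.0.0.1:8443"

# Round-robin rotation: each successive dork search gets the next proxy.
proxy_pool = cycle(proxy_arg.split(","))

for dork_number in range(1, 6):
    proxy = next(proxy_pool)
    print(f"dork #{dork_number} -> {proxy}")
```

With three proxies and five searches, the pool simply wraps around, so consecutive searches never reuse the same exit point.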
pagodo supports HTTP, HTTPS, and SOCKS5 proxies in round-robin fashion via a comma-separated list. Rate limiting is configurable with randomized wait times between searches, though the README acknowledges this won’t prevent detection indefinitely. The tool can also be imported as a module for integration into larger OSINT pipelines:
import pagodo

pg = pagodo.Pagodo(
    google_dorks_file="dorks.txt",
    domain="github.com",
    max_search_result_urls_to_return_per_dork=3,
    save_pagodo_results_to_json_file=None,  # Auto-generates filename
    save_urls_to_file=None,
)
results = pg.go()
The returned dictionary provides structured data with timestamps and per-dork URL collections:
{
    "dorks": {
        "inurl:admin": {
            "urls_size": 3,
            "urls": [
                "https://github.com/marmelab/ng-admin",
                "https://github.com/settings/admin",
                "https://github.com/akveo/ngx-admin"
            ]
        }
    },
    "initiation_timestamp": "2021-08-27T11:35:30.638705",
    "completion_timestamp": "2021-08-27T11:36:42.349035"
}
This data structure makes it trivial to feed results into vulnerability scanners, notification systems, or compliance reporting tools. The module approach also lets you programmatically filter dork categories—for example, only running category 12 (login portals) dorks during an authentication testing phase.
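As a sketch of that downstream consumption, using a results dictionary shaped like the example above:

```python
# Sample results in the shape pagodo returns (copied from the example above).
results = {
    "dorks": {
        "inurl:admin": {
            "urls_size": 3,
            "urls": [
                "https://github.com/marmelab/ng-admin",
                "https://github.com/settings/admin",
                "https://github.com/akveo/ngx-admin",
            ],
        }
    },
    "initiation_timestamp": "2021-08-27T11:35:30.638705",
    "completion_timestamp": "2021-08-27T11:36:42.349035",
}

# Flatten to (dork, url) pairs, ready to hand to a scanner queue or report.
findings = [
    (dork, url)
    for dork, data in results["dorks"].items()
    for url in data["urls"]
]

total_urls = sum(data["urls_size"] for data in results["dorks"].values())
print(f"{total_urls} URLs across {len(results['dorks'])} dorks")
```

Because every entry is keyed by the originating dork, each URL arrives downstream already labeled with the attack pattern that surfaced it.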
The architecture’s elegance lies in its composability. You’re not locked into a monolithic tool; you can run ghdb_scraper.py weekly to maintain fresh dork files, then selectively execute subsets through pagodo.py based on engagement scope. The JSON output from the scraper preserves full metadata about each dork, including GHDB IDs and descriptions, enabling custom filtering logic outside the tool itself.
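A minimal sketch of that custom filtering, selecting only login-portal dorks (category 12) and writing them as the plain-text dork file pagodo.py consumes; the dictionary shape mirrors the fields mentioned above (category names, dork strings), and the exact key names and dorks here are illustrative assumptions:

```python
# Hypothetical scraper output; key names and dork strings are illustrative.
category_dict = {
    9: {"category_name": "Files Containing Passwords",
        "dorks": ["intitle:index.of passwd", 'filetype:env "DB_PASSWORD"']},
    12: {"category_name": "Pages Containing Login Portals",
         "dorks": ["inurl:admin intitle:login", "inurl:wp-login.php"]},
}

# Keep only the category relevant to an authentication testing phase,
# one dork per line, as pagodo.py expects in its dorks file.
selected = category_dict[12]["dorks"]
with open("login_portal_dorks.txt", "w") as fh:
    fh.write("\n".join(selected) + "\n")

print(f"wrote {len(selected)} dorks for '{category_dict[12]['category_name']}'")
```

The resulting file can then be passed straight to pagodo.py, keeping the engagement-scoping logic entirely outside the tool.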
Gotcha
pagodo’s fundamental limitation is openly stated in its README: it violates Google’s Terms of Service. Automated scraping of search results isn’t something Google tolerates at scale, and the tool’s documentation explicitly notes that users assume all legal responsibility. This isn’t just theoretical—running pagodo can result in IP bans, CAPTCHA challenges, or account restrictions if you’re authenticated. The proxy rotation feature helps, but Google’s anti-bot systems are sophisticated enough to detect patterns even across distributed IPs.
Rate limiting with randomized delays only postpones detection; it doesn't prevent it. If you need to execute large numbers of GHDB dorks for a comprehensive assessment, expect to invest in residential proxy infrastructure or accept that your reconnaissance will be spread across days or weeks. The tool also inherits limitations from Google's indexing itself—results depend on what Google has crawled, how recently it crawled, and whether sites use robots.txt to prevent indexing. GHDB dorks vary widely in quality too; many target outdated software versions or yield false positives, meaning you'll spend time manually validating results. For legally compliant operations, Google's Custom Search API is the appropriate choice, though it has query limitations and lacks GHDB integration, making it impractical for the use cases pagodo targets.
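A back-of-envelope estimate shows why a full run stretches over days; the dork count and delay bounds below are illustrative values, not pagodo defaults:

```python
# Rough runtime estimate for a large dork run with randomized delays.
num_dorks = 7000                 # illustrative size of a full GHDB pull
min_delay, max_delay = 37, 60    # seconds between searches; illustrative

avg_delay = (min_delay + max_delay) / 2
total_hours = num_dorks * avg_delay / 3600
print(f"~{total_hours:.0f} hours ({total_hours / 24:.1f} days) of delays alone")
```

Even before accounting for CAPTCHAs, retries, or proxy failures, the waiting time alone approaches four days, which is why comprehensive assessments either accept that schedule or parallelize across proxy infrastructure.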
Verdict
Use pagodo if you’re conducting authorized security assessments or bug bounty reconnaissance where you have permission to enumerate targets and can accept the Terms of Service risk. It’s particularly valuable when you need systematic coverage of known attack patterns during initial enumeration phases, especially if you already have proxy infrastructure. The dual-module design makes it ideal for integration into automated OSINT pipelines where you need programmatic access to both dork collection and execution. Skip it if you require legal compliance with search engine terms (use Google’s official Custom Search API instead), if your target scope is small enough that manual dorking is faster, or if you’re risk-averse about potential IP bans or account restrictions. For production security operations at organizations with strict compliance requirements, commercial OSINT platforms with proper API agreements are the safer choice despite higher costs.