Pagodo: Weaponizing 7,300+ Google Dorks for Automated OSINT Reconnaissance

Hook

Google knows where all the skeletons are buried—exposed configuration files, credential leaks, vulnerable endpoints—and pagodo automates asking Google to show you exactly where they are on your target domains.

Context

Google dorking has been a cornerstone of OSINT and security reconnaissance since the early 2000s, when security researcher Johnny Long began cataloging specialized search queries that expose sensitive information. These "dorks"—crafted search strings using Google's advanced operators like site:, inurl:, filetype:, and intitle:—can surface exposed admin panels, leaked credentials, backup files, and vulnerable applications. The Google Hacking Database (GHDB) on Exploit-DB now contains over 7,300 of these queries, categorized by attack vector.
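
To make the operator syntax concrete, here is a minimal sketch of how a dork can be scoped to a single domain with the site: operator. The helper name scope_dork and the sample dorks are illustrative assumptions written in the style of GHDB entries; they are not copied from GHDB or from pagodo's source.

```python
def scope_dork(dork: str, domain: str) -> str:
    """Restrict a dork to one domain by appending a site: operator."""
    return f"{dork} site:{domain}"


# Illustrative dorks in the style of GHDB entries (hypothetical examples)
example_dorks = [
    'intitle:"index of" "backup"',
    "inurl:admin filetype:php",
    "filetype:env DB_PASSWORD",
]

# Build one domain-scoped query per dork
scoped_queries = [scope_dork(dork, "example.com") for dork in example_dorks]

for query in scoped_queries:
    print(query)
```

Each resulting string is what you would otherwise type into the search box by hand, which is the repetitive step automation removes.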

But manually running thousands of dorks against target domains is soul-crushing work. Copy a dork, paste it into Google, append your target domain, scan results, repeat 7,299 more times. Security researchers and bug bounty hunters needed automation, but Google's aggressive bot detection makes scraping search results a technical challenge. Early tools broke frequently as Google updated its defenses. Pagodo emerged as a purpose-built solution that not only scrapes fresh dorks from GHDB but executes them against targets with intelligent rate limiting, proxy rotation, and structured output—turning what was once a weekend manual task into a scriptable reconnaissance workflow.

Technical Insight

[Figure: System architecture (auto-generated). The GHDB Scraper scrapes dork queries from the Exploit-DB GHDB and saves them, in 14 categories, to a dorks database (JSON/TXT). Pagodo Core loads the dork queries and executes searches through the yagooglesearch library, which routes rate-limited queries (min/max delay, 30-60s, configurable) through an HTTP/SOCKS5 proxy pool to Google Search. Returned URLs come back as structured results and are saved to results storage (JSON/TXT).]

Pagodo's architecture separates concerns into two distinct tools with complementary purposes. The ghdb_scraper.py module handles data acquisition by scraping Exploit-DB's GHDB pages, extracting dork queries, categorizing them into 14 types (footholds, file vulnerabilities, error messages, credentials, etc.), and persisting them to JSON or text files. This separation means you control when to refresh your dork database—you might scrape monthly to catch new dorks while reusing the cached dataset for daily scans.
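
As a rough sketch of reusing a cached dataset, the snippet below groups dork records by category so a daily scan can pick only the categories it needs. The record layout (a list of objects with "category" and "dork" fields) is an assumption made for illustration; the JSON that ghdb_scraper.py actually writes may use different field names, so adapt the keys to the real file.

```python
import json
from collections import defaultdict

# Hypothetical cached records; the real scraper output may be shaped differently.
cached_json = """
[
  {"category": "Files Containing Passwords", "dork": "filetype:env DB_PASSWORD"},
  {"category": "Sensitive Directories", "dork": "intitle:index.of backup"},
  {"category": "Files Containing Passwords", "dork": "filetype:log password"}
]
"""


def group_by_category(records):
    """Map each category name to the list of dorks filed under it."""
    grouped = defaultdict(list)
    for record in records:
        category = record.get("category")
        dork = record.get("dork")
        # Skip incomplete records instead of failing the whole load
        if category and dork:
            grouped[category].append(dork)
    return dict(grouped)


dorks_by_category = group_by_category(json.loads(cached_json))
password_dorks = dorks_by_category["Files Containing Passwords"]
print(password_dorks)
```

Keeping the grouping step separate from the scrape means the monthly refresh and the daily scan can run on different schedules against the same file.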

The real architectural sophistication lives in pagodo.py, which orchestrates search execution through the yagooglesearch library. Unlike older tools that wrapped the googlesearch-python library (which required external proxychains for proxy support), pagodo leverages yagooglesearch's native proxy handling with round-robin rotation across multiple SOCKS5 or HTTP proxies. Here's how you'd use it programmatically:

from pagodo import Pagodo

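# Initialize with your target domain and proxy list.
# Keyword names are shown as used in this article; confirm them against the Pagodo class in your installed version.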
pg = Pagodo(
    google_dorks_file='dorks/all_google_dorks.txt',
    domain='target.com',
    max_search_result_urls_to_return_per_dork=50,
    save_pagodo_results_to_json_file='results.json',
    proxies=['socks5://127.0.0.1:9050', 'http://proxy2.com:8080'],
    min_delay_seconds=30,
    max_delay_seconds=60
)

# Execute searches and get structured results
results = pg.go()

# Results are a dictionary: {dork_query: [url1, url2, ...]}
for dork, urls in results.items():
    if urls:
        print(f"Dork: {dork}")
        for url in urls:
            print(f"  Found: {url}")

The return structure is deliberately simple—a dictionary mapping each dork query to a list of discovered URLs. This makes pagodo trivially composable with other tools in your security automation pipeline. You could pipe results into nuclei for vulnerability validation, feed them to aquatone for visual reconnaissance, or filter by URL patterns to identify high-value targets.
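
The pattern-filtering idea can be sketched as follows, assuming results has the {dork: [urls]} shape described above. The sample data, the HIGH_VALUE pattern, and the helper name high_value_hits are all hypothetical; tune the pattern to whatever counts as high-value for your engagement.

```python
import re

# Hypothetical results in the {dork: [urls]} shape described above
results = {
    "filetype:env DB_PASSWORD": ["https://example.com/.env"],
    "inurl:admin": [
        "https://example.com/admin/login.php",
        "https://example.com/blog/admin-tips",
    ],
    'intitle:"index of"': [],
}

# Illustrative pattern: config/backup file extensions and admin paths
HIGH_VALUE = re.compile(r"(\.env$|\.sql$|\.bak$|/admin/)", re.IGNORECASE)


def high_value_hits(results, pattern=HIGH_VALUE):
    """Keep only the URLs matching the pattern, grouped by the dork that found them."""
    hits = {}
    for dork, urls in results.items():
        matched = [url for url in urls if pattern.search(url)]
        if matched:
            hits[dork] = matched
    return hits


flagged = high_value_hits(results)
for dork, urls in flagged.items():
    print(dork, "->", urls)
```

The filtered dictionary keeps the same shape as the input, so it can be passed to the same downstream tooling as the full result set.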

The rate-limiting logic deserves attention because it's the difference between a successful scan and an IP ban. Pagodo implements randomized delays between searches using min_delay_seconds and max_delay_seconds parameters, creating unpredictable timing that's harder for Google to fingerprint as automated behavior. The delay randomization happens per-search, not per-proxy, so even with proxy rotation you maintain human-like search cadence:

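import random
import time

min_delay_seconds = 30  # lower bound of the wait between searches
max_delay_seconds = 60  # upper bound of the wait between searches

# Draw a fresh wait for every search so the timing never settles into a fixed interval
delay = random.uniform(min_delay_seconds, max_delay_seconds)
time.sleep(delay)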

Coupled with proxy rotation, this creates a reconnaissance workflow that can often run for hours without triggering HTTP 429 (Too Many Requests) responses. However, you're still walking a tightrope: too aggressive and you get blocked, too conservative and your scan takes days.
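
To show how per-search delays and round-robin proxies fit together, here is a minimal planning sketch. It performs no searches and never sleeps; it only pairs each dork with the next proxy and a randomly drawn wait. The function name plan_searches and the optional rng argument are assumptions for illustration, and this is not how pagodo or yagooglesearch is implemented internally.

```python
import random
from itertools import cycle


def plan_searches(dorks, proxies, min_delay_seconds, max_delay_seconds, rng=None):
    """Return (dork, proxy, delay) tuples: proxies rotate, delays are drawn per search."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    if min_delay_seconds > max_delay_seconds:
        raise ValueError("min_delay_seconds must not exceed max_delay_seconds")
    if rng is None:
        rng = random.Random()
    rotation = cycle(proxies)  # round-robin: wraps back to the first proxy
    plan = []
    for dork in dorks:
        # The delay is drawn once per search, independent of which proxy is used
        delay = rng.uniform(min_delay_seconds, max_delay_seconds)
        plan.append((dork, next(rotation), delay))
    return plan


plan = plan_searches(
    dorks=["inurl:admin", "filetype:env", "intitle:login"],
    proxies=["socks5://127.0.0.1:9050", "http://proxy2.com:8080"],
    min_delay_seconds=30,
    max_delay_seconds=60,
    rng=random.Random(7),  # seeded only to make the sketch repeatable
)

for dork, proxy, delay in plan:
    print(f"{dork!r} via {proxy}, then wait {delay:.1f}s")
```

Because every search still waits its own delay, adding proxies spreads requests across IPs but does not shorten the total runtime of a single sequential scan.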

The scraper component uses BeautifulSoup to parse GHDB category pages, extracting dorks from the database table structure. Since Exploit-DB's markup could change at any time, the scraper includes some defensive parsing with try-except blocks to gracefully handle missing fields. The categorization is preserved in the output, so you can selectively run only "vulnerability" dorks or focus on "file containing passwords" if you're hunting specific exposures:

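# Write one dork file per GHDB category (-i) and a JSON copy of the full database (-j)
python ghdb_scraper.py -i -j

# Then run only the categories you care about; the file name below is an assumption, check the dorks/ directory
python pagodo.py -d target.com -g dorks/files_containing_passwords.dorks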
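
The defensive-parsing idea can be illustrated with the standard library alone. In this sketch each scraped row is assumed to be a dictionary holding an HTML anchor under "title_html" and a category name under "category"; both keys and the sample rows are hypothetical, and the real scraper relies on BeautifulSoup rather than html.parser.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text inside the first <a> element of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.parts.append(data)


def extract_dork(row):
    """Return (category, dork), or None when the row is missing or malformed."""
    try:
        parser = AnchorText()
        parser.feed(row["title_html"])
        parser.close()
        dork = "".join(parser.parts).strip()
        category = row["category"].strip()
    except (KeyError, AttributeError, TypeError):
        return None
    return (category, dork) if dork and category else None


rows = [
    {"title_html": '<a href="/ghdb/1">inurl:admin filetype:php</a>', "category": "Footholds"},
    {"title_html": "<a href='/ghdb/2'></a>", "category": "Footholds"},  # empty dork
    {"category": "Sensitive Directories"},  # title field missing
]

parsed = [item for item in map(extract_dork, rows) if item is not None]
print(parsed)
```

Returning None for bad rows lets one malformed entry be skipped without aborting a scrape of thousands of records.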

One architectural decision worth noting: pagodo doesn't implement result deduplication across dorks. If two different dorks discover the same URL, it appears twice in your results dictionary. This is intentional—it preserves signal about which dorks are most effective against your target, information that gets lost if you deduplicate too aggressively. You can easily deduplicate downstream if needed:

# Flatten every dork's URL list into a single set of distinct URLs
unique_urls = {url for urls in results.values() for url in urls}
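
Before deduplicating, the overlap itself can be measured. The sketch below ranks dorks by hit count and lists URLs surfaced by more than one dork, again assuming the {dork: [urls]} shape described earlier; the sample data is hypothetical.

```python
from collections import Counter

# Hypothetical results in the {dork: [urls]} shape
results = {
    "inurl:admin": [
        "https://example.com/admin/",
        "https://example.com/admin/login.php",
    ],
    "intitle:login": ["https://example.com/admin/login.php"],
    "filetype:env": [],
}

# Dorks ordered by how many URLs each one surfaced (most productive first)
hits_per_dork = sorted(
    ((dork, len(urls)) for dork, urls in results.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Count how many distinct dorks found each URL; set() ignores repeats within one dork
url_counts = Counter(url for urls in results.values() for url in set(urls))
shared_urls = {url: count for url, count in url_counts.items() if count > 1}

print(hits_per_dork)
print(shared_urls)
```

A URL that several unrelated dorks agree on is often worth triaging first, which is exactly the signal a premature deduplication would discard.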

Gotcha

The elephant in the room: pagodo fundamentally violates Google's Terms of Service. Google explicitly prohibits automated querying and result scraping, and actively works to detect and block both. Even with careful rate limiting and proxy rotation, you're playing cat-and-mouse with a very sophisticated adversary. Google may block IPs, present CAPTCHAs, return degraded results, or answer every request with HTTP 429 (Too Many Requests). The tool includes mitigation strategies, but there's no guarantee they'll work tomorrow, because Google constantly updates its bot detection.

Execution time is the other major limitation. With 7,300+ dorks and delays of 30-60 seconds between searches to avoid blocking, a full GHDB scan takes days even in the best case. The math is unforgiving: 7,300 dorks × 45 seconds average delay = 328,500 seconds, or about 91 hours (nearly four days) of waiting alone, before counting the time each search itself takes. You can parallelize across multiple IPs with different proxy sets, but managing that infrastructure is non-trivial. In practice, most users run filtered subsets, perhaps 200-500 dorks targeting specific vulnerability categories, which at the same average delay take roughly 2.5 to 6 hours. The tool also depends entirely on Google's index: if your target blocks crawlers with restrictive robots.txt rules or is poorly indexed, pagodo will find little or nothing. Finally, the yagooglesearch library that pagodo depends on is itself a scraping tool and could break whenever Google changes its result page structure.
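
The runtime arithmetic above can be wrapped in a small estimator for planning scans. This is a back-of-the-envelope sketch: the function name estimate_runtime_hours and the workers parameter are assumptions, it counts only the waiting time between searches, and it treats parallel workers as perfectly independent.

```python
def estimate_runtime_hours(dork_count, min_delay_seconds=30, max_delay_seconds=60, workers=1):
    """Estimate total waiting time in hours, using the mean of the delay range."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    mean_delay = (min_delay_seconds + max_delay_seconds) / 2
    total_seconds = dork_count * mean_delay / workers
    return total_seconds / 3600


full_scan = estimate_runtime_hours(7300)            # every dork, one worker
small_subset = estimate_runtime_hours(200)          # filtered subset, low end
large_subset = estimate_runtime_hours(500)          # filtered subset, high end
parallel_scan = estimate_runtime_hours(7300, workers=4)

print(f"full scan: {full_scan:.2f} h")
print(f"200 dorks: {small_subset:.2f} h")
print(f"500 dorks: {large_subset:.2f} h")
print(f"full scan, 4 workers: {parallel_scan:.2f} h")
```

Running the numbers before a scan makes the trade-off explicit: cutting the dork list usually saves far more time than adding proxy infrastructure.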

Verdict

Use if: You're conducting bug bounty reconnaissance or penetration testing against large web properties where comprehensive coverage justifies multi-hour scan times, you have access to proxy infrastructure (residential proxies work best), you need structured output for integration with other security tools, and you're comfortable with Terms of Service violations in the context of authorized security testing. Pagodo excels at automated breadth: it systematically checks thousands of exposure vectors that human researchers would never test by hand.

Skip if: You need results in minutes rather than hours, you're unwilling to risk IP bans or Terms of Service violations, your targets aren't well-indexed by Google, you're operating at a scale where Google's Custom Search API (100 free queries per day, $5 per 1,000 after) is more cost-effective than managing proxy infrastructure, or you need legally defensible reconnaissance methods. For production security monitoring, the API or commercial OSINT platforms are safer bets despite higher costs.
