GitHub Dorks: Mining Public Repositories for Accidentally Leaked Secrets
Hook
Every 30 seconds, a developer commits AWS credentials to GitHub. While GitHub now scans for some secrets automatically, thousands of leaked API keys, database passwords, and private keys remain publicly accessible through simple search queries—if you know what to look for.
Context
The problem of leaked secrets in version control predates GitHub itself, but the centralization of millions of repositories on a single platform created an unprecedented security challenge. Developers frequently commit .env files, configuration with hardcoded passwords, or entire credential stores during rapid prototyping, then forget to remove them before pushing to public repositories. Even when removed in later commits, these secrets remain in git history forever unless explicitly purged.
Traditional secret scanning tools focus on regex patterns and entropy analysis—looking for strings that "look like" passwords or API keys. But GitHub's powerful search API enables a different approach: targeted queries (called "dorks," borrowed from Google hacking terminology) that search for specific file patterns, code structures, or configuration formats known to contain secrets. The github-dorks project maintains a curated collection of these search patterns and provides a simple Python wrapper to automate scanning across repositories, users, or entire organizations. It's security reconnaissance through search engine exploitation rather than code analysis.
Technical Insight
At its core, github-dorks is a thin orchestration layer around the GitHub Search API. The tool's value comes less from sophisticated code and more from its curated dork collection—a knowledge base of 40+ search patterns targeting everything from AWS credentials to Slack webhooks. Here's how the architecture works:
The main scanning logic iterates through dorks defined in github-dorks.py, constructs search queries, and handles pagination. A typical dork looks like this:
{
  "name": "AWS API Key",
  "query": "extension:json aws_access_key_id",
  "description": "Search for AWS credentials in JSON files"
}
The tool transforms this into a GitHub code search query, scoped to your target (repository, user, or organization). For example, scanning an organization called "acme-corp" for AWS keys becomes: org:acme-corp extension:json aws_access_key_id. The github3.py library handles authentication and API communication:
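import os
from github3 import login

# Authenticate with a personal access token read from the environment.
gh = login(token=os.environ['GITHUB_TOKEN'])

# Placeholder target and dork, for illustration only.
org_name = "acme-corp"
dork = {"query": "extension:json aws_access_key_id"}
results = gh.search_code(
    query=f"org:{org_name} {dork['query']}",
    per_page=100
)
for result in results:
    print(f"Found in {result.repository.full_name}: {result.path}")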
The brilliance lies in the dork patterns themselves. Instead of complex regex, they leverage GitHub's indexed search fields. The query filename:.npmrc _auth finds npm configuration files containing authentication tokens. The pattern extension:pem private locates private key files that should never be committed. The search filename:.env DB_PASSWORD catches environment files with database credentials. Each dork encodes security domain knowledge—places developers commonly leak secrets.
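The scoping step described above can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the project's actual code: the function name `scope_query` and its keyword arguments are hypothetical. Only the `org:`, `user:`, and `repo:` qualifiers and the dork strings are taken from GitHub's search syntax and the examples in this article.

```python
from typing import Optional


def scope_query(dork: str, *, org: Optional[str] = None,
                user: Optional[str] = None,
                repo: Optional[str] = None) -> str:
    """Prefix a dork with at most one GitHub search scope qualifier.

    Hypothetical helper for illustration; not taken from github-dorks.
    """
    scopes = [(name, value) for name, value in
              (("org", org), ("user", user), ("repo", repo)) if value]
    if len(scopes) > 1:
        raise ValueError("pass only one of org, user, or repo")
    if not scopes:
        return dork  # no scope qualifier added
    name, value = scopes[0]
    return f"{name}:{value} {dork}"


# Dork strings quoted in the text above.
DORKS = [
    "filename:.npmrc _auth",
    "extension:pem private",
    "filename:.env DB_PASSWORD",
]

for dork in DORKS:
    print(scope_query(dork, org="acme-corp"))
```

Keeping the dork string separate from the scope means the same pattern list can be reused unchanged for a repository, a user, or an organization.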
Rate limiting is the primary challenge when scanning large codebases. GitHub's search API allows roughly 30 authenticated requests per minute for most search endpoints, and GitHub's documentation lists a lower limit of 10 per minute for code search. The tool handles this with a simple wait-and-retry mechanism:
import time

# Check the search-specific quota; 'reset' is a Unix timestamp.
search_limit = gh.rate_limit()['resources']['search']
if search_limit['remaining'] == 0:
    # Wait until the window resets, plus a 10-second safety margin.
    sleep_duration = max(search_limit['reset'] - time.time(), 0) + 10
    print(f"Rate limit exceeded. Sleeping {sleep_duration:.0f} seconds...")
    time.sleep(sleep_duration)
This approach trades speed for completeness: a scan of a large organization might take hours, but the tool waits out the rate limit rather than crashing or skipping queries. For CSV export, the tool uses Python's csv module to write findings with repository, path, and URL fields, enabling import into spreadsheets or SIEM systems.
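The CSV export can be sketched as follows. This is an illustrative sketch, not the tool's own export code: the function name `write_findings`, the column names, and the sample finding are all assumptions made for the example.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

# Column names assumed from the fields mentioned in the text.
FIELDS = ["repository", "path", "url"]


def write_findings(findings: Iterable[Mapping[str, str]], out: TextIO) -> int:
    """Write findings as CSV rows with a header; return the number of rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    count = 0
    for finding in findings:
        # Missing fields become empty cells instead of raising.
        writer.writerow({field: finding.get(field, "") for field in FIELDS})
        count += 1
    return count


# Made-up sample finding, for illustration only.
sample = [{
    "repository": "acme-corp/api",
    "path": "config/.env",
    "url": "https://github.com/acme-corp/api/blob/main/config/.env",
}]

buffer = io.StringIO()
written = write_findings(sample, buffer)
print(buffer.getvalue())
```

Writing to any text stream rather than a fixed filename keeps the sketch easy to test and lets the same function target a file opened with `open("results.csv", "w", newline="")`.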
The Docker containerization is straightforward but valuable for reproducibility. The Dockerfile installs Python dependencies, copies the dork list and scanning script, and sets the entrypoint to run searches. This means security teams can deploy the tool in CI/CD pipelines without worrying about Python environment conflicts:
docker run -e GITHUB_TOKEN=$TOKEN techgaun/github-dorks \
    -u target-username -o results.csv
One underappreciated feature is GitHub Enterprise support. By setting the GITHUB_URL environment variable, you can run the same dorks against internal GitHub instances, making this tool useful for auditing private corporate repositories where the risk of credential exposure is arguably higher.
Gotcha
The sequential nature of API calls makes this tool painfully slow for large-scale scans. With search rate limits of at most 30 requests per minute (GitHub documents 10 per minute for code search) and 40+ dorks to check, scanning even a moderately active organization can take multiple hours. There's no parallelization, no caching of results, and no incremental scanning: every run starts from scratch. If you're scanning thousands of repositories, you'll quickly find the tool's simplicity becomes a liability.
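A rough back-of-the-envelope estimate shows where the hours come from. The sketch below assumes strictly sequential requests, one query per dork per target, and a fixed number of result pages per query. The function name `estimate_scan_minutes` and the figure of 200 repositories are hypothetical, and the estimate ignores network latency and retries, so it is best read as a lower bound.

```python
def estimate_scan_minutes(num_dorks: int, num_targets: int,
                          pages_per_query: int,
                          requests_per_minute: float) -> float:
    """Lower-bound scan time in minutes for a rate-limited, sequential scan."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    total_requests = num_dorks * num_targets * pages_per_query
    return total_requests / requests_per_minute


# 40 dorks from the text; 200 repositories is an assumed example size.
minutes = estimate_scan_minutes(num_dorks=40, num_targets=200,
                                pages_per_query=1, requests_per_minute=30)
print(f"{minutes:.0f} minutes (about {minutes / 60:.1f} hours)")
# prints "267 minutes (about 4.4 hours)"
```

Because the request rate is a parameter, the same arithmetic can be rerun with whatever limit applies to the endpoint and account in use.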
The output lacks sophistication that security teams need for triage. Every match is reported equally—a .env.example template file with placeholder credentials triggers the same alert as a production .env with real AWS keys. There's no severity scoring, no automatic verification that found credentials are valid, and no deduplication when the same secret appears in multiple files or branches. The maintainers acknowledge the output formatting needs work, but this limitation means you'll spend significant time manually filtering false positives. For organizations with hundreds of repositories, the signal-to-noise ratio can make results overwhelming. The tool also only searches the default branch and current files—it won't find secrets that were committed then removed, which is precisely where many real leaks hide.
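Some of this manual filtering can be scripted after the fact. The sketch below is a hypothetical post-processing step, not a feature of github-dorks: the names `looks_like_template` and `triage` and the suffix list are assumptions. It deduplicates on repository and path only, which is weaker than deduplicating on the secret value itself, and it does nothing to verify that a credential is valid.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

# Assumed suffixes that commonly mark placeholder or template files.
TEMPLATE_SUFFIXES = (".example", ".sample", ".template", ".dist")


def looks_like_template(path: str) -> bool:
    """Return True when the file name ends with a known template suffix."""
    name = PurePosixPath(path).name.lower()
    return name.endswith(TEMPLATE_SUFFIXES)


def triage(findings: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop template-like files and repeated (repository, path) pairs.

    Input order is preserved for the findings that are kept.
    """
    seen = set()
    kept = []
    for repo, path in findings:
        key = (repo, path)
        if key in seen or looks_like_template(path):
            continue
        seen.add(key)
        kept.append(key)
    return kept


# Made-up findings, for illustration only.
raw = [
    ("acme-corp/api", ".env"),
    ("acme-corp/api", ".env.example"),
    ("acme-corp/api", ".env"),
    ("acme-corp/web", "config/.env.sample"),
]
print(triage(raw))
```

A suffix check like this will miss templates with unusual names and may hide a real secret that someone pasted into an example file, so dropped entries are worth spot-checking.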
Verdict
Use if: You need a one-time security audit of your GitHub organization and can tolerate multi-hour scan times, you want to learn about common credential leak patterns (the dork list is excellent security education), you're conducting security research and need reproducible search queries, or you're scanning GitHub Enterprise instances where commercial tools lack support. The Docker deployment and CSV export make it viable for periodic compliance checks.
Skip if: You need real-time secret detection (implement pre-commit hooks with tools like detect-secrets instead), you're securing high-velocity development teams (GitHub's built-in secret scanning or GitGuardian catch secrets at commit time), you require git history analysis to find removed-but-not-purged secrets (use truffleHog or gitleaks), or you need low false-positive rates for large organizations (commercial solutions invest heavily in validation and filtering).
This tool shines for occasional audits and manual investigations, not continuous monitoring.