Inside GitHub's Secret Detection Engine: A Pattern Library for Catching Credentials in Code
Hook
A single misplaced AWS key in a public repository can reportedly cost a company as much as $50,000 in cloud fees within 24 hours. GitHub Advanced Security guards against this with regex patterns that scan every commit, but by the estimate used here the default patterns catch only about 60% of real-world secrets.
Context
The credential leak problem has plagued software development since the first developer committed an API key to version control. Traditional approaches—pre-commit hooks, manual code reviews, periodic audits—fail because they're reactive, inconsistent, or come too late in the development cycle. GitHub Advanced Security introduced Secret Scanning in 2020 to solve this at the platform level, automatically flagging commits containing credentials before they reach production.
But here's the challenge: secret formats vary wildly across vendors. An AWS access key looks nothing like a Stripe API token, which looks nothing like a Base64-encoded JWT, which looks nothing like a PGP private key. GitHub's built-in patterns cover common cases, but organizations using niche SaaS tools, legacy systems, or proprietary authentication schemes need custom patterns. The advanced-security/secret-scanning-custom-patterns repository fills this gap—it's a community-maintained library of battle-tested regex patterns that extend GitHub's native detection capabilities. This isn't just a collection of regex; it's a masterclass in practical security pattern engineering.
Technical Insight
The repository organizes patterns into logical categories: generic secrets (API keys, passwords, tokens), vendor-specific credentials (AWS, Stripe, SendGrid, DataDog), PII (SSNs, credit cards, IBANs), and cryptographic material (RSA keys, JWTs). Each pattern follows GitHub's custom pattern format, which requires regex with specific capturing groups and optional metadata like entropy thresholds.
Let's examine a sophisticated pattern for detecting Azure Storage Account Keys. These keys are Base64-encoded strings exactly 88 characters long, but you can't just match any 88-character Base64 string—that would flag legitimate data, test fixtures, and documentation examples. Here's how the pattern achieves precision:
(?i)(?:DefaultEndpointsProtocol=https;AccountName=([a-z0-9]{3,24});AccountKey=([A-Za-z0-9+/]{86}==);EndpointSuffix=core\.windows\.net)
This pattern uses several techniques to reduce false positives. The non-capturing group (it is not a lookahead) requires the full connection string context: not just the key, but the surrounding DefaultEndpointsProtocol, AccountName, and EndpointSuffix components. The account name must be 3-24 alphanumeric characters; Azure requires lowercase, but because the (?i) flag applies to the whole pattern, uppercase account names match as well. The key itself is 86 Base64 characters followed by == padding. This context-aware approach produces far fewer false positives than naive Base64 matching, at the cost of missing connection strings that order their fields differently or use another endpoint suffix.
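The behaviour described above can be checked with a short Python sketch. It uses Python's re module, not GitHub's scanning engine, so flag handling may differ in detail; the account name and the all-"A" key are made-up placeholders, not real credentials.

```python
import re

# The Azure connection-string pattern from the article, split for readability.
AZURE_CONN = re.compile(
    r"(?i)(?:DefaultEndpointsProtocol=https;AccountName=([a-z0-9]{3,24});"
    r"AccountKey=([A-Za-z0-9+/]{86}==);EndpointSuffix=core\.windows\.net)"
)

# Placeholder key: 86 Base64 characters plus "==" padding (88 total).
fake_key = "A" * 86 + "=="
conn = (
    "DefaultEndpointsProtocol=https;AccountName=examplestore;"
    f"AccountKey={fake_key};EndpointSuffix=core.windows.net"
)

match = AZURE_CONN.search(conn)
account_name, account_key = match.group(1), match.group(2)

# The same 88 characters without the surrounding context do not match.
bare_match = AZURE_CONN.search(fake_key)

# Because (?i) is global, an uppercase account name is accepted too.
upper_match = AZURE_CONN.search(conn.replace("examplestore", "EXAMPLESTORE"))
```

Here bare_match is None while upper_match is not, which shows both the benefit of requiring context and the side effect of the case-insensitive flag.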
Entropy-based detection represents another sophisticated technique. Consider the generic password detection pattern:
(?i)(?:password|passwd|pwd)\s*[=:]\s*["']?([a-zA-Z0-9!@#$%^&*()_+\-=\[\]{};':"\\|,.<>\/?]{12,})['"\s]?
This pattern finds variable assignments or configuration entries labeled as passwords. The {12,} length floor already excludes short placeholders like password123 or changeme; the pattern is additionally paired with an entropy requirement (typically 3.5+ bits per character) to filter out longer low-entropy values, such as a dictionary word repeated twice. The entropy calculation analyzes character distribution: random secrets have high entropy, while human-readable placeholders don't. The example below expresses this requirement through the additional_match field:
{
  "name": "High-Entropy Password",
  "regex": "(?i)(?:password|passwd|pwd)\\s*[=:]\\s*[\"']?([a-zA-Z0-9!@#$%^&*()_+\\-=\\[\\]{};':\"\\\\|,.<>\\/?]{12,})['\"\\s]?",
  "target": "variable_assignment",
  "min_secret_length": 12,
  "additional_match": {
    "entropy": 3.5
  }
}
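To make the entropy requirement concrete, here is a minimal sketch of a Shannon entropy check in Python. It assumes "bits per character" means Shannon entropy over the character frequencies of the candidate string; the function names and the sample strings are illustrative and do not come from the repository.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(value).values()
    )


def passes_entropy(value: str, threshold: float = 3.5) -> bool:
    """True if the candidate meets the (assumed) entropy threshold."""
    return shannon_entropy(value) >= threshold


# A repeated dictionary word: long enough for the regex, but low entropy.
low = shannon_entropy("passwordpassword")   # 2.75 bits per character
# Sixteen distinct characters: log2(16) = 4 bits per character.
high = shannon_entropy("k9#Lq2!xVb7@Zr4$")
```

One consequence worth noting: a string of n characters can reach at most log2(n) bits per character under this measure, so a 12-character value tops out near 3.58. With a 3.5 threshold, a random 12-character secret that repeats even one character would be rejected, which is a reason to tune the threshold together with the minimum length.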
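The repository also handles multi-line secrets, which traditional line-based scanning misses. RSA private keys span dozens of lines between -----BEGIN RSA PRIVATE KEY----- and -----END RSA PRIVATE KEY----- markers. Rather than relying on a dotall flag, the pattern uses the [\s\S] character class, which matches any character including newlines, to capture the entire key block. Because a key-type word is required, unlabeled PKCS#8 headers of the form -----BEGIN PRIVATE KEY----- are not matched by this pattern: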
-----BEGIN (RSA|OPENSSH|DSA|EC|PGP) PRIVATE KEY-----([\s\S]{100,})-----END (RSA|OPENSSH|DSA|EC|PGP) PRIVATE KEY-----
Vendor-specific patterns demonstrate deep knowledge of authentication systems. The SendGrid API key pattern, for example, knows that keys start with SG. followed by exactly 22 URL-safe Base64 characters, a period, and 43 more Base64 characters:
SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}
This specificity means very few false positives: if something matches this pattern, it's almost certainly a real SendGrid key. Compare this to generic API key patterns that might match UUIDs, session tokens, or randomly generated identifiers.
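The contrast with generic identifiers can be sketched in Python. The key below is assembled from filler characters purely to satisfy the shape of the pattern; it is not a real SendGrid credential.

```python
import re
import uuid

# The SendGrid pattern from the article.
SENDGRID = re.compile(r"SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}")

# Filler key with the right shape: "SG." + 22 chars + "." + 43 chars.
shaped = "SG." + "a" * 22 + "." + "b" * 43
# One character short in the first segment.
too_short = "SG." + "a" * 21 + "." + "b" * 43

shaped_hit = SENDGRID.fullmatch(shaped) is not None
short_hit = SENDGRID.search(too_short) is not None
# A random UUID is lowercase hex and hyphens, so it never contains "SG.".
uuid_hit = SENDGRID.search(str(uuid.uuid4())) is not None
```

Only shaped_hit is True. The fixed prefix and the exact segment lengths do the filtering, which is why a vendor-specific pattern like this can run without an entropy check.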
The patterns also account for encoding variations. Secrets appear in source code as plain text, Base64-encoded, hex-encoded, or URI-encoded. The AWS access key ID pattern starts with the plain-text form and enumerates the known key ID prefixes:
(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}
This catches the standard unencoded format. Separate patterns detect Base64-wrapped versions by looking for the encoded prefix patterns. This multi-pattern approach increases coverage without sacrificing precision.
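The idea of an "encoded prefix" can be made concrete with a short Python sketch. The helper b64_prefix is a hypothetical name introduced here, and the key ID is the placeholder used in AWS documentation; how the repository's own Base64 patterns are written is not shown in this article, so treat this as one possible derivation.

```python
import base64
import re

# The plain-text AWS access key ID pattern from the article.
AWS_KEY_ID = re.compile(
    r"(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}"
)


def b64_prefix(prefix: str) -> str:
    """Base64 characters that are fully determined by `prefix` alone.

    Each Base64 character carries 6 bits, so only floor(8 * len / 6)
    leading characters are independent of whatever follows the prefix.
    """
    encoded = base64.b64encode(prefix.encode("ascii")).decode("ascii")
    return encoded[: (len(prefix) * 8) // 6]


example_id = "AKIA" + "IOSFODNN7EXAMPLE"  # documentation placeholder
encoded_id = base64.b64encode(example_id.encode("ascii")).decode("ascii")

akia_prefix = b64_prefix("AKIA")                        # "QUtJQ"
plain_hit = AWS_KEY_ID.fullmatch(example_id) is not None
encoded_missed = AWS_KEY_ID.search(encoded_id) is None  # plain pattern misses it
encoded_flagged = encoded_id.startswith(akia_prefix)
```

The derived prefix only holds when the key sits at the start of the encoded data (or at an offset that is a multiple of three bytes). A key embedded mid-string, for example after "aws_key=", encodes to different characters, so a production pattern would need one variant per alignment.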
Gotcha
The fundamental limitation is inherent to regex-based detection: you're playing an endless cat-and-mouse game with false positives and false negatives. Tighten a pattern to reduce false positives, and you'll miss legitimate secrets that don't match the expected format exactly. Loosen it to catch more variations, and you'll drown in alerts about test fixtures, code comments, and documentation examples. A pattern designed for production AWS keys might miss keys with unusual formatting, keys split across multiple variables, or keys constructed dynamically at runtime.
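Both failure modes can be reproduced in a few lines of Python. LOOSE is a hypothetical pattern invented for this illustration, and the two source snippets are made-up examples; the key ID is the placeholder from AWS documentation.

```python
import re

# Tight: the AWS access key ID pattern discussed earlier in the article.
TIGHT = re.compile(
    r"(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}"
)
# Loose (hypothetical): any standalone 20-character uppercase/digit token.
LOOSE = re.compile(r"\b[A-Z0-9]{20}\b")

# False negative: the key is assembled at runtime from two literals.
split_source = 'key = "AKIA" + "IOSFODNN7EXAMPLE"'
# False positive bait: a build label that merely has the right length.
label_source = 'build_tag = "BUILD2024RELEASE0001"'

tight_misses_split = TIGHT.search(split_source) is None
tight_ignores_label = TIGHT.search(label_source) is None
loose_flags_label = LOOSE.search(label_source) is not None
loose_misses_split = LOOSE.search(split_source) is None
```

The tight pattern stays quiet on the build label but cannot see the split key, while the loose pattern flags the harmless label and still misses the split key. Loosening a regex adds noise without necessarily recovering secrets that are constructed dynamically.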
Second, these patterns are definitions only—they're useless without GitHub Advanced Security or a compatible scanning infrastructure. You can't just clone this repo and start scanning your codebase. You need a GitHub Enterprise license with Advanced Security enabled, or you need to adapt these patterns to another tool like Gitleaks or TruffleHog (which requires manual translation since pattern formats differ). Organizations on GitHub Free or GitHub Team can't use custom patterns at all; they're limited to GitHub's built-in pattern set. Additionally, the patterns require ongoing maintenance as vendors change authentication systems, introduce new secret formats, or rotate key structures. A pattern that worked perfectly six months ago might miss newly issued credentials with updated formats.
Verdict
Use if: You're running GitHub Advanced Security and need coverage beyond the default patterns, especially if you use niche SaaS vendors, have compliance requirements around PII detection, or want to learn how to craft effective security regex. This repository provides immediate value for Enterprise customers and serves as an excellent education in pattern engineering techniques like entropy-based filtering, context-aware matching, and false positive mitigation.
Skip if: You're on GitHub Free/Team (custom patterns aren't available), you need an actual scanning tool rather than just pattern definitions, or you're looking for guaranteed zero false positives (impossible with regex-based approaches). If you need vendor-agnostic secret scanning outside GitHub's ecosystem, tools like Gitleaks or TruffleHog provide both patterns and scanning infrastructure in one package.