Building a Secret Scanning Arsenal: Inside GitHub’s Custom Pattern Library
Hook
GitHub’s default secret scanner detects over 200 token types, yet the average enterprise leaks credentials it doesn’t even know to look for—internal API keys, vendor-specific tokens, and proprietary authentication schemes that fall through the cracks.
Context
Secret scanning has evolved from a nice-to-have to a critical security control. After high-profile breaches traced back to hardcoded AWS keys and leaked database credentials in public repositories, platforms like GitHub built automated detection into their core offering. GitHub’s native secret scanning covers major providers—AWS, Azure, Stripe, SendGrid—but every organization has unique secrets: internal API keys with custom formats, legacy system credentials, third-party services outside the mainstream radar.
This gap led to the advanced-security/secret-scanning-custom-patterns repository. Maintained under GitHub’s advanced-security organization, it’s a collection of regex patterns that extend GitHub Advanced Security beyond its defaults. Rather than starting from scratch when you need to detect DataDog API keys embedded in configuration files or IBAN numbers accidentally committed to finance automation scripts, you can start from patterns that ship with validation tests and false positive mitigation built in. The repository packages accumulated experience detecting secrets at scale as reusable YAML configurations.
Technical Insight
The repository’s architecture centers on YAML pattern definitions that GitHub Advanced Security ingests directly. Each pattern file combines multiple elements: the detection regex itself, test cases for validation, and metadata controlling behavior like entropy checking and hyphen handling. Here’s a simplified example of how a custom pattern for JWT tokens is structured:
patterns:
  - name: JWT Token
    type: jwt_token
    regex:
      version: 0.1
      pattern: |
        (?:^|[\s'"`])(
        eyJ[A-Za-z0-9_-]{10,}\.eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}
        )(?:$|[\s'"`])
      start: |
        eyJ
      end: |
        [A-Za-z0-9_-]{10,}
      additional_match:
        - "\\."
    test:
      data: |
        token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
      start_offset: 9
      end_offset: 164
This structure demonstrates several architectural decisions that make these patterns production-ready. First, the regex wraps the token in boundary conditions, (?:^|[\s'"`]) before it and (?:$|[\s'"`]) after it, to avoid matching JWT-like strings inside longer base64 sequences. Second, the start and end markers help GitHub’s scanner quickly identify candidate regions before applying the full regex, a performance optimization crucial when scanning repositories with millions of lines. Third, the embedded test case ensures the pattern works before deployment; GitHub validates these during pattern upload.
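To see what the boundary conditions buy you, the pattern can be exercised outside GitHub. The following is a minimal sketch using Python's re module, not GitHub's scanning engine; it assumes the three lines of the YAML pattern block are joined into one expression, and the names JWT_PATTERN and find_jwt are made up for this illustration.

```python
import re

# Assumption: the three lines of the YAML `pattern` block are joined into a
# single expression before matching. Python's `re` stands in for GitHub's engine.
JWT_PATTERN = re.compile(
    r"""(?:^|[\s'"`])"""  # boundary before the token
    r"""(eyJ[A-Za-z0-9_-]{10,}\.eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,})"""
    r"""(?:$|[\s'"`])"""  # boundary after the token
)

TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"
    ".eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ"
    ".SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
)
SAMPLE = f'token = "{TOKEN}"'


def find_jwt(text):
    """Return the first JWT-shaped token delimited by a boundary, or None."""
    match = JWT_PATTERN.search(text)
    return match.group(1) if match else None


# Quoted token: the surrounding quotes satisfy both boundary groups.
print(find_jwt(SAMPLE) == TOKEN)
# Same characters buried inside a longer unbroken blob: no boundary, no match.
print(find_jwt(f"blob=QUJD{TOKEN}QUJD"))
```

The quoted assignment yields the token, while the same characters embedded in a longer run of base64-like text yield None, which is the behavior the boundary groups are there to produce.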
The repository’s false positive mitigation strategies reveal deeper sophistication. Consider the common password pattern for configuration files:
patterns:
  - name: Generic Password in Config
    type: generic_password
    regex:
      pattern: |
        (?i)(?:password|passwd|pwd)\s*[=:]\s*['"]?(?!\$\{|\$\(|<|null|true|false)([^\s'"]{8,})['"]?
      entropy: 3.5
      additional_not_match:
        - "(?i)(example|sample|test|demo|placeholder)"
        - "(?i)(password123|changeme|default)"
Here, the negative lookahead (?!\$\{|\$\(|<|null|true|false) prevents matching variable interpolations (${PASSWORD}), XML tags, or literal boolean values—common false positives in infrastructure-as-code files. The entropy: 3.5 threshold requires matched strings to have minimum Shannon entropy, filtering out low-complexity strings like “password” or “admin123”. The additional_not_match array explicitly excludes documentation examples that would otherwise trigger alerts.
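The three filters can be approximated in a few lines of Python to get a feel for how they interact. This is a sketch under stated assumptions rather than GitHub's implementation: it assumes the entropy value is per-character Shannon entropy in bits, that the not-match expressions are tested against the captured secret, and the helper names shannon_entropy and is_candidate are hypothetical.

```python
import math
import re
from collections import Counter

PASSWORD_RE = re.compile(
    r"""(?i)(?:password|passwd|pwd)\s*[=:]\s*['"]?"""
    r"""(?!\$\{|\$\(|<|null|true|false)([^\s'"]{8,})['"]?"""
)
NOT_MATCH = [
    re.compile(r"(?i)(example|sample|test|demo|placeholder)"),
    re.compile(r"(?i)(password123|changeme|default)"),
]
ENTROPY_THRESHOLD = 3.5  # bits per character, taken from the pattern above


def shannon_entropy(value):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_candidate(line):
    """Apply the regex, the exclusion list, then the entropy threshold."""
    match = PASSWORD_RE.search(line)
    if match is None:
        return False  # no assignment, or the lookahead rejected the value
    secret = match.group(1)
    if any(rx.search(secret) for rx in NOT_MATCH):
        return False  # documentation-style placeholder
    return shannon_entropy(secret) >= ENTROPY_THRESHOLD


for line in [
    'password = "${DB_PASSWORD}"',   # variable interpolation
    "pwd=admin123",                  # low entropy
    'password = "changeme_now"',     # excluded by additional_not_match
    'password = "kX9#mQ2vLp7!zR4w"', # 16 distinct characters
]:
    print(line, is_candidate(line))
```

One consequence of this definition is worth knowing when tuning: a string of n characters can never exceed log2(n) bits, so an 8-character value tops out at 3.0 and a value needs at least 12 distinct characters to reach 3.5. Under that assumption the threshold, not the {8,} quantifier, sets the effective minimum length.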
The repository organizes patterns by category: common/ for generic secrets, vendor/ for service-specific credentials, pii/ for personally identifiable information. This modularity lets teams cherry-pick relevant patterns rather than adopting everything. An e-commerce platform might prioritize credit card and SSN patterns from pii/, while a fintech startup would focus on IBAN and SWIFT codes. The vendor directory covers dozens of services—AWS access keys (beyond GitHub’s defaults), Okta API tokens, DataDog application keys, Shopify private app passwords—each with format-specific regex tuned to that service’s credential structure.
One particularly clever pattern targets connection strings with embedded credentials, a common mistake in database configuration:
regex:
  pattern: |
    (?i)(mongodb(?:\+srv)?|postgres(?:ql)?|mysql|sqlserver)://
    (?!test|example|user|admin|root:password)([^:]+):([^@\s]{8,})@
This captures the protocol, excludes obvious test credentials via negative lookahead, then extracts username and password components. It demonstrates context-aware matching—understanding that secrets in URI format require different detection logic than standalone API keys. The repository includes similar specialized patterns for RSA private keys (matching PEM headers), JWT tokens (three base64 segments separated by dots), and OAuth tokens (specific prefixes like ghp_ for GitHub personal access tokens).
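A quick way to check what this expression accepts and rejects is to run it against a few URIs. The sketch below uses Python's re module and assumes the two lines of the pattern are concatenated; extract_credentials and the sample hostnames are invented for the example.

```python
import re

# Assumption: the two lines of the YAML `pattern` block are concatenated.
CONN_RE = re.compile(
    r"(?i)(mongodb(?:\+srv)?|postgres(?:ql)?|mysql|sqlserver)://"
    r"(?!test|example|user|admin|root:password)([^:]+):([^@\s]{8,})@"
)


def extract_credentials(uri):
    """Return scheme, username, and password if the URI embeds credentials."""
    match = CONN_RE.search(uri)
    if match is None:
        return None
    scheme, username, password = match.groups()
    return {"scheme": scheme.lower(), "username": username, "password": password}


samples = [
    "postgresql://svc_orders:S3cr3tPassw0rd@db.internal:5432/orders",
    "mongodb+srv://root:password@cluster0.example.net/app",
    "mysql://admin:S3cr3tPassw0rd@db.internal/app",
    "postgres://user_service:S3cr3tPassw0rd@db.internal/app",
]
for uri in samples:
    print(uri, extract_credentials(uri))
```

Only the first URI produces a result. The last two show a trade-off in the lookahead: it rejects any username that merely starts with test, example, user, or admin, so a real credential for an account named admin or user_service is skipped too. Whether that is acceptable depends on your naming conventions.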
Integration with GitHub Advanced Security happens through the secret scanning settings, which can be configured for a single repository or for a whole organization. After enabling secret scanning, administrators open “Custom patterns” and enter each pattern’s expressions and test string from the YAML definition. GitHub compiles the regex, checks it against the test string, and can then scan both historical commits and new pushes. When a match occurs, GitHub creates a security alert, optionally blocking the push if push protection is configured for that pattern. A pattern defined at the organization level applies to every repository with secret scanning enabled, so adding a new vendor token pattern protects hundreds of repos without per-repo configuration.
Gotcha
Regex-based secret detection fundamentally struggles with context. These patterns will flag secrets in test files, documentation, and example code unless you implement path-based exclusions or rely on GitHub’s machine learning layer to suppress low-confidence matches. A pattern detecting database passwords can’t distinguish between password=prod_secret123 in production config versus password=example_password in a README. While the repository includes entropy thresholds and negative lookaheads to mitigate this, expect an initial wave of false positives requiring manual triage and pattern tuning.
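If you export matches for triage, a path filter is one simple way to approximate the exclusions mentioned above. This is a sketch with hypothetical data: the alert dictionaries, the EXCLUDED_GLOBS list, and needs_triage are illustrative and do not reflect the shape of GitHub's alert API.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion globs; fnmatch's "*" also matches "/" characters.
EXCLUDED_GLOBS = ["docs/*", "tests/*", "*/testdata/*", "README*", "*.md"]


def needs_triage(alert):
    """Keep an alert unless its path matches one of the excluded globs."""
    path = alert["path"]
    return not any(fnmatchcase(path, glob) for glob in EXCLUDED_GLOBS)


# Hypothetical alert records; only the "path" key is used by the filter.
alerts = [
    {"path": "config/production.yml", "match": "password=prod_secret123"},
    {"path": "README.md", "match": "password=example_password"},
    {"path": "tests/fixtures/db.yml", "match": "password=fixture_pw_0001"},
]

remaining = [alert["path"] for alert in alerts if needs_triage(alert)]
print(remaining)
```

Filtering by path trades recall for a quieter queue: a real secret pasted into a README would be dropped along with the examples, so the glob list deserves the same review as the patterns themselves.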
The dependency on GitHub Advanced Security is absolute. These patterns are YAML configurations specific to GitHub’s secret scanning engine—they won’t work with GitLab, Bitbucket, or standalone tools without significant reformatting. If you’re evaluating secret scanning across multiple platforms or need coverage beyond GitHub, you’ll need to maintain parallel pattern sets or choose a platform-agnostic tool. Additionally, pattern maintenance becomes an ongoing burden. When AWS changes their access key format or a vendor introduces a new authentication scheme, your custom patterns become stale. Unlike GitHub’s built-in patterns, which GitHub maintains centrally, these custom patterns are your responsibility to monitor and update.
Verdict
Use if: You’re already running GitHub Advanced Security and need detection beyond the 200+ built-in patterns: internal API key formats, niche vendor services, industry-specific PII like healthcare identifiers or financial account numbers. Also use if you’re establishing a security baseline for a large GitHub Enterprise deployment and want to adopt community-vetted patterns rather than building from scratch, saving weeks of regex development and testing.
Skip if: You don’t have GitHub Advanced Security (these patterns are useless without it), you’re just starting with secret scanning and should first evaluate GitHub’s native coverage before adding complexity, you need multi-platform scanning across GitHub/GitLab/Bitbucket (choose TruffleHog or GitGuardian instead), or your team lacks regex expertise to maintain and tune patterns as false positives emerge.
This repository is a force multiplier for mature GitHub security programs, not a starting point for secret scanning beginners.