Secrets-Patterns-DB: The Open-Source Pattern Library Powering Secret Detection at Scale


Hook

Most secret scanning tools detect fewer than 700 credential patterns. Meanwhile, the average enterprise application integrates with 254 SaaS services, each with its own API keys, tokens, and authentication schemes. That's a dangerous math problem.

Context

Secret scanning has become table stakes for security-conscious development teams. Tools like TruffleHog and Gitleaks scan codebases for accidentally committed credentials, preventing the nightmare scenario where an AWS key leaks to a public repository and racks up a six-figure bill before anyone notices.

But here's the problem: these tools are only as good as their pattern databases. TruffleHog v3 ships with around 700 patterns. Gitleaks, focused on quality over quantity, includes roughly 60 high-confidence patterns. Both numbers pale against the reality of modern software development, where applications integrate dozens of third-party services—Stripe for payments, SendGrid for email, Datadog for monitoring, Auth0 for authentication, and hundreds more. Each service has its own credential format. Miss a pattern, and you've got a blind spot. That's where secrets-patterns-db enters the picture: a community-maintained database of 1,600+ regex patterns, designed not to replace secret scanners but to feed them all.

Technical Insight

Secrets-patterns-db inverts the typical approach to secret detection tooling. Rather than building yet another scanner, it focuses on the harder problem: maintaining a comprehensive, validated database of what secrets actually look like in the wild. The architecture is deceptively simple—YAML files containing regex patterns, organized by confidence levels, with Python conversion scripts that transform these patterns into tool-specific formats.

Let's look at how patterns are structured. Here's an example from the database for detecting Slack tokens:

- name: Slack Token
  pattern: (xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})
  confidence: high
  description: Slack API token
  references:
    - https://api.slack.com/authentication/token-types

The confidence field is critical. High-confidence patterns have strict formats with low false-positive rates—like Slack tokens with their distinctive xox prefix and specific length requirements. Medium and low confidence patterns cast wider nets, catching secrets with less distinctive formats at the cost of more noise. This categorization lets security teams tune their scanners based on their tolerance for false positives versus detection coverage.
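One way to act on this categorization is to treat the confidence labels as ordered tiers and pick a floor. The sketch below is illustrative only: the CONFIDENCE_RANK mapping, the at_least helper, and the sample entries are hypothetical and not part of the project, and treating an unlabelled entry as medium is an assumption.

```python
# Hypothetical helper: rank the confidence labels so a scanner can pick a floor.
CONFIDENCE_RANK = {'low': 1, 'medium': 2, 'high': 3}

def at_least(patterns, minimum='medium'):
    """Return the patterns whose confidence is at or above `minimum`."""
    floor = CONFIDENCE_RANK[minimum]
    selected = []
    for p in patterns:
        # Assumption: an entry without a confidence label is treated as medium.
        label = p.get('confidence', 'medium')
        # Unknown labels rank as 0 and are therefore always excluded.
        if CONFIDENCE_RANK.get(label, 0) >= floor:
            selected.append(p)
    return selected

# Made-up entries, only to show the filtering behaviour.
sample = [
    {'name': 'Slack Token', 'confidence': 'high'},
    {'name': 'Generic API Key', 'confidence': 'low'},
    {'name': 'Unlabelled Pattern'},
]

strict = [p['name'] for p in at_least(sample, 'high')]
broad = [p['name'] for p in at_least(sample, 'medium')]
```

A team with little appetite for triage might start at the high tier and lower the floor only once the noise is manageable.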

The conversion script architecture shows the tool-agnostic philosophy in action. The convert.py script takes these YAML patterns and transforms them for different downstream tools:

def convert_to_gitleaks(patterns):
    gitleaks_rules = []
    for pattern in patterns:
        rule = {
            'description': pattern['name'],
            'regex': pattern['pattern'],
            'tags': ['secret', pattern.get('confidence', 'medium')]
        }
        gitleaks_rules.append(rule)
    return {'rules': gitleaks_rules}

def convert_to_trufflehog(patterns):
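    # TruffleHog v3 uses a different structure.
    # NOTE: extract_keywords() is not defined in this excerpt; it is assumed to
    # return literal substrings of the regex for keyword pre-filtering.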
    trufflehog_detectors = []
    for pattern in patterns:
        detector = {
            'name': pattern['name'],
            'keywords': extract_keywords(pattern['pattern']),
            'regex': {pattern['name']: pattern['pattern']}
        }
        trufflehog_detectors.append(detector)
    return trufflehog_detectors

This separation of concerns means pattern contributors don't need to understand the internals of multiple scanning tools. They contribute to one canonical source, and the conversion layer handles tool-specific quirks.
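The conversion code above calls extract_keywords without showing it. As a rough illustration of what such a helper might do, the sketch below pulls literal runs out of a regex by discarding escapes, character classes, and counted quantifiers. This is a hypothetical stand-in, not the project's implementation, and the heuristic is deliberately crude: it does not understand alternation, inline flags, or nested syntax.

```python
import re

def extract_keywords(regex, min_len=3):
    """Heuristically pull literal substrings out of a regex (illustrative only)."""
    # Remove escape sequences first so "\[" is not mistaken for a character class.
    stripped = re.sub(r'\\.', ' ', regex)
    # Remove character classes such as [0-9] or [pborsa].
    stripped = re.sub(r'\[[^\]]*\]', ' ', stripped)
    # Remove counted quantifiers such as {12} or {8,32}.
    stripped = re.sub(r'\{[^}]*\}', ' ', stripped)
    # What remains between metacharacters are candidate literal runs.
    runs = re.findall(r'[A-Za-z0-9_]{%d,}' % min_len, stripped)
    keywords = []
    for run in runs:
        keyword = run.lower()
        if keyword not in keywords:  # keep first-seen order, drop duplicates
            keywords.append(keyword)
    return keywords

slack = r'(xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})'
slack_keywords = extract_keywords(slack)
```

Keywords like these are useful because a scanner can do a cheap substring check before running the full regex on a chunk of text.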

The project's most important technical safeguard is ReDoS (Regular Expression Denial of Service) validation. Poorly constructed regex patterns can cause catastrophic backtracking, where the regex engine spends exponential time trying to match against input. In a secret scanner processing millions of lines of code, that's a production incident waiting to happen. The CI pipeline runs every pattern through a ReDoS detector before merging:

import re2  # Python bindings for Google's RE2 (for example, the google-re2 package); RE2 matches in linear time

def validate_pattern_safety(pattern):
    """Return True if the pattern compiles under RE2, False otherwise."""
    try:
        # RE2 refuses constructs it cannot run in linear time,
        # such as backreferences and lookaround assertions
        re2.compile(pattern)
        return True
    except re2.error:
        # Pattern uses syntax RE2 does not support (or is malformed)
        return False

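This validation step means the patterns can run in RE2-style, linear-time engines (including Go's standard regexp package; Gitleaks and TruffleHog are both written in Go) without regex bombs taking down your CI/CD pipeline. The guarantee belongs to the engine rather than the pattern: RE2 accepts nested quantifiers and simply executes them in linear time, so a pattern that passes this check can still backtrack badly in a backtracking engine such as Python's re module.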
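To make the compile check less of a black box, here is a rough standard-library approximation of the syntax an RE2-style engine refuses: backreferences and lookaround assertions. The function name unsupported_constructs is hypothetical, and this is a text-level heuristic rather than a parser (for instance, it can misfire on a literal "(?=" inside a character class), so it is no substitute for compiling with RE2 itself.

```python
import re

# Group openers that RE2-style engines do not support.
_GROUP_MARKERS = {
    '(?=': 'lookahead',
    '(?!': 'lookahead',
    '(?<=': 'lookbehind',
    '(?<!': 'lookbehind',
    '(?P=': 'backreference',
}

def unsupported_constructs(pattern):
    """Return a sorted list of construct names an RE2-style engine would reject."""
    found = set()
    # Walk escape sequences left to right, so an escaped backslash followed
    # by a digit is not misread as a numeric backreference.
    for escaped in re.findall(r'\\(.)', pattern, flags=re.S):
        if escaped in '123456789':
            found.add('backreference')
    # Drop all escape sequences, then look for the group openers.
    bare = re.sub(r'\\.', '', pattern, flags=re.S)
    for marker, label in _GROUP_MARKERS.items():
        if marker in bare:
            found.add(label)
    return sorted(found)

slack = r'(xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})'
slack_issues = unsupported_constructs(slack)
```

A pre-check like this can produce a friendlier error message for contributors than a bare compile failure.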

The pattern coverage reveals the database's true value. Beyond the usual suspects (AWS keys, GitHub tokens, Stripe keys), it includes patterns for services that individual scanning tools often miss: PayPal Braintree tokens, Telegram bot tokens, HashiCorp Vault tokens, Dynatrace API keys, and hundreds more. The database captures the long tail of credential formats that security teams discover through painful incident response rather than proactive detection.

Integrating the database into a custom scanner is straightforward. Load the YAML, compile the patterns, and scan:

import yaml
import re

def load_patterns(confidence_level='high'):
    with open('db/secrets-patterns.yaml') as f:
        patterns = yaml.safe_load(f)
    return [p for p in patterns if p.get('confidence') == confidence_level]

def scan_content(content, patterns):
    findings = []
    for pattern_def in patterns:
        pattern = re.compile(pattern_def['pattern'])
        for match in pattern.finditer(content):
            findings.append({
                'type': pattern_def['name'],
                'value': match.group(),
                'position': match.span(),
                'confidence': pattern_def['confidence']
            })
    return findings

The beauty of this approach is extensibility. Need to add patterns for your company's internal authentication tokens? Add them to the YAML. Want to filter patterns by service category? Extend the metadata. The database becomes your team's institutional knowledge about what secrets look like, version-controlled and shareable.
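As a sketch of that extensibility, the snippet below layers a team's own entries over the upstream list, using the same field names as the example entry earlier in this article. The merge_patterns helper, the "Acme Internal Token" entry, and its acme_ prefix are all invented for illustration; letting an internal entry override an upstream one with the same name is a design assumption, not project behaviour.

```python
import re

def merge_patterns(upstream, internal):
    """Combine two pattern lists; an internal entry replaces an upstream one with the same name."""
    merged = {p['name']: p for p in upstream}
    for entry in internal:
        # Fail early (re.error) if an internal regex is malformed.
        re.compile(entry['pattern'])
        merged[entry['name']] = entry
    return list(merged.values())

upstream = [{
    'name': 'Slack Token',
    'pattern': r'(xox[pborsa]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})',
    'confidence': 'high',
}]

# Hypothetical internal credential format.
internal = [{
    'name': 'Acme Internal Token',
    'pattern': r'acme_[a-f0-9]{40}',
    'confidence': 'high',
    'description': 'Hypothetical internal service token',
}]

combined = merge_patterns(upstream, internal)
combined_names = [p['name'] for p in combined]
```

Keeping internal entries in a separate file and merging at load time makes it easier to pull upstream updates without conflicts.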

Gotcha

The beta status isn't just a disclaimer—it's a real consideration for production use. Pattern quality varies because contributions come from the community without systematic validation against real-world codebases. A pattern might technically match a credential format but trigger false positives on common code constructs. For example, a loosely-defined JWT pattern might match any base64-encoded string with periods, catching legitimate data encoding in addition to actual tokens.
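The JWT case is easy to reproduce. In the sketch below, both regexes are written for this illustration and are not taken from the database. The loose one matches any three dot-separated runs of base64url characters; the stricter one relies on the fact that a JSON object starting with {" followed by a lowercase key encodes to a string beginning with eyJ. The sample token is built on the fly from dummy values and is not a real credential.

```python
import base64
import json
import re

# Illustrative patterns only; neither is taken from secrets-patterns-db.
LOOSE_JWT = re.compile(r'[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+')
STRICT_JWT = re.compile(r'eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+')

def b64url(obj):
    """Base64url-encode a JSON object without padding, as JWT segments are."""
    raw = json.dumps(obj, separators=(',', ':')).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b'=').decode()

# Dummy token with a placeholder third segment (not a valid signature).
sample_token = '.'.join([
    b64url({'alg': 'HS256', 'typ': 'JWT'}),
    b64url({'sub': 'demo'}),
    'c2lnbmF0dXJl',
])

ordinary_code = 'import com.example.app'

loose_hits_code = LOOSE_JWT.search(ordinary_code) is not None
strict_hits_code = STRICT_JWT.search(ordinary_code) is not None
strict_hits_token = STRICT_JWT.fullmatch(sample_token) is not None
```

The loose pattern flags an ordinary package import, while the anchored one ignores it and still matches the sample token, which is the kind of difference worth checking before enabling a community pattern.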

The regex-only approach has fundamental limitations. Context-dependent cases, such as a password that only matters when combined with a specific username, or an API key deliberately included in documentation as a fake example, are beyond what a regex can judge and will generate false positives. The database can't distinguish between api_key = 'sk_live_abc123' in production code and the same string in a test fixture or code comment. More sophisticated secret scanners layer entropy analysis, git history checks, and context understanding on top of pattern matching. This database gives you the patterns but not the surrounding intelligence. You'll need to build or buy that layer separately, and tuning false positive rates will require ongoing work as your codebase evolves.
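Entropy analysis is the simplest of those extra layers to sketch. The snippet below computes Shannon entropy in bits per character and uses it as a second opinion on a regex match. The looks_random helper and its threshold of 3.0 are illustrative assumptions; a real scanner would tune the cutoff per character set and string length.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_random(candidate, threshold=3.0):
    """Assumed cutoff: treat strings at or above `threshold` bits/char as random-looking."""
    return shannon_entropy(candidate) >= threshold

# A dictionary word scores lower than a random-looking string of distinct characters.
word_score = shannon_entropy('password')
random_score = shannon_entropy('q7Zp2LkX9vB4mT1c')
```

Entropy alone is a weak signal for short strings, which is why it tends to be combined with pattern matches rather than used on its own.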

Verdict

Use if: You're building a custom secret scanning pipeline and need comprehensive pattern coverage beyond what commercial tools provide; you're augmenting existing scanners like Gitleaks or TruffleHog and want to catch the long tail of credential formats they miss; you're a security team that wants version-controlled, auditable pattern management where you can add internal credential formats alongside public ones; or you're researching secret detection techniques and need a corpus of real-world patterns to analyze.

Skip if: You need a production-ready scanner that runs out of the box (this is the pattern database, not the scanner itself); you require enterprise-grade accuracy with proven false-positive rates and vendor support; you're scanning highly polyglot codebases where context-aware detection matters more than pattern breadth; or you lack the security engineering resources to validate and tune patterns for your specific environment.

For most teams, the sweet spot is using this database to supplement an existing scanner's patterns rather than replacing them entirely.