Building Your Own Secrets Scanner: Inside EarlyBird's Pattern-Matching Architecture

Hook

American Express open-sourced their secrets detection tool in 2020, yet it has just 769 stars on GitHub. This relative obscurity masks a thoughtfully designed architecture that prioritizes extensibility over popularity.

Context

The problem of hardcoded secrets in source code predates modern version control systems, but git made it far worse: every clone carries the full history, so a committed secret travels with the repository. Every developer knows the sinking feeling of realizing they've committed an API key to a public repository. By the time you notice, scrapers may already have harvested it, and because the commit stays in history unless you rewrite it, deleting the line in a later commit does not remove the secret.

EarlyBird emerged from American Express's internal security practices as a practical answer to this problem. Rather than building another one-size-fits-all secrets scanner, they designed a tool around the reality that different organizations have different definitions of what constitutes a secret. A financial services company cares deeply about credit card test data in unit tests; a healthcare startup needs to catch HIPAA-sensitive patterns. EarlyBird's core philosophy is configurability: it ships with sensible defaults but expects you to customize detection rules to match your threat model. This positions it differently from competitors like TruffleHog or GitLeaks, which optimize for out-of-the-box accuracy through increasingly sophisticated heuristics.

Technical Insight

EarlyBird's architecture revolves around three core components: a file scanner, a module-based detection engine, and a configurable labeling system. The scanner walks directory trees (or cloned git repositories), filters files by extension and size, and feeds content through the detection pipeline. The interesting design decisions happen in the detection engine.
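The scanner stage described above can be sketched in a few lines of Go. This is an illustration only, not EarlyBird's actual code: the names shouldScan and collect, the extension allow-list, and the 1 MiB size cap are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// shouldScan reports whether a file passes the extension and size filters.
// The allow-list and size cap are hypothetical knobs for this sketch.
func shouldScan(name string, size int64, exts map[string]bool, maxBytes int64) bool {
	ext := strings.ToLower(filepath.Ext(name))
	return exts[ext] && size <= maxBytes
}

// collect walks root and returns the paths of regular entries that pass shouldScan.
func collect(root string, exts map[string]bool, maxBytes int64) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if shouldScan(d.Name(), info.Size(), exts, maxBytes) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	exts := map[string]bool{".go": true, ".js": true, ".json": true}
	paths, err := collect(".", exts, 1<<20) // skip files over 1 MiB
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	fmt.Println(len(paths), "files queued for scanning")
}
```

Keeping the filter as a pure function separate from the directory walk makes the policy easy to test without touching the filesystem.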

The tool organizes detection logic into modules stored in .go-earlybird/modules/. Each module is a JSON file containing regex patterns, context rules, and metadata. Here's a simplified example of what a module looks like:

{
  "name": "AWS Keys",
  "patterns": [
    {
      "pattern": "AKIA[0-9A-Z]{16}",
      "caption": "AWS Access Key ID",
      "severity": "high",
      "confidence": "high"
    },
    {
      "pattern": "(?i)aws(.{0,20})?['\"][0-9a-zA-Z/+]{40}['\"",
      "caption": "AWS Secret Access Key",
      "severity": "critical",
      "confidence": "medium"
    }
  ]
}

This modularity is EarlyBird's superpower. Want to add detection for your company's proprietary API token format? Drop a new JSON file in the modules directory. Need to adjust confidence levels because you're getting too many false positives on base64-encoded content? Edit the pattern object without touching Go code.
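To make the "rules as data" idea concrete, here is a minimal Go sketch of loading a module and compiling its patterns. The struct fields mirror the simplified JSON example above rather than EarlyBird's real schema, and PatternDef, Rule, compileModule, and the "acme_" token format are hypothetical names invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// PatternDef mirrors one entry of the simplified module example.
type PatternDef struct {
	Pattern    string `json:"pattern"`
	Caption    string `json:"caption"`
	Severity   string `json:"severity"`
	Confidence string `json:"confidence"`
}

// Module mirrors the top-level object of the simplified module example.
type Module struct {
	Name     string       `json:"name"`
	Patterns []PatternDef `json:"patterns"`
}

// Rule pairs a pattern definition with its compiled regular expression.
type Rule struct {
	PatternDef
	Re *regexp.Regexp
}

// compileModule parses module JSON and compiles every pattern, failing fast
// on a malformed regex so a bad rule is caught at load time, not mid-scan.
func compileModule(raw []byte) ([]Rule, error) {
	var m Module
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("parse module: %w", err)
	}
	rules := make([]Rule, 0, len(m.Patterns))
	for _, p := range m.Patterns {
		re, err := regexp.Compile(p.Pattern)
		if err != nil {
			return nil, fmt.Errorf("module %q, rule %q: %w", m.Name, p.Caption, err)
		}
		rules = append(rules, Rule{PatternDef: p, Re: re})
	}
	return rules, nil
}

func main() {
	// A made-up proprietary token format, defined purely in data.
	raw := []byte(`{"name":"Internal Tokens","patterns":[
		{"pattern":"acme_[0-9a-f]{32}","caption":"Acme API token","severity":"high","confidence":"high"}]}`)
	rules, err := compileModule(raw)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	line := "token = acme_0123456789abcdef0123456789abcdef"
	for _, r := range rules {
		fmt.Println(r.Caption, "matched:", r.Re.MatchString(line))
	}
}
```

Compiling at load time also surfaces typos such as an unclosed character class before any file is scanned.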

The detection engine processes files concurrently using Go's goroutines. Each file gets dispatched to a worker that applies all module patterns using Go's regexp package. This is where you see the tradeoff between simplicity and sophistication: EarlyBird uses standard regex matching rather than entropy analysis, context-aware parsing, or machine learning. A pattern either matches or it doesn't.

To reduce false positives, EarlyBird implements a label-based filtering system. The .go-earlybird/labels.json file lets you annotate patterns with custom tags and then configure which combinations should suppress alerts:

{
  "labels": [
    {
      "label": "test-data",
      "patterns": ["test/", "mock", "fixture"],
      "file-patterns": [".*_test\\.go$", ".*\\.spec\\.js$"]
    }
  ],
  "suppress": [
    {
      "when": ["test-data", "low-confidence"],
      "action": "ignore"
    }
  ]
}

This system acknowledges a fundamental truth about secrets scanning: context matters enormously. A hardcoded password in production code is a critical vulnerability; the same pattern in a documented example or test fixture might be acceptable. By making these decisions explicit and configurable, EarlyBird pushes policy into data rather than code.

The CLI supports multiple scan modes. You can run a basic scan with earlybird scan --path ./my-repo, integrate it into pre-commit hooks using earlybird --git-diff, or stand up a REST API server for centralized scanning. The API mode is particularly interesting for organizations wanting to scan pull requests in CI/CD without giving every pipeline runner full repository access:

# Start API server
earlybird server --port 8080

# Submit scan request
curl -X POST http://localhost:8080/scan \
  -H "Content-Type: application/json" \
  -d '{"git_url": "https://github.com/user/repo.git", "branch": "main"}'

The output is structured JSON, making it straightforward to parse in CI systems or feed into security dashboards. Each finding includes file path, line number, matched pattern, severity, and confidence level: everything you need to triage and prioritize remediation.
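As a rough idea of what consuming that output in a CI step could look like, the Go sketch below decodes a list of findings and orders them by severity. The JSON field names ("file", "line", "pattern", "severity", "confidence") are assumptions based on the fields listed above, not EarlyBird's documented output schema, and triage and rank are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Finding uses assumed field names; adjust the tags to the real output.
type Finding struct {
	File       string `json:"file"`
	Line       int    `json:"line"`
	Pattern    string `json:"pattern"`
	Severity   string `json:"severity"`
	Confidence string `json:"confidence"`
}

var severityRank = map[string]int{"critical": 0, "high": 1, "medium": 2, "low": 3}

// rank maps a severity to a sort key; unknown severities sort last.
func rank(severity string) int {
	if r, ok := severityRank[severity]; ok {
		return r
	}
	return len(severityRank)
}

// triage decodes findings and orders them most severe first,
// keeping the original order among findings of equal severity.
func triage(raw []byte) ([]Finding, error) {
	var findings []Finding
	if err := json.Unmarshal(raw, &findings); err != nil {
		return nil, fmt.Errorf("decode findings: %w", err)
	}
	sort.SliceStable(findings, func(i, j int) bool {
		return rank(findings[i].Severity) < rank(findings[j].Severity)
	})
	return findings, nil
}

func main() {
	raw := []byte(`[
		{"file":"config.go","line":12,"pattern":"AWS Access Key ID","severity":"high","confidence":"high"},
		{"file":"deploy.sh","line":3,"pattern":"AWS Secret Access Key","severity":"critical","confidence":"medium"}
	]`)
	findings, err := triage(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, f := range findings {
		fmt.Printf("[%s] %s:%d %s\n", f.Severity, f.File, f.Line, f.Pattern)
	}
}
```

A pipeline could then fail the build only when the first finding is at or above a chosen severity.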

One clever architectural choice: EarlyBird separates the scanning logic from the configuration entirely. The core Go binary knows nothing about AWS keys or credit card patterns—all domain knowledge lives in JSON configuration. This means updates to detection rules don't require recompiling or redeploying the tool. In a security context where new token formats and leaked secret patterns emerge constantly, this separation of concerns has real operational value.

Gotcha

EarlyBird's regex-based approach is both its strength and its limitation. Regular expressions are fast, deterministic, and auditable, but they lack semantic understanding. The tool will flag any 20-character string that matches the AWS access key pattern (AKIA followed by 16 uppercase letters or digits), even if it's a commented-out example in documentation or a randomly generated string that happens to match. High false positive rates are the Achilles heel of pattern-matching scanners, and while EarlyBird's label system helps, you'll still spend significant time tuning rules for your codebase.
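The false-positive problem is easy to reproduce with nothing but Go's regexp package and the access key pattern from the module example. The sketch below is independent of EarlyBird itself; the flagged helper and the sample lines are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// accessKeyID is the access key pattern from the module example.
var accessKeyID = regexp.MustCompile(`AKIA[0-9A-Z]{16}`)

// flagged reports whether a line would trigger the pattern.
func flagged(line string) bool {
	return accessKeyID.MatchString(line)
}

func main() {
	lines := []string{
		`awsKey := "AKIAABCDEFGHIJKLMNOP"`,       // a key assigned in code
		`// Example only: AKIAIOSFODNN7EXAMPLE`,  // a placeholder inside a comment
		`checksum := "XAKIA0000000000000000Z"`,   // an unrelated string containing a match
	}
	for _, l := range lines {
		fmt.Printf("flagged=%v  %s\n", flagged(l), l)
	}
}
```

All three lines are flagged, because the unanchored pattern has no notion of comments, placeholders, or surrounding characters. Anchoring with word boundaries would remove the third case, but not the second.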

The tool also shows its age in terms of community momentum. With 769 stars and relatively infrequent commits, this isn't a thriving open-source project with rapid feature development. The last significant update added support for additional CWE categories, but you won't find newer features like entropy analysis, ML-based detection, semantic code analysis, or integration with secret management platforms. If you adopt EarlyBird, expect to maintain your own fork or work within the constraints of what exists today. For teams wanting a vendor-supported, continuously evolving solution, commercial alternatives like GitGuardian or GitHub Advanced Security offer better long-term prospects, at the cost of vendor lock-in and ongoing licensing fees.

Verdict

Use EarlyBird if you need a self-hosted, highly customizable secrets scanner that your security team can tune precisely to your organization's threat model, especially if you're already running Go services and want something you can fork and extend. It's ideal for enterprises with mature security practices who understand their false positive tolerance and are willing to invest engineering time in configuration. The American Express pedigree means the architecture is solid even if the community is small. Skip it if you want an active open-source community, need state-of-the-art detection accuracy out of the box, or prefer a commercial solution with vendor support. Also skip if you're a small team without dedicated security engineering—you'll spend more time tuning patterns than you'll save from catching secrets. In that case, GitLeaks or TruffleHog offer better defaults with less configuration overhead.
