Archaeologit: Why Deleting Secrets From Your Repo Isn't Enough

Hook

That API key you committed and deleted three years ago? It's still in your git history, and attackers know exactly how to find it. One study found that 6% of all GitHub repositories contain at least one leaked secret across their commit history.

Context

The moment you commit sensitive data to a git repository, it becomes part of your permanent history. Developers often discover too late that they've accidentally committed passwords, API keys, or tokens. The natural instinct is to delete the file and commit again, but this approach fundamentally misunderstands how git works—every commit is preserved in the history, and anyone who clones your repository can traverse back through time to find that sensitive data.

This problem multiplies across an organization or individual portfolio. A developer might maintain dozens of repositories over years, with hundreds or thousands of commits. Manually auditing each repository's history for leaked secrets becomes impractical. Archaeologit emerged as a pragmatic solution: a simple bash script that automates the tedious process of cloning repositories and searching through their entire commit history for pattern matches. Created by Peter Jaric, it exemplifies the Unix philosophy of doing one thing well by combining standard tools (git, grep, curl) into a focused security auditing workflow.

Technical Insight

Archaeologit's implementation reveals how much you can accomplish with 150 lines of shell script and a deep understanding of git internals. The core insight is that git log -p dumps the entire patch history of a repository, exposing every line ever added or removed. By piping this output through grep with carefully crafted regex patterns, you can surface historical secrets that no longer exist in the current working tree.
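To see this behavior for yourself, here is a minimal sketch that builds a throwaway repository, commits a fake key, deletes it in a second commit, and then searches the patch history. The file name config.env and the key value are made up for the demo; it assumes only that git and grep are installed.

```shell
#!/usr/bin/env bash
# Demo: a secret that was committed and then deleted is still in history.
# Everything here is hypothetical sample data in a throwaway repository.
set -eu

demo_dir=$(mktemp -d)
git init --quiet "$demo_dir"
cd "$demo_dir"
git config user.email "demo@example.com"
git config user.name "Demo User"

# Commit 1: accidentally commit a (fake) credential.
echo 'API_KEY=fake-secret-12345' > config.env
git add config.env
git commit --quiet -m "Add config"

# Commit 2: "fix" the mistake by deleting the file.
git rm --quiet config.env
git commit --quiet -m "Remove config"

# The working tree no longer has the key, but the patch history does.
in_tree=$(grep -rs 'fake-secret' --exclude-dir=.git . || true)
in_history=$(git log -p --all | grep 'fake-secret' || true)

echo "working tree: ${in_tree:-<nothing>}"
echo "history:"
echo "$in_history"

# Clean up the throwaway repository.
cd - > /dev/null
rm -rf "$demo_dir"
```

The history search prints the key twice, once as the line added by the first commit and once as the line removed by the second, even though the working tree search finds nothing.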

The script begins by using GitHub's REST API to enumerate all repositories for a given user:

USER=$1
PATTERN=$2

# Fetch clone URLs for the user's public repos (first page only, up to 100)
REPOS=$(curl -s "https://api.github.com/users/${USER}/repos?per_page=100" | \
  grep -oP '"clone_url": "\K[^"]+')

for REPO in $REPOS; do
  REPO_NAME=$(basename "$REPO" .git)
  echo "Scanning $REPO_NAME..."

  # Only scan if the clone succeeded; otherwise git log would run in the wrong directory
  if git clone --quiet "$REPO" "/tmp/$REPO_NAME" 2>/dev/null; then
    # Search every patch on every ref; -E enables {n} and ? in the pattern
    git -C "/tmp/$REPO_NAME" log -p --all | grep -iE -e "$PATTERN" && \
      echo "[FOUND] Pattern in $REPO_NAME"
  fi

  rm -rf "/tmp/$REPO_NAME"
done

This straightforward approach clones each repository into /tmp, runs git log -p --all to generate a complete diff history across all branches, and pipes it through grep for pattern matching. The --all flag is crucial—it ensures that secrets hiding in abandoned branches or stale feature work aren't missed.

The real power comes from the regex patterns you provide. A basic search for 'password' will generate noise, but targeted patterns can identify specific credential formats:

# AWS Access Keys (20 alphanumeric characters starting with AKIA)
./archaeologit.sh username 'AKIA[0-9A-Z]{16}'

# Generic API keys (look for common variable assignments)
# Double-quoted so the pattern can contain both quote characters
./archaeologit.sh username "api[_-]?key[\"']?\s*[:=]\s*[\"'][a-zA-Z0-9]{20,}"

# Private keys
./archaeologit.sh username 'BEGIN.*PRIVATE KEY'

# Database connection strings
./archaeologit.sh username 'postgres://.*:.*@'
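Before running a pattern across years of history, it can help to check it against a few lines you already know should and should not match. The following is a small sketch of that idea, assuming a grep that supports -E; the helper names matches and check are hypothetical, and every sample value is fake (the AWS key is the example key from AWS's own documentation).

```shell
#!/usr/bin/env bash
# Sanity-check candidate patterns against known-good and known-bad samples
# before spending time on a full history scan. All sample values are fake.

# matches PATTERN SAMPLE: returns 0 if the extended regex matches the line.
matches() {
  printf '%s\n' "$2" | grep -Eq -e "$1"
}

# check LABEL PATTERN SAMPLE: prints whether the sample matched.
check() {
  if matches "$2" "$3"; then
    echo "match    $1"
  else
    echo "no match $1"
  fi
}

aws_key='AKIA[0-9A-Z]{16}'
private_key='BEGIN.*PRIVATE KEY'
pg_url='postgres://.*:.*@'

check "AWS documentation example key" "$aws_key" 'aws_access_key_id = AKIAIOSFODNN7EXAMPLE'
check "AKIA prefix, too short"        "$aws_key" 'AKIA1234'
check "PEM header"                    "$private_key" '-----BEGIN RSA PRIVATE KEY-----'
check "Postgres URL with password"    "$pg_url" 'postgres://app:hunter2@db.example.com/prod'
check "Postgres URL, no credentials"  "$pg_url" 'postgres://db.example.com/prod'
```

The second and fifth samples are the negative cases: a pattern that matches them is too loose and will add noise to the real scan.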

What makes this approach effective is its simplicity. Unlike sophisticated tools that parse code semantically, Archaeologit treats your repository as a text stream. This means it catches secrets in unusual contexts: embedded in documentation, commented-out code, or test fixtures. Binary files are an exception, because git log -p reports only that a binary file changed, not what it contains. The tradeoff is false positives, but for a security audit, false positives are far preferable to false negatives.

One architectural decision worth noting: the script clones repositories sequentially and processes them one at a time. For a user with dozens of repositories, this can take hours. A more sophisticated implementation might use GNU Parallel or xargs with parallel execution:

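# scan_single_repo.sh is a hypothetical per-repo wrapper around the clone-and-grep step
export PATTERN   # the single quotes below defer expansion to the shell that parallel spawns
curl -s "https://api.github.com/users/${USER}/repos?per_page=100" | \
  grep -oP '"clone_url": "\K[^"]+' | \
  parallel -j 4 './scan_single_repo.sh {} "$PATTERN"'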

However, the sequential approach has advantages for observability: you can watch progress in real time and kill the script early if you find critical issues. It also avoids hammering GitHub's servers or your local disk with concurrent clones.

Gotcha

The single-threaded architecture becomes painful quickly. Scanning a prolific GitHub user with 50+ repositories, each with years of history, can take multiple hours. There's no progress indicator beyond the current repository name, so you're left guessing whether the script is hung or just processing a large repository. For any serious security auditing at scale, you'll want to fork the code and add parallel processing.

The false positive rate is substantial. Searching for 'password' will match variable names, comments explaining authentication, documentation about password policies, and test fixtures with dummy credentials. Every match requires manual review to determine if it's a genuine leak. The script outputs raw matches without context about which commit or file contained the match, making investigation tedious. You'll often find yourself running additional git commands to trace back from the match to the actual commit hash and author.

More critically, Archaeologit only scans public repositories by default. If you need to audit private repositories, you'll need to modify the script to use authenticated API calls and git clone over SSH or HTTPS with credentials.

The tool also has no concept of secret validity. It can't tell if an AWS key has been rotated or if a password has been changed, so you'll spend time investigating historical secrets that may no longer pose a risk.
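The missing commit context noted above can be recovered with git's own pickaxe search rather than by grepping the patch stream. The sketch below is one way to do that, assuming you still have a local clone; find_commits is a hypothetical helper name, not part of Archaeologit.

```shell
#!/usr/bin/env bash
# Sketch: list the commits whose patches add or remove a line matching a
# pattern, so a raw grep hit can be traced to a hash, author, and date.
# find_commits is a hypothetical helper name, not part of Archaeologit.

# find_commits CLONE_DIR REGEX
find_commits() {
  # -G selects commits whose diff has an added or removed line matching REGEX.
  # Output columns: abbreviated hash, author name, date (YYYY-MM-DD), subject.
  git -C "$1" log --all -G"$2" --format='%h %an %ad %s' --date=short
}

# Usage (path is illustrative):
#   find_commits /tmp/some-repo 'AKIA[0-9A-Z]{16}'
```

Adding -p to the git log call prints the matching patches as well, which shows the file name alongside each hit.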

Verdict

Use if: You need a quick, auditable security check of public GitHub repositories for a specific user or organization. Archaeologit excels at one-off security audits where you want complete visibility into what the tool is doing with your data. Its simple bash implementation means you can read and understand every line in minutes, making it trustworthy for sensitive security work. It's also ideal for educational purposes, since studying the code teaches you about git internals and API interactions.

Skip if: You need to scan private repositories without modifying code, require low false-positive rates for automated workflows, need faster processing across dozens of repositories, or want integration with CI/CD pipelines and security dashboards. For production security scanning at scale, tools like gitleaks or truffleHog offer better performance, lower false positives through entropy analysis, and pre-configured rulesets. For prevention rather than detection, implement git-secrets hooks to stop secrets from entering history in the first place.
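To make the prevention idea concrete, here is a minimal pre-commit hook sketch. It is a hand-rolled stand-in that illustrates the mechanism, not git-secrets itself; the pattern list and the staged_secrets helper are assumptions for this example.

```shell
#!/usr/bin/env bash
# Minimal pre-commit sketch (a hand-rolled stand-in, not git-secrets itself):
# block the commit if any staged, added line matches a secret pattern.
# To try it, save as .git/hooks/pre-commit and make it executable.

# Hypothetical pattern list; extend it for your own credential formats.
SECRET_PATTERNS='AKIA[0-9A-Z]{16}|BEGIN.*PRIVATE KEY'

staged_secrets() {
  # Keep added lines only: those starting with "+", minus "+++ b/file" headers.
  git diff --cached -U0 | grep '^+' | grep -v '^+++ ' | grep -E -e "$SECRET_PATTERNS"
}

# A non-zero exit from a pre-commit hook aborts the commit.
if hits=$(staged_secrets); then
  echo "Possible secret in staged changes:" >&2
  echo "$hits" >&2
  exit 1
fi
```

Because the hook inspects only what is staged, it stays fast regardless of repository size, but it can be bypassed with git commit --no-verify, so it complements history scanning rather than replacing it.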
