shhgit: Real-Time Secrets Detection Through Repository Stream Surveillance
Hook
Every 30 seconds, someone pushes AWS credentials to a public GitHub repository. shhgit was built to catch them in real-time by tapping directly into GitHub's event stream—before attackers do.
Context
The developer secrets problem has scaled beyond traditional code review. With millions of commits pushed daily across GitHub, GitLab, and Bitbucket, credentials leak constantly—database passwords in config files, API keys in source code, private keys in Docker images. By the time most scanning tools run their nightly audits, those secrets have been public for hours, indexed by search engines, and potentially harvested by credential scrapers.
Traditional secrets scanners operate in batch mode: clone a repository, scan its history, generate a report. This works for internal audits but fails for threat intelligence and external attack surface monitoring. Security teams needed something fundamentally different—a tool that could monitor the firehose of public repository activity in real-time, catching secrets within minutes of exposure. shhgit emerged in 2019 to fill this gap, built by eth0izzle as a Go-based streaming scanner that consumes public Git APIs and applies pattern matching at scale. Though now unmaintained, its architecture remains instructive for understanding real-time security monitoring.
Technical Insight
shhgit's architecture centers on three components: an API consumer that taps repository event streams, a multi-strategy detection engine, and a configurable output pipeline. The tool can run in two modes: streaming mode (monitoring public repositories) or local mode (scanning directories). The streaming mode is where it differentiates itself from traditional scanners.
When running in streaming mode, shhgit connects to GitHub's Events API (or GitLab/Bitbucket equivalents) and consumes the public event feed. For every push event, it clones the repository to temporary storage, scans the files in the checkout, then deletes the clone. This happens concurrently across multiple goroutines, with a configurable limit to prevent resource exhaustion. The scanning itself uses a two-pronged approach: signature-based detection with 150+ regex patterns, and entropy-based detection for high-randomness strings that might be machine-generated secrets.
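The bounded-concurrency part of that loop can be sketched in Go as a fixed pool of workers draining a channel of events. This is a minimal sketch, not shhgit's actual code: the `PushEvent` type, the `scanFunc` callback, and `runWorkers` are hypothetical names, and the clone-scan-delete step is stubbed out.

```go
package main

import (
	"fmt"
	"sync"
)

// PushEvent is a hypothetical, trimmed-down view of a public push event.
type PushEvent struct {
	RepoURL string
}

// scanFunc stands in for "clone to temp storage, scan, delete the clone".
type scanFunc func(ev PushEvent) []string

// runWorkers drains events with a fixed number of goroutines and returns
// every finding once the channel is closed and all workers have exited.
func runWorkers(events <-chan PushEvent, workers int, scan scanFunc) []string {
	var (
		mu       sync.Mutex
		findings []string
		wg       sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range events {
				found := scan(ev)
				mu.Lock() // findings is shared across workers
				findings = append(findings, found...)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return findings
}

func main() {
	events := make(chan PushEvent, 3)
	for _, u := range []string{"repo-a", "repo-b", "repo-c"} {
		events <- PushEvent{RepoURL: u}
	}
	close(events)

	// Fake scanner: reports one finding per repository.
	fake := func(ev PushEvent) []string { return []string{"finding in " + ev.RepoURL} }
	fmt.Println(len(runWorkers(events, 2, fake)), "findings") // prints "3 findings"
}
```

The worker count is the knob that caps how many clones are on disk at once; the order of findings is not deterministic because workers finish independently.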
Here's how the signature matching works in practice. The tool loads patterns from config.yaml, where each signature names the part of the file to inspect, a regex pattern, and optional validation settings such as an entropy threshold:
    AWS API Key:
      part: contents
      pattern: (?i)aws(.{0,20})?['"][0-9a-zA-Z\/+]{40}['"]
      entropy: 3.5
      description: AWS API key detected
    Slack Webhook:
      part: contents
      pattern: https://hooks\.slack\.com/services/T[a-zA-Z0-9_]{8}/B[a-zA-Z0-9_]{8}/[a-zA-Z0-9_]{24}
      description: Slack webhook URL
The entropy check is particularly clever. Pattern matching alone generates false positives from example code, test fixtures, and documentation, so shhgit also calculates Shannon entropy for matched strings. A genuinely random API key has high entropy (typically >4.5 bits per character), while placeholder text like YOUR_API_KEY_HERE repeats a handful of characters and scores lower (about 3.3). Here's the conceptual implementation:
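    // Requires: import "math"

    // calculateEntropy returns the Shannon entropy of data in bits per character.
    func calculateEntropy(data string) float64 {
        if data == "" {
            return 0
        }
        // Count how often each rune occurs.
        freq := make(map[rune]float64)
        length := 0.0
        for _, char := range data {
            freq[char]++
            length++ // count runes, not bytes, so multi-byte characters are not over-counted
        }
        var entropy float64
        for _, count := range freq {
            probability := count / length
            entropy -= probability * math.Log2(probability)
        }
        return entropy
    }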
The filtering system prevents the tool from drowning in noise. You can blacklist specific strings (like "example.com"), file extensions (like .md for documentation), paths (like test/ directories), and even repository characteristics (minimum stars, maximum size). This is crucial for real-time monitoring because GitHub's event stream includes thousands of toy projects, forks, and example repositories that would otherwise flood your results.
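A rough Go sketch of those filter layers follows. The `Filters` and `Repo` types and their field names are hypothetical, chosen to mirror the categories listed above rather than shhgit's actual configuration keys.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Filters groups the kinds of exclusions described above.
// Field names are illustrative, not shhgit's config keys.
type Filters struct {
	BlacklistedStrings    []string
	BlacklistedExtensions []string // lower-case, with leading dot
	BlacklistedPaths      []string // directory prefixes such as "test/"
	MinStars              int
	MaxRepoSizeKB         int
}

// Repo carries only the metadata the repository-level filter needs.
type Repo struct {
	Stars  int
	SizeKB int
}

// skipRepo rejects repositories that are too obscure or too large.
func (f Filters) skipRepo(r Repo) bool {
	return r.Stars < f.MinStars || r.SizeKB > f.MaxRepoSizeKB
}

// skipFile rejects files by extension or by a blacklisted directory component.
func (f Filters) skipFile(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, e := range f.BlacklistedExtensions {
		if ext == e {
			return true
		}
	}
	slashed := "/" + filepath.ToSlash(path)
	for _, p := range f.BlacklistedPaths {
		if strings.Contains(slashed, "/"+p) {
			return true
		}
	}
	return false
}

// skipMatch rejects a matched string that contains a blacklisted substring.
func (f Filters) skipMatch(match string) bool {
	for _, s := range f.BlacklistedStrings {
		if strings.Contains(match, s) {
			return true
		}
	}
	return false
}

func main() {
	f := Filters{
		BlacklistedStrings:    []string{"example.com"},
		BlacklistedExtensions: []string{".md"},
		BlacklistedPaths:      []string{"test/"},
		MinStars:              1,
		MaxRepoSizeKB:         50000,
	}
	fmt.Println(f.skipRepo(Repo{Stars: 0, SizeKB: 10}))
	fmt.Println(f.skipFile("docs/README.md"), f.skipFile("src/main.go"))
	fmt.Println(f.skipMatch("postgres://u:p@example.com/db"))
}
```

Applying the cheapest checks first (repository metadata, then path, then matched string) avoids cloning or reading content that would be discarded anyway.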
For output, shhgit supports multiple sinks: a web dashboard (when run with Docker Compose), CSV files for batch processing, and webhooks for integration with SIEM systems or Slack. The webhook implementation is straightforward—when a secret is found, shhgit POSTs a JSON payload with the finding details, allowing teams to pipe detections directly into their incident response workflows.
The local scanning mode uses the same detection engine but operates on filesystem paths instead of API streams. This makes shhgit useful in CI/CD pipelines, though it lacks the pre-commit hook integration and Git history scanning that more modern tools provide. When scanning locally, it walks the directory tree, applies the same signature and entropy checks, and outputs findings—essentially functioning as a lightweight secrets scanner without the real-time streaming component.
One architectural limitation: shhgit doesn't perform deep Git history scanning. It only examines files in the current state of a repository. This is fine for real-time monitoring (you care about what was just pushed), but inadequate for comprehensive audits where secrets might be buried 500 commits deep. Tools like TruffleHog solve this by walking the entire commit graph, but that approach is computationally expensive and incompatible with real-time streaming at scale.
Gotcha
The elephant in the room: shhgit is explicitly unmaintained, with a warning banner in the README stating "This project is no longer actively maintained." For security tooling you depend on in production, this is disqualifying. Secrets detection patterns need constant updates as new services launch and authentication patterns evolve. A scanner with 2019-era signatures will miss credentials for services and token formats that didn't exist then: Vercel tokens, Cloudflare API keys with new formats, GitHub fine-grained personal access tokens.
Beyond maintenance status, the tool has practical operational challenges. GitHub API rate limits are the primary constraint: an authenticated token allows 5,000 requests per hour (unauthenticated clients get only 60), which sounds generous until you realize that continuous polling of the event stream, plus any per-repository metadata lookups, burns through quota quickly. Even with tokens configured, you're sampling the firehose rather than consuming it completely. The tool also requires significant disk I/O and bandwidth since it clones entire repositories to scan them; a large monorepo can take minutes to process, during which newer events queue up.
False positives remain a challenge despite the entropy filtering. Test files, example code, documentation, and configuration templates all trigger matches. The blacklist system helps, but it requires manual tuning based on your findings. Expect to spend the first week maintaining your config.yaml to filter noise specific to your monitoring scope. Additionally, the tool has no concept of secret validation—it can't tell if an AWS key is actually valid or just a decommissioned credential sitting in a deprecated repository from 2015.
Verdict
Use shhgit if you're doing short-term OSINT or threat intelligence work where you need to monitor public repositories for your organization's leaked credentials, and you understand you're running unmaintained code. It's also reasonable for one-off local scans where you want a quick audit without installing heavier tooling. The real-time streaming capability and 150+ signatures provide immediate value for these use cases.

Skip shhgit for any production security scanning, CI/CD integration, or ongoing monitoring. The unmaintained status is a security liability: you're trusting outdated patterns to catch modern secrets. Instead, migrate to TruffleHog (better history scanning and active development), Gitleaks (faster, better CI/CD integration), or GitGuardian (commercial but superior accuracy).

For personal projects or learning how streaming secret detection works architecturally, shhgit's codebase remains educational, but don't deploy it as a security control.