Hound: How Trigram Indexing Delivers Sub-Second Code Search Without Elasticsearch
Hook
While developers wait seconds for GitHub's search to scan a single repository, Hound can grep through 50 repositories with millions of lines of code in under 200 milliseconds. The secret? A decades-old algorithm that Google barely talks about anymore.
Context
Code search is a surprisingly hard problem. When your engineering team grows beyond a handful of repositories, finding where a function is called or tracking down a configuration pattern becomes painful. GitHub's built-in search is slow and has limited regex support. GitLab's search chokes on large result sets. Your IDE can only search what you have checked out locally. Teams often resort to cloning everything and running grep, which works but doesn't scale—good luck searching across 50 repositories when you can't remember which one contains that authentication middleware.
The obvious solution is a general-purpose full-text search engine like Elasticsearch or Solr. But these require significant infrastructure: Elasticsearch clusters, memory tuning, sharding strategies, and dedicated DevOps attention. For small-to-medium teams who just want fast regex search without the operational burden, this feels like using a chainsaw to slice bread. Hound emerged from Etsy's engineering team in 2014 as a lightweight alternative: fast code search with minimal dependencies, deployed as a single binary that just works.
Technical Insight
Hound's speed comes from trigram indexing, the technique behind Google Code Search, which Russ Cox (of Go fame) wrote up and released as open source around the time the public service was shut down. Instead of parsing code or building abstract syntax trees, a trigram index breaks text into overlapping three-character sequences. The string "func" generates two trigrams: "fun" and "unc". Every file is indexed by its trigram set, and a search uses boolean operations on these sets to eliminate files that can't possibly match before running the actual regex.
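To make the trigram idea concrete, here is a minimal sketch of trigram extraction. The function name trigrams is hypothetical, and slicing by bytes rather than by Unicode characters is a simplifying assumption, not a statement about Hound's internals.

```go
package main

import "fmt"

// trigrams returns the distinct overlapping three-byte substrings of s,
// in order of first appearance. Strings shorter than three bytes yield none.
// Byte-based slicing is an assumption made to keep the sketch short.
func trigrams(s string) []string {
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		t := s[i : i+3]
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(trigrams("func"))    // [fun unc]
	fmt.Println(trigrams("timeout")) // [tim ime meo eou out]
}
```

A file's trigram set is simply the union of this output over its contents, which is what makes the index cheap to build.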
Here's why this matters: with traditional grep, searching for the regex handleRequest.*timeout across a million-line codebase means scanning every file. With trigrams, Hound first narrows the search to files that contain all of "han", "and", "ndl", "dle"... and "tim", "ime", "meo", "eou", "out". For a selective query this can rule out the vast majority of files before regex evaluation even starts. The index is essentially a map from each trigram to the list of files that contain it: simple, fast, and disk-friendly.
The architecture is refreshingly minimal. The backend (houndd) is a Go binary that reads a JSON config file defining repositories to index:
{
    "max-concurrent-indexers": 2,
    "dbpath": "data",
    "repos": {
        "hound-search/hound": {
            "url": "https://github.com/hound-search/hound.git",
            "ms-between-poll": 30000,
            "exclude-dot-files": true
        },
        "kubernetes/kubernetes": {
            "url": "https://github.com/kubernetes/kubernetes.git",
            "url-pattern": {
                "base-url": "https://github.com/kubernetes/kubernetes/blob/master{path}{anchor}",
                "anchor": "#L{line}"
            }
        }
    }
}
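For tooling that needs to read or validate such a file, the keys map directly onto Go structs. The sketch below mirrors only the keys shown in the example; the type names Config, Repo, and URLPattern and the helper parseConfig are hypothetical and are not taken from Hound's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// URLPattern mirrors the "url-pattern" object from the example config.
type URLPattern struct {
	BaseURL string `json:"base-url"`
	Anchor  string `json:"anchor"`
}

// Repo mirrors one entry under "repos". URLPattern stays nil when omitted.
type Repo struct {
	URL             string      `json:"url"`
	MsBetweenPoll   int         `json:"ms-between-poll"`
	ExcludeDotFiles bool        `json:"exclude-dot-files"`
	URLPattern      *URLPattern `json:"url-pattern"`
}

// Config mirrors the top-level keys of the example config.
type Config struct {
	MaxConcurrentIndexers int             `json:"max-concurrent-indexers"`
	DBPath                string          `json:"dbpath"`
	Repos                 map[string]Repo `json:"repos"`
}

// parseConfig decodes a config document; unknown keys are ignored.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

const sample = `{
    "max-concurrent-indexers": 2,
    "dbpath": "data",
    "repos": {
        "hound-search/hound": {
            "url": "https://github.com/hound-search/hound.git",
            "ms-between-poll": 30000,
            "exclude-dot-files": true
        }
    }
}`

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for name, repo := range cfg.Repos {
		fmt.Printf("%s -> %s (poll every %d ms)\n", name, repo.URL, repo.MsBetweenPoll)
	}
}
```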
Each repository gets cloned to the dbpath directory, indexed into trigram files, and polled for updates. The REST API is minimal—essentially just /api/v1/search?q=pattern&repos=*&i=nope where i=nope means case-sensitive. The React frontend is a static bundle that calls this API and renders results with syntax highlighting and links back to source.
What makes this architecture valuable for small teams is that each instance is self-contained and shares no state with any other. Unlike Elasticsearch, which requires cluster coordination and replication, a Hound instance needs nothing beyond its own config and local disk. Need more capacity? Spin up another instance with the same config. No sharding logic, no master election, no split-brain scenarios. The tradeoff is that scaling means full data replication (if you're indexing 100GB of code, every instance needs 100GB), but for most teams with under 100 repositories, this is perfectly acceptable on modern hardware.
The polling model is both a strength and a weakness. Every ms-between-poll milliseconds (30 seconds by default), houndd pulls each repository from its remote (git pull, for a Git repository) and re-indexes what changed. This is simple, and the same approach carries over to other version control systems, but it means search results can be stale, and polling storms can spike CPU usage when many repos update simultaneously. Out of the box, indexing is driven by this timer rather than by webhooks or other events. For teams with relatively stable codebases and tolerance for a 30-second lag, this is fine. For teams deploying dozens of times per day who need instant search results, it's frustrating.
Editor integrations reveal Hound's real-world utility. The Vim plugin lets you search from your editor and populate the quickfix list, the VSCode extension adds a search panel, and the Emacs integration hooks into helm. This isn't a tool that forces you into a web UI—it's infrastructure that makes your existing workflow faster. The fact that these integrations exist and are maintained suggests genuine adoption beyond weekend hack projects.
Gotcha
Hound has three significant limitations that aren't obvious until you're in production. First, there's no built-in TLS support. The Go binary serves plain HTTP, so production deployments require nginx or Apache as a reverse proxy for HTTPS. This isn't technically difficult, but it's extra configuration and another failure point. For security-conscious teams, the same goes for access control: Hound doesn't handle authentication or authorization itself, so you're either bolting that on via the reverse proxy or accepting that anyone with network access can search your code.
Second, the polling model breaks down at scale. With 100 repositories polling every 30 seconds, you're running 200 git operations per minute. If each poll takes 2-3 seconds (not uncommon with large repos or network latency), that adds up to 400-600 seconds of git work per wall-clock minute, so several polls are in flight at any moment and CPU spikes become routine. You can increase ms-between-poll, but then search staleness increases proportionally. There's no event-driven triggering: every repository is contacted on schedule whether or not anything changed, so even idle repos cost a round trip. For large monorepos or organizations with hundreds of active repositories, this becomes a resource management problem.
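The arithmetic behind that estimate is easy to rerun for your own numbers. The sketch below assumes a deliberately simple model in which every repository is polled on a fixed schedule regardless of how long each poll takes; the function names are hypothetical, and Hound's real scheduling may differ.

```go
package main

import "fmt"

// pollsPerMinute returns how many polls a fleet of repos generates per minute
// when each repo is polled every msBetweenPoll milliseconds.
func pollsPerMinute(repos, msBetweenPoll int) float64 {
	return float64(repos) * 60000.0 / float64(msBetweenPoll)
}

// avgInFlight returns the average number of polls running at once, given how
// long one poll takes. It is busy seconds per minute divided by 60.
func avgInFlight(repos, msBetweenPoll int, secondsPerPoll float64) float64 {
	return pollsPerMinute(repos, msBetweenPoll) * secondsPerPoll / 60.0
}

func main() {
	// The scenario from the text: 100 repos, 30-second interval.
	fmt.Println(pollsPerMinute(100, 30000)) // 200
	fmt.Println(avgInFlight(100, 30000, 3)) // 10

	// Doubling the interval halves the load, at the cost of staler results.
	fmt.Println(pollsPerMinute(100, 60000)) // 100
}
```

Under this model, three-second polls across 100 repositories keep about ten git operations running at all times, which is why raising ms-between-poll is usually the first knob to turn.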
Third, Hound does pure text matching with no semantic understanding. It doesn't know what a function definition is versus a function call. It can't find "all usages" of a symbol across languages. It won't suggest similar results or handle typos. If you're used to IDE-level code intelligence or language servers, Hound will feel primitive. It's grep with an index, not an intelligent code analysis tool. This is by design—semantic indexing requires orders of magnitude more complexity—but it means Hound complements rather than replaces IDE features.
Verdict
Use Hound if you're a team of 5-50 engineers with 10-100 repositories who need fast cross-repo search without operational overhead. It excels when you're comfortable with regex, your repos update at a measured pace (not hundreds of deploys daily), and you want something you can deploy in an afternoon and forget about. It's particularly valuable if you're using GitHub Enterprise or GitLab self-hosted where built-in search is lackluster, or if you have a mix of version control systems that need unified search. The editor integrations make it genuinely useful for daily development, not just occasional archaeological digs through legacy code.
Skip Hound if you need real-time indexing (under 10-second lag), semantic code intelligence (go-to-definition, find-all-references), or you're operating at serious scale (hundreds of repos, millions of searches per day). Also skip it if you don't control your infrastructure: Hound requires persistent storage for indexes and background polling processes, so it's not suitable for serverless architectures or heavily locked-down environments. Finally, if you're already paying for GitHub or GitLab and their search is good enough, don't add infrastructure complexity just for marginal speed improvements.