Hound: Why Trigram Indexing Makes Code Search 100x Faster Than grep
Hook
Most code search engines scan entire files to find matches. Hound doesn’t. Using trigram indexing borrowed from Russ Cox’s research, it can search millions of lines of code in milliseconds—a performance level achievable even on modest hardware.
Context
Before tools like Hound, developers searching across multiple repositories had limited options: grep through manually cloned repos (slow and tedious), run heavyweight enterprise tools like Elasticsearch with code analyzers (complex infrastructure), or rely on GitHub's search (which doesn't cover private, on-premises code). Each approach failed in a different way. Command-line tools like grep required local checkouts and offered no web interface. Enterprise search engines needed dedicated ops teams. Cloud-hosted solutions locked you into specific platforms.
Hound emerged from this gap—teams wanted GitHub-quality search for their private codebases without the infrastructure overhead. Built by engineers who needed fast, self-hosted search across dozens of repositories, Hound makes a key architectural bet: use trigram indexing based on Russ Cox's research on regular expression matching. This approach, proven at scale by Google Code Search, enables regexp searches without scanning full text. The result is a Go backend that keeps an up-to-date index for each repository and a static React frontend that talks to it—nothing more. No JVM, no Elasticsearch cluster, no complex configuration.
Technical Insight
Hound's speed comes from the indexing strategy described in Russ Cox's article on regular expression matching with a trigram index. The core idea: break source code into three-character sequences (trigrams) and build an inverted index over them. When you search for a pattern, Hound identifies trigrams that must appear in any matching text, looks up their locations in the index, then scans only those regions with the full regexp engine.
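To make the mechanics concrete, here is a toy sketch of trigram filtering. This is not Hound's actual implementation: the document corpus, function names, and query below are invented for illustration, and a real engine builds the index once per repository rather than per query. It shows the two-phase shape of the technique: intersect posting lists to get candidates, then run the regexp only on candidates.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// trigrams returns the distinct 3-byte substrings of s,
// in order of first appearance.
func trigrams(s string) []string {
	seen := map[string]bool{}
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		if t := s[i : i+3]; !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

// search finds docs matching pattern, using the trigrams of a
// required literal to narrow the candidate set before invoking
// the regexp engine.
func search(docs []string, literal, pattern string) []int {
	// Build an inverted index: trigram -> set of doc IDs.
	// (In a real engine this is built once, ahead of time.)
	index := map[string]map[int]bool{}
	for id, d := range docs {
		for _, t := range trigrams(d) {
			if index[t] == nil {
				index[t] = map[int]bool{}
			}
			index[t][id] = true
		}
	}

	// Every match must contain all trigrams of the literal:
	// intersect the posting lists to get candidates.
	candidates := map[int]bool{}
	for id := range docs {
		candidates[id] = true
	}
	for _, t := range trigrams(literal) {
		next := map[int]bool{}
		for id := range index[t] {
			if candidates[id] {
				next[id] = true
			}
		}
		candidates = next
	}

	// Only the surviving candidates are scanned with the full regexp.
	re := regexp.MustCompile(pattern)
	var hits []int
	for id := range candidates {
		if re.MatchString(docs[id]) {
			hits = append(hits, id)
		}
	}
	sort.Ints(hits)
	return hits
}

func main() {
	docs := []string{
		"func parseConfig(path string) error",
		"let config = require('./config')",
		"print('hello world')",
	}
	fmt.Println(search(docs, "parseConfig", `parseConfig\(`)) // prints: [0]
}
```

Note how the second document, which contains "config" but not "parseConfig", is eliminated by the trigram intersection before the regexp ever runs; that filtering step is where the speedup comes from.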
The system consists of two binaries: houndd (the server daemon) and hound. The server handles repository polling, indexing, and answering searches through a minimal API. Configuration is a straightforward JSON file listing the repositories to index:
```json
{
  "dbpath": "db",
  "repos": {
    "MyProject": {
      "url": "https://github.com/myorg/myproject.git",
      "vcs-config": {
        "ref": "main"
      },
      "ms-between-poll": 30000
    },
    "LegacyApp": {
      "url": "git@github.com:myorg/legacy.git",
      "vcs": "git",
      "ms-between-poll": 60000
    }
  }
}
```
Each repository entry specifies a VCS type (Git, Mercurial, SVN, Bazaar, or local directory), polling interval, and optional branch references. The dbpath points to where Hound stores its trigram indexes. On startup, houndd clones repositories, builds indexes, then polls for updates at configured intervals—defaulting to every 30 seconds.
The VCS-agnostic design is worth noting. Hound supports Git (default), Mercurial ("vcs": "hg"), SVN ("vcs": "svn"), Bazaar ("vcs": "bzr"), and local directories ("vcs": "local"). For private repositories, you can use SSH URLs like git@github.com:foo/bar.git (assuming SSH keys are configured), the file:// protocol for local clones, or the local VCS driver for non-repository directories with "watch-changes": true to trigger re-indexing when files change.
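A repo entry using the local driver might look like the fragment below. The repository name and path are made up, and the placement of "watch-changes" inside "vcs-config" follows the upstream example config; verify it against your Hound version:

```json
"LocalScratch": {
  "url": "file:///home/dev/scratch",
  "vcs": "local",
  "vcs-config": {
    "watch-changes": true
  }
}
```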
The React frontend is intentionally minimal—it’s a static application that talks to the Go backend. This separation means you can run Hound behind Apache or nginx, or scale components separately. The backend keeps indexes persistent in the dbpath directory, so restarts don’t require full re-indexing unless repository contents changed.
Performance tuning happens through two main knobs: ms-between-poll controls index freshness versus system load, and max-concurrent-indexers limits how many repositories index simultaneously. For organizations with hundreds of repositories, you’d increase polling intervals and limit concurrent indexers to avoid resource spikes.
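As a hedged illustration, a config tuned for many repositories might cap concurrency with the top-level max-concurrent-indexers key and stretch the poll interval. The values and repository below are arbitrary starting points, not recommendations:

```json
{
  "max-concurrent-indexers": 2,
  "dbpath": "db",
  "repos": {
    "BigMonorepo": {
      "url": "https://github.com/myorg/monorepo.git",
      "ms-between-poll": 300000
    }
  }
}
```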
Deployment is deliberately simple. Install Go 1.16+ and npm, run make, and you'll find the resulting binaries in .build/bin/. Copy the binaries and config to your server and start houndd. That's it. No database servers, no container orchestration, no complex dependencies. The Docker image follows the same philosophy: it's a single container that mounts your config and runs the server.
Gotcha
Hound makes specific tradeoffs that become limitations in certain contexts. First, there’s no native TLS support. The README states that “most users simply run Hound behind either Apache or nginx” for TLS. The server binds to HTTP only—you must run a reverse proxy for HTTPS in production. This isn’t unreasonable for most deployments, but it means one more component to configure.
Second, repository updates are polling-based exclusively. By default, Hound polls the URL in the config for updates every 30 seconds. You can override this with the ms-between-poll key per repository. There’s no mention of webhook support for instant updates when code is pushed. For rapidly changing repositories, this creates windows where search results may be stale.
Third, Windows support is explicitly noted as unsupported in the README: "Hound on Windows is not supported but we've heard it compiles and runs just fine (although it helps to exclude your data folder from Windows Search Indexer)." The project is only tested on macOS and CentOS, though it should work on any *nix system. If Windows is your primary deployment target, expect unofficial territory.
Finally, the README makes no mention of access controls, audit logs, or multi-tenancy features. It appears to be a single-tenant search engine where everyone with access to the web interface can search everything configured. For organizations needing fine-grained permissions or compliance features, you’ll need to layer those on externally.
Verdict
Use Hound if you need fast, self-hosted code search across multiple private repositories without infrastructure complexity. It excels for teams who want powerful regexp search on-premises with minimal operational overhead—just Go 1.16+ as a build requirement. Its trigram index makes regexp searches fast without scanning every file. The simple deployment model (Go binary + JSON config) means you can run it on a single VM or developer workstation. It's ideal when you're already running Apache/nginx for reverse proxying, don't need real-time indexing, and want something that just works.
Skip it if you require enterprise features like granular access controls or audit trails, which are not mentioned in the documentation. Also avoid if you need instant indexing via webhooks rather than polling-based updates, if Windows is your primary deployment platform (officially unsupported), or if you need TLS built directly into the server rather than via reverse proxy. For teams needing code intelligence features beyond text search, more feature-rich tools may be worth the added complexity—but for straightforward multi-repo regexp search with VCS flexibility (Git, Mercurial, SVN, Bazaar, local), Hound’s simplicity is its core strength.