
QMD: Building a Three-Stage Hybrid Search Engine That Runs Entirely on Your Laptop

Hook

Most developers searching their notes use grep or basic full-text search, while billion-dollar companies deploy semantic search with LLMs and vector databases. QMD brings the latter’s sophistication to a single-binary CLI tool that runs entirely on your laptop.

Context

The knowledge management problem has bifurcated. On one side, you have simple tools—grep, Spotlight, basic search bars—that find exact matches but miss semantically related content. On the other, you have enterprise-grade systems with vector databases, embedding models, and re-rankers that understand meaning but require cloud APIs, ongoing costs, and sending your private data to third parties.

QMD emerged from the recognition that the hardware gap has closed. Modern laptops can run quantized LLMs locally, and GGUF models have made sophisticated NLP capabilities accessible without GPU clusters. The question wasn’t whether local semantic search was possible, but whether someone would implement the full retrieval pipeline—query expansion, hybrid search, fusion algorithms, and re-ranking—in a way that “just works” for personal knowledge bases. For developers building AI agents through Model Context Protocol (MCP) or those simply wanting better search over meeting notes and documentation, QMD represents a complete solution that doesn’t phone home.

Technical Insight

System architecture (auto-generated diagram): the user query goes through LLM query expansion, producing the original query (weighted 2x) plus two variant queries. Each of the three queries runs through both BM25/FTS5 search and vector search. The six result sets are merged by Reciprocal Rank Fusion (k=60, original boost 2x), the top 30 results are passed to LLM re-ranking (qwen3-reranker), and a position-aware blend (75% RRF weight for top ranks) produces the final results.

QMD’s architecture is a masterclass in retrieval pipeline design. Rather than choosing between keyword and semantic search, it orchestrates three complementary techniques: BM25 full-text search via SQLite’s FTS5, vector semantic search using embeddings, and LLM-based re-ranking. The sophistication lies in how these stages interact.

The pipeline begins with query expansion. When you search for “database performance issues,” QMD uses a local LLM to generate two alternative phrasings—perhaps “slow database queries” and “database optimization problems.” This creates three queries total. Each query runs through both BM25 and vector search, producing six result sets. Here’s what the search flow looks like:

// Simplified conceptual flow
const originalQuery = "database performance issues";
const expandedQueries = await generateVariants(originalQuery);
// Returns: ["slow database queries", "database optimization problems"]

const allQueries = [originalQuery, ...expandedQueries];
const resultSets = [];

for (const query of allQueries) {
  const bm25Results = await searchBM25(query);
  const vectorResults = await searchVectors(query);
  resultSets.push({ query, bm25Results, vectorResults });
}

// Apply Reciprocal Rank Fusion with 2x weight for original query
const fusedResults = reciprocalRankFusion(resultSets, {
  originalQueryBoost: 2.0,
  k: 60  // RRF constant
});

const top30 = fusedResults.slice(0, 30);

// Re-rank using qwen3-reranker with logprobs
const reranked = await llmRerank(originalQuery, top30);

// Blend RRF and re-ranker scores over the same top-30 candidates:
// 75% RRF weight for the top 3, decreasing for lower ranks
const finalResults = positionAwareBlend(top30, reranked);

The 2x weighting on the original query is crucial. Query expansion risks drifting from user intent—you wanted “database performance,” not a general search about databases. By doubling the original query’s contribution to the Reciprocal Rank Fusion score, QMD prevents overfitting to expanded variants while still benefiting from broader recall.
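To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion with a per-list weight. The `RankedList` type, the document IDs, and the function shape are illustrative assumptions rather than QMD's actual internals; the constants (k = 60, original weight 2) match the values described above.

```typescript
// Illustrative RRF sketch; not QMD's actual implementation.
type RankedList = { docIds: string[]; weight: number };

function reciprocalRankFusion(lists: RankedList[], k = 60): [string, number][] {
  const scores = new Map<string, number>();
  for (const { docIds, weight } of lists) {
    docIds.forEach((id, rank) => {
      // Each list contributes weight / (k + rank + 1) for a document it ranks
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  }
  // Sort by fused score, highest first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Original-query result lists get weight 2, expanded-variant lists weight 1
const fused = reciprocalRankFusion([
  { docIds: ["a", "b", "c"], weight: 2 }, // original query, BM25
  { docIds: ["a", "b", "d"], weight: 2 }, // original query, vector
  { docIds: ["d", "c"], weight: 1 },      // variant query, BM25
]);
// "a" and "b" win because both original-query lists rank them highly
```

Because k = 60 dominates the denominator, RRF rewards consistent presence across lists more than a single first-place finish, which is why doubling the original query's weight is enough to anchor user intent without discarding the variants' extra recall.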

The re-ranking stage uses a lightweight model (qwen3-reranker) that evaluates each document’s relevance using logprobs rather than generating text. This is significantly faster than full LLM generation and provides calibrated probability scores. But QMD doesn’t blindly trust the re-ranker either. The position-aware blending gives 75% weight to RRF scores for the top three results, gradually decreasing for lower ranks. This prevents the re-ranker from completely overriding strong keyword matches—if a document contains your exact search terms and ranks highly in BM25, it won’t get buried just because the re-ranker prefers something else.
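A position-aware blend can be sketched as follows. The decay schedule below (75% RRF weight for the top three ranks, sliding toward 50/50 further down) is an assumed interpolation for illustration; QMD's actual schedule may differ.

```typescript
// Assumed blending schedule for illustration; not QMD's exact weights.
// Both score arrays are aligned by RRF rank and assumed normalized to [0, 1].
function positionAwareBlend(rrfScores: number[], rerankScores: number[]): number[] {
  return rrfScores.map((rrf, rank) => {
    // 75% RRF weight for ranks 0-2, then decay by 5% per rank, floored at 50%
    const rrfWeight = rank < 3 ? 0.75 : Math.max(0.5, 0.75 - 0.05 * (rank - 2));
    return rrfWeight * rrf + (1 - rrfWeight) * rerankScores[rank];
  });
}

// A strong keyword match at rank 0 keeps most of its RRF score even when the
// re-ranker scores it poorly (0.2), so it cannot be buried outright
const blended = positionAwareBlend([1.0, 0.9, 0.8, 0.7], [0.2, 0.9, 0.9, 0.9]);
```

The design choice here is to treat the re-ranker as a tiebreaker at the top of the list and a stronger voice further down, where BM25 and vector rankings are noisier anyway.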

The hierarchical context system is where QMD becomes genuinely useful for AI agents. When indexing documents, you can attach metadata about the collection:

qmd add ./meeting-notes \
  --context "Weekly engineering team meetings from Q4 2024. \
             Participants: backend team. Topics: API redesign, \
             database migration planning, incident retrospectives."

This context propagates to search results. When an LLM agent retrieves documents through QMD’s MCP server, it receives not just the matching snippets but also the contextual metadata. The agent knows these are meeting notes from a specific team and timeframe, enabling more sophisticated reasoning about whether the information is relevant, current, or authoritative.
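As a concrete picture of what that propagation could look like, here is a hypothetical result shape. The field names and the `formatForAgent` helper are illustrative assumptions, not QMD's actual MCP schema.

```typescript
// Hypothetical result shape for illustration; not QMD's actual MCP schema.
interface QmdSearchResult {
  path: string;               // matching document
  snippet: string;            // matched excerpt
  score: number;              // blended relevance score
  collectionContext?: string; // the --context metadata, if set at index time
}

// Render a result so the agent sees the collection context alongside the hit
function formatForAgent(r: QmdSearchResult): string {
  const ctx = r.collectionContext ? `\n[Collection: ${r.collectionContext}]` : "";
  return `${r.path} (score ${r.score.toFixed(2)})${ctx}\n${r.snippet}`;
}

const msg = formatForAgent({
  path: "meeting-notes/2024-11-05.md",
  snippet: "Agreed to phase the database migration over two sprints...",
  score: 0.82,
  collectionContext: "Weekly engineering team meetings from Q4 2024",
});
```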

Under the hood, QMD stores everything in SQLite—documents, embeddings, and metadata. It uses node-llama-cpp to run GGUF models locally, meaning you can swap in different embedding models or re-rankers by simply pointing to different model files. The MCP server exposes QMD’s search capabilities to AI agents with optional HTTP transport, keeping models loaded in VRAM between requests to avoid cold-start penalties that would make agent interactions painfully slow.

Gotcha

The local-first architecture has real constraints. You need enough RAM to load embedding and re-ranking models, typically 2-8GB depending on model size. Embedding a large document collection is time-consuming—expect minutes or hours for tens of thousands of documents on first indexing, though incremental updates are much faster. If you’re embedding 100GB of documents or need millisecond query latency at scale, QMD will struggle.

Search quality is bounded by model quality. GGUF quantized models are impressive but still inferior to state-of-the-art cloud embeddings like OpenAI’s text-embedding-3 or Cohere’s models. For personal knowledge bases, this trade-off is usually acceptable—you’re searching your own notes, where even imperfect semantic understanding beats keyword-only search. But if you’re building a production search feature for customer-facing documentation where quality is paramount, cloud-based alternatives will outperform QMD’s local models. The tool also lacks advanced features like faceted search, spelling correction beyond basic fuzzy matching, or real-time indexing pipelines. It’s a CLI tool for personal use, not a replacement for Elasticsearch or Meilisearch in production applications.

Verdict

Use QMD if you maintain personal knowledge bases (Obsidian notes, meeting transcripts, documentation archives) and want semantic search without cloud dependencies, or if you’re building AI agents via MCP that need sophisticated local document retrieval. It’s particularly valuable when privacy matters—legal documents, confidential notes, proprietary research—or when you’re working offline frequently. The combination of hybrid search and LLM re-ranking genuinely improves relevance over simple keyword search, and the MCP integration makes it drop-in ready for agentic workflows.

Skip it if you need production-grade search infrastructure with high availability and scale, if your machine can’t spare 4-8GB RAM for models, or if you’re already invested in cloud search APIs and don’t have privacy concerns. Also skip it if you need real-time indexing or are searching mostly structured data (logs, metrics) where specialized tools like Grafana Loki or Elasticsearch would be more appropriate. For the target use case—personal knowledge management with semantic capabilities—QMD is excellent. For everything else, it’s the wrong tool.
