QMD: A Three-Stage Search Engine That Runs Entirely on Your Laptop
Hook
Most semantic search tools send your documents to a hosted API such as OpenAI's. QMD runs the entire pipeline (embeddings, reranking, and all) locally, using GGUF models that fit in your GPU's VRAM.
Context
Developer documentation sprawls. You've got Notion pages, markdown notes, meeting transcripts, and API docs scattered across directories. Traditional grep fails because it can't understand that "authentication" and "login flow" are related. Cloud-based semantic search tools like Algolia or Pinecone solve this, but they require uploading your documents to third-party servers—a non-starter for proprietary codebases or personal knowledge bases.
QMD emerged from this privacy-first constraint: build a search engine that matches the semantic understanding of cloud tools while keeping every byte of data and computation on your machine. It implements current state-of-the-art retrieval techniques (BM25, dense vectors, LLM reranking) in a TypeScript CLI that indexes your markdown files and serves results through both a command-line interface and the Model Context Protocol, making it directly usable by AI agents like Claude Desktop.
Technical Insight
The architecture is a three-stage ranking funnel, where each stage narrows results using progressively more expensive but more accurate techniques. The first stage runs BM25, a probabilistic keyword ranking function that scores documents by term frequency and inverse document frequency, normalized by document length. BM25 is fast (milliseconds per query) but purely lexical. It finds "authentication" in your docs but misses "login credentials" unless you type those exact words.
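To make the lexical stage concrete, here is a minimal sketch of BM25 scoring over an in-memory corpus. It is not QMD's implementation: the `Doc` type, the `tokenize` helper, and the `k1`/`b` defaults are assumptions chosen for illustration.

```typescript
// Illustrative BM25 scorer; names and defaults are assumptions, not QMD's code.
type Doc = { id: string; text: string };

const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

function bm25Scores(query: string, docs: Doc[], k1 = 1.2, b = 0.75): Map<string, number> {
  const tokenized = docs.map((d) => ({ id: d.id, tokens: tokenize(d.text) }));
  const totalLen = tokenized.reduce((sum, d) => sum + d.tokens.length, 0);
  const avgLen = totalLen / Math.max(docs.length, 1);

  // Document frequency: how many documents contain each term.
  const df = new Map<string, number>();
  tokenized.forEach((d) => {
    new Set(d.tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  });

  const queryTerms = tokenize(query);
  const scores = new Map<string, number>();
  tokenized.forEach((d) => {
    let score = 0;
    queryTerms.forEach((term) => {
      const tf = d.tokens.filter((t) => t === term).length;
      if (tf === 0) return; // term absent: contributes nothing
      const n = df.get(term) ?? 0;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Longer-than-average documents are penalized via b.
      const lengthNorm = 1 - b + (b * d.tokens.length) / avgLen;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    });
    scores.set(d.id, score);
  });
  return scores;
}

const corpus: Doc[] = [
  { id: "auth", text: "Authentication uses API tokens" },
  { id: "login", text: "Login credentials are stored in the keychain" },
];
const lexicalScores = bm25Scores("authentication", corpus);
// "login" scores 0 because it shares no terms with the query.
```

The second document is about the same topic but gets a score of zero, which is exactly the gap the vector stage is meant to close.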
The second stage generates vector embeddings using a local transformer model (via node-llama-cpp). When you index documents, QMD chunks them into sections and runs each through an embedding model like nomic-embed-text. These embeddings live in SQLite as binary blobs, and queries get embedded the same way. A cosine similarity search retrieves semantically related documents even when keywords don't match. Here's what indexing looks like programmatically:
import { QMD } from '@tobi/qmd';

const qmd = new QMD({
  dbPath: './search.db',
  embeddingModel: 'nomic-embed-text-v1.5.Q4_K_M.gguf',
  rerankModel: 'bge-reranker-v2-m3.Q4_K_M.gguf'
});

// Add a collection that watches a directory
await qmd.addCollection({
  name: 'docs',
  path: './documentation/**/*.md',
  contextAnnotations: {
    type: 'product_docs',
    version: '2.0'
  }
});

// Index all documents
await qmd.index();
The contextAnnotations object is where QMD's hierarchical context system shines. Documents inherit annotations from parent directories, creating a semantic tree. If you annotate a top-level "API Reference" folder with {section: 'api', audience: 'developers'}, every child document carries that context. During reranking, the LLM sees not just the document text but also its lineage, helping it understand that a page titled "Authentication" in the API section is more relevant to "how do I authenticate API requests" than an "Authentication" page in your company handbook.
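The inheritance rule can be sketched as a walk from the root directory down to the file, merging annotations along the way. This is an assumption about how such a system might work, not QMD's actual data model: the `declared` map and `resolveAnnotations` function are hypothetical names.

```typescript
type Annotations = Record<string, string>;

// Hypothetical store: directory path -> annotations declared on that directory.
const declared = new Map<string, Annotations>([
  ["docs", { audience: "developers" }],
  ["docs/api", { section: "api" }],
  ["docs/api/v2", { version: "2.0" }],
]);

// Walk from the top-level directory to the file's own directory, merging as we
// go, so annotations on deeper directories override those from ancestors.
function resolveAnnotations(filePath: string, store: Map<string, Annotations>): Annotations {
  const dirs = filePath.split("/").slice(0, -1); // drop the file name
  const merged: Annotations = {};
  let prefix = "";
  for (const dir of dirs) {
    prefix = prefix ? `${prefix}/${dir}` : dir;
    Object.assign(merged, store.get(prefix) ?? {});
  }
  return merged;
}

const apiPage = resolveAnnotations("docs/api/v2/authentication.md", declared);
const handbookPage = resolveAnnotations("docs/handbook/authentication.md", declared);
// apiPage carries audience, section, and version; handbookPage only audience.
```

Two files with the same name end up with different context, which is the signal a reranker could use to tell them apart.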
The third stage is LLM-based reranking. The top candidates from vector search get passed to a small reranking model (typically a BERT-based cross-encoder) that scores each document against the original query. This model understands nuanced relationships—it knows that "Python SDK installation" is a better match for "how to install the library" than "Python best practices," even if the embeddings are similar. The reranker runs locally using GGUF quantized models, so inference happens in seconds on a modern laptop GPU.
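The shape of this stage is simple even though the model is not: take a shortlist from retrieval, score each (query, document) pair jointly, and re-sort. The sketch below shows only that shape. The `overlapScorer` is a crude stand-in for a real cross-encoder, and a real scorer would be asynchronous; all names here are hypothetical.

```typescript
type Candidate = { id: string; text: string; retrievalScore: number };
type PairScorer = (query: string, text: string) => number;

// Re-score the best `shortlistSize` retrieval candidates with a pairwise
// scorer and return them ordered by the new score.
function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  shortlistSize = 20,
): Array<Candidate & { rerankScore: number }> {
  return candidates
    .slice()
    .sort((a, b) => b.retrievalScore - a.retrievalScore)
    .slice(0, shortlistSize)
    .map((c) => ({ ...c, rerankScore: scorePair(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore);
}

// Stand-in scorer (NOT a cross-encoder): fraction of document words that
// also appear in the query.
const overlapScorer: PairScorer = (query, text) => {
  const queryWords = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  if (words.length === 0) return 0;
  return words.filter((w) => queryWords.has(w)).length / words.length;
};

const candidates: Candidate[] = [
  { id: "best-practices", text: "Python best practices for readable code", retrievalScore: 0.74 },
  { id: "sdk-install", text: "Install the Python SDK library with pip", retrievalScore: 0.71 },
];
const reranked = rerank("how to install the library", candidates, overlapScorer);
// "sdk-install" now ranks first despite its lower retrieval score.
```

Because the scorer sees the query and the document together, it can reorder candidates whose standalone embeddings look equally close to the query.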
Query expansion happens automatically before the pipeline runs. QMD uses an LLM to generate variations of your query optimized for different search backends. If you search "auth errors," it might expand to "authentication failures," "login error messages," and "authorization exceptions" for BM25, while keeping the original semantic query for vector search. This happens in a single LLM call:
const results = await qmd.search('auth errors', {
  limit: 10,
  expandQuery: true,   // Generates backend-specific query variations
  includeContext: true // Include parent document annotations
});

results.forEach(result => {
  console.log(`${result.title} (score: ${result.score})`);
  console.log(`Context: ${JSON.stringify(result.contextAnnotations)}`);
});
The MCP (Model Context Protocol) server integration is what makes QMD unique for agent workflows. You can start an HTTP MCP server that Claude Desktop or other AI agents can query directly. The server keeps embedding and reranking models loaded in VRAM across requests, avoiding the 10-15 second cold-start penalty on every query. Idle contexts get cleaned up after 5 minutes to free memory:
qmd mcp --transport http --port 3000
Now Claude can search your documentation by calling MCP tools. When you ask Claude "How does our rate limiting work?", it queries your indexed docs, retrieves the relevant sections, and synthesizes an answer without your full documentation set ever being uploaded to Anthropic's servers. The query happens locally, the reranking happens locally, and only the final curated context gets included in Claude's prompt, so those retrieved excerpts are the only document text that leaves your machine.
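The keep-warm behavior of the server, with eviction after an idle period, can be sketched as a small cache around an expensive loader. This is one plausible design rather than QMD's code: the `IdleCache` class and its injectable clock are assumptions, and the clock is faked so the example is deterministic.

```typescript
// Holds an expensive resource (for example a loaded model) between requests
// and releases it once it has been idle for at least `idleMs`.
class IdleCache<T> {
  private entry: { value: T; lastUsed: number } | null = null;
  private readonly load: () => T;
  private readonly dispose: (value: T) => void;
  private readonly idleMs: number;
  private readonly now: () => number;

  constructor(load: () => T, dispose: (value: T) => void, idleMs: number, now: () => number = Date.now) {
    this.load = load;
    this.dispose = dispose;
    this.idleMs = idleMs;
    this.now = now;
  }

  // Returns the resource, paying the load cost only on a cold start.
  get(): T {
    if (this.entry === null) this.entry = { value: this.load(), lastUsed: 0 };
    this.entry.lastUsed = this.now();
    return this.entry.value;
  }

  // Intended to be called periodically; returns true if something was freed.
  sweep(): boolean {
    if (this.entry !== null && this.now() - this.entry.lastUsed >= this.idleMs) {
      this.dispose(this.entry.value);
      this.entry = null;
      return true;
    }
    return false;
  }
}

// Fake clock (milliseconds) so the behavior can be followed step by step.
let clock = 0;
let loads = 0;
const FIVE_MINUTES = 5 * 60 * 1000;
const models = new IdleCache(() => { loads += 1; return { model: "embedder" }; }, () => {}, FIVE_MINUTES, () => clock);

models.get();                      // cold start: loads the model
clock = 60 * 1000;
models.get();                      // warm: reuses the loaded model
const earlySweep = models.sweep(); // false: used just now
clock += FIVE_MINUTES;
const lateSweep = models.sweep();  // true: idle for five minutes
models.get();                      // cold start again
```

In a long-running server the `sweep` call would typically sit on a timer, trading a little memory for the avoided cold start on each request.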
Under the hood, SQLite stores everything: document text, embeddings as BLOB columns, BM25 indexes via FTS5, and a separate table tracking document lineage. This is architecturally important—SQLite handles concurrency, indexing, and persistence without requiring a separate vector database. The tradeoff is that vector similarity search uses brute-force cosine similarity (compute similarity against every embedding), which doesn't scale past tens of thousands of documents. For personal knowledge bases, this is fine. For production search, you'd need approximate nearest neighbor indexes like HNSW.
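The brute-force search described above amounts to a linear scan. The following sketch assumes embeddings have already been read from storage into `Float32Array` values; the `StoredEmbedding` type and the toy three-dimensional vectors are illustrative only.

```typescript
type StoredEmbedding = { id: string; embedding: Float32Array };

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Brute force: compare the query against every stored vector, then sort.
// Cost grows linearly with the number of rows, which is why it stops scaling.
function topK(query: Float32Array, rows: StoredEmbedding[], k: number): Array<{ id: string; score: number }> {
  return rows
    .map((row) => ({ id: row.id, score: cosine(query, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const storedRows: StoredEmbedding[] = [
  { id: "login-flow", embedding: new Float32Array([0.9, 0.1, 0.0]) },
  { id: "rate-limits", embedding: new Float32Array([0.0, 0.2, 0.9]) },
  { id: "oauth", embedding: new Float32Array([0.8, 0.3, 0.1]) },
];
const nearest = topK(new Float32Array([1, 0, 0]), storedRows, 2);
// nearest is ordered: "login-flow", then "oauth"
```

An approximate index such as HNSW avoids visiting every row, at the cost of occasionally missing a true nearest neighbor.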
Gotcha
Resource requirements are the first wall you'll hit. The default embedding model (nomic-embed-text) needs about 500MB of VRAM, and reranking models add another 300-500MB. If you're running this on a laptop with integrated graphics or limited RAM, expect slow performance or out-of-memory errors. Indexing 10,000 markdown files can take 30-60 minutes because every document gets chunked, embedded, and stored locally, with no parallelization across machines. There is no fully incremental indexing either, though file modification checks at least prevent re-embedding files that haven't changed.
Query latency is the second gotcha. A single search with query expansion, vector retrieval, and reranking takes 2-5 seconds on a MacBook Pro with an M2 chip. That's acceptable for interactive CLI usage but glacial compared to pure BM25 search (milliseconds) or managed services like Algolia (sub-100ms). The automatic query expansion adds overhead even when you don't need it—searching for an exact file name still triggers LLM-based query variations. There's no query planner that skips stages when a simple keyword lookup would suffice.
The TypeScript implementation also limits deployment options. Unlike Python-based tools (txtai, LlamaIndex), you can't easily integrate QMD into data science workflows or Jupyter notebooks. It's CLI-first, with an SDK that assumes Node.js environments. If your team uses Python for ML pipelines, bridging to QMD requires running it as a subprocess or MCP server, adding architectural complexity.
Verdict
Use if: You're building local-first AI agents that need private semantic search over personal or proprietary knowledge bases. You have a modern laptop with a dedicated GPU or ample RAM and work with document collections under 50,000 files. You value privacy enough to accept slower queries and higher resource usage than cloud alternatives. The MCP integration makes this ideal for Claude Desktop power users who want to query personal notes, meeting transcripts, or internal docs without uploading them.
Skip if: You need production search with sub-second latencies, work with massive document corpora (millions of files), or lack hardware for local LLM inference. Cloud-based solutions like Algolia or managed Elasticsearch will be faster and more scalable. Also skip if your team is Python-first: txtai or LlamaIndex offer similar local embeddings with better ML ecosystem integration.