Swiss Army Llama: Building Privacy-First Semantic Search with llama.cpp and Advanced Similarity Measures
Hook
Most semantic search systems stop at cosine similarity, but what if your embeddings have non-linear relationships that cosine distance completely misses? Swiss Army Llama implements statistical measures like Hoeffding’s D and distance correlation through a custom Rust library—and that’s just one of its tricks.
Context
The semantic search landscape has been dominated by cloud APIs and managed services for good reason: they’re simple, scalable, and increasingly affordable. But this convenience comes with significant tradeoffs: your documents leave your infrastructure, you’re locked into specific embedding models, and you pay per API call indefinitely. For organizations with compliance requirements, proprietary data, or the need to experiment with cutting-edge open models, these constraints are dealbreakers.
Swiss Army Llama emerged to fill this gap—a self-hosted semantic search service that treats privacy and model flexibility as first-class concerns. Built on FastAPI and llama.cpp, it’s designed for teams who want the full semantic search pipeline (document ingestion, OCR, audio transcription, vector search) running entirely on their own hardware. The ‘Swiss Army’ name is earned: this isn’t just an embedding API; it’s a comprehensive document intelligence platform that handles PDFs with OCR fallback, extracts text from Word documents, transcribes audio through Whisper, and applies multiple embedding pooling strategies. The project stands out by going beyond standard cosine similarity to offer advanced statistical measures for finding semantic relationships that simpler metrics miss.
Technical Insight
Swiss Army Llama’s architecture is a masterclass in pragmatic composition. At its core, llama.cpp handles the heavy lifting of model inference—both for generating embeddings and text completions. This choice is deliberate: llama.cpp’s quantization support means you can run models like Llama 2 13B on consumer hardware with 4-bit or 5-bit quantization, dramatically reducing memory requirements without catastrophic quality loss.
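A quick back-of-envelope calculation shows why quantization makes this feasible on consumer hardware. The 20% overhead factor below is a rough assumption for quantization scales and metadata, not a figure documented by llama.cpp:

```python
def quantized_model_size_gb(params_billion: float, bits_per_weight: float,
                            overhead: float = 1.2) -> float:
    """Rough estimate of a quantized model's weight footprint in GB.

    The overhead multiplier (scales, metadata, runtime buffers) is an
    illustrative assumption, not a llama.cpp-documented constant.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 13B model at 4-bit quantization lands near 8 GB of weights
print(round(quantized_model_size_gb(13, 4), 1))  # 7.8
```

Compare that to the same model at 16-bit floats (roughly 26 GB of raw weights) and the appeal of 4-bit and 5-bit quantization is obvious.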
The embedding pipeline is where things get interesting. When you submit a document, the service first routes it through textract for text extraction. For PDFs, it attempts direct text extraction, but if that yields insufficient content (common with scanned documents), it automatically falls back to Tesseract OCR. Audio files go through Whisper for transcription. Everything converges into a unified text representation that’s then chunked and embedded.
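The routing logic behind that OCR fallback can be sketched in a few lines. The callables below stand in for textract and Tesseract, and the 200-character threshold is an illustrative assumption, not the project’s actual cutoff:

```python
def extract_text(direct_extract, ocr_extract, min_chars=200):
    """Try direct text extraction first; fall back to OCR when the result
    is too short to be real content (typical of scanned PDFs).

    direct_extract/ocr_extract are stand-ins for textract and Tesseract
    calls; min_chars is an assumed heuristic threshold.
    """
    text = direct_extract()
    if len(text.strip()) >= min_chars:
        return text, "direct"
    return ocr_extract(), "ocr"

# A scanned PDF yields almost no direct text, so OCR kicks in
text, route = extract_text(lambda: "  \n ", lambda: "OCR-recovered text")
print(route)  # ocr
```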
Here’s a practical example of using the semantic search endpoint:
```python
import requests

url = "http://localhost:8089/search/"
payload = {
    "query_text": "What are the privacy implications of cloud services?",
    "number_of_most_similar_strings_to_return": 10,
    "similarity_filter_percentage": 0.02,
    "corpus_identifier_string": "policy_docs",
}

response = requests.post(url, json=payload)
results = response.json()

for result in results["search_results"]:
    print(f"Score: {result['similarity_score']:.4f}")
    print(f"Text: {result['text'][:200]}...\n")
```
What makes this architecture powerful is the two-phase search strategy. Phase one uses FAISS (Facebook AI Similarity Search) for rapid approximate nearest neighbor retrieval based on cosine similarity. FAISS creates an index of your embeddings and can filter thousands of vectors in milliseconds. But cosine similarity assumes linear relationships in your embedding space—it measures the angle between vectors, not more complex statistical dependencies.
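At small scale, phase one is just exact cosine ranking; the sketch below shows the math FAISS accelerates with its optimized (and optionally approximate) indices. Pure stdlib, no FAISS required:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def cosine_top_k(query, corpus, k=5):
    """Exact top-k retrieval by cosine similarity -- the ranking FAISS
    computes at scale. Returns (index, score) pairs, best first."""
    scored = sorted(((i, cosine(query, v)) for i, v in enumerate(corpus)),
                    key=lambda pair: -pair[1])
    return scored[:k]

# The vector pointing in the query's direction wins
print(cosine_top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]], k=2))
```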
Phase two is where Swiss Army Llama differentiates itself. After FAISS returns the top-k candidates, the service applies advanced similarity measures through a custom Rust library called fast_vector_similarity. This library implements measures like:
- Hoeffding’s D: Detects any functional relationship between vectors, not just linear correlations
- Distance correlation: Measures both linear and non-linear association
- Kendall’s tau: Rank-based correlation resistant to outliers
- Mutual information: Captures statistical dependency from information theory
These measures can surface semantically related content that cosine similarity misses. For example, if your embeddings capture cyclical patterns (like seasonal trends in financial documents) or threshold relationships (like regulatory compliance triggers), Hoeffding’s D will detect these dependencies where cosine similarity fails.
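To make the difference concrete, here is a pure-Python sketch of one of these measures, distance correlation. This is not the fast_vector_similarity Rust implementation, just the textbook V-statistic computed on scalar sequences; note how it detects a quadratic dependence that linear correlation scores as zero:

```python
from math import sqrt

def _double_centered(xs):
    """Pairwise |xi - xj| distance matrix, double-centered."""
    n = len(xs)
    d = [[abs(xs[i] - xs[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand
             for j in range(n)] for i in range(n)]

def distance_correlation(xs, ys):
    """Sample distance correlation: zero when samples look independent,
    positive for linear AND non-linear dependence."""
    n = len(xs)
    A, B = _double_centered(xs), _double_centered(ys)
    mean_prod = lambda M, N: sum(M[i][j] * N[i][j]
                                 for i in range(n) for j in range(n)) / n ** 2
    dcov2, dvar_x, dvar_y = mean_prod(A, B), mean_prod(A, A), mean_prod(B, B)
    if dvar_x * dvar_y == 0:
        return 0.0
    return (dcov2 / sqrt(dvar_x * dvar_y)) ** 0.5

xs = list(range(-3, 4))
ys = [x * x for x in xs]   # perfect quadratic dependence, zero Pearson correlation
print(distance_correlation(xs, ys) > 0.2)  # True
```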
The embedding pooling strategy is also configurable beyond standard mean pooling. The service supports:
```python
# Example configuration for different pooling strategies
pooling_options = [
    "mean",             # Standard: average all token embeddings
    "svd",              # Singular value decomposition for dimensionality reduction
    "ica",              # Independent component analysis
    "factor_analysis",  # Statistical factor extraction
    "gaussian_random",  # Random projection for speed
]
```
SVD pooling, for instance, identifies the principal components of your text’s embedding space and uses those for representation. This can be more effective than mean pooling for longer documents where different sections have distinct semantic themes. ICA separates statistically independent components, useful when documents mix multiple topics.
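The core of SVD pooling fits in a few lines of NumPy. This is a sketch of the idea only; the service’s actual implementation may differ in centering, sign conventions, and number of components retained:

```python
import numpy as np

def svd_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Pool a (tokens x dim) embedding matrix by taking the top right
    singular vector of the centered matrix -- the direction of greatest
    variance across tokens -- instead of the plain mean.

    Illustrative sketch; not the project's exact SVD pooling code.
    """
    centered = token_embeddings - token_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit-norm principal direction in embedding space

tokens = np.random.default_rng(0).random((10, 8))  # 10 tokens, 8-dim embeddings
print(svd_pool(tokens).shape)  # (8,)
```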
The caching strategy is aggressive and intelligent. Every computed embedding is stored in SQLite with a hash of the input text and model parameters. This means if you reprocess documents or run similar queries, you avoid expensive recomputation. For production deployments, the documentation recommends RAM disks for both the SQLite database and model files:
```shell
# Create and mount a RAM disk for model files
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=32G tmpfs /mnt/ramdisk

# Move the models onto the RAM disk, then symlink them back into place
mv ~/swiss_army_llama/models /mnt/ramdisk/models
ln -s /mnt/ramdisk/models ~/swiss_army_llama/models
```
This configuration can reduce model loading times from 30+ seconds to under 5 seconds for large models, crucial for responsive API performance.
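The hash-keyed cache described above can be sketched with nothing but the standard library. The schema and key format here are illustrative assumptions, not the project’s actual tables:

```python
import hashlib
import json
import sqlite3

class EmbeddingCache:
    """Minimal sketch of cache-by-content-hash: key each vector on a hash
    of the input text plus model parameters, so identical requests never
    recompute. Schema and key format are assumptions for illustration."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vec TEXT)")

    @staticmethod
    def _key(text, model, params):
        blob = json.dumps({"t": text, "m": model, "p": params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get_or_compute(self, text, model, params, compute):
        key = self._key(text, model, params)
        row = self.db.execute(
            "SELECT vec FROM embeddings WHERE key = ?", (key,)).fetchone()
        if row:  # cache hit: skip the expensive model call
            return json.loads(row[0])
        vec = compute(text)
        self.db.execute("INSERT INTO embeddings VALUES (?, ?)",
                        (key, json.dumps(vec)))
        self.db.commit()
        return vec

cache = EmbeddingCache()
vec = cache.get_or_compute("hello", "llama-2-13b", {"bits": 4},
                           lambda text: [0.0, 1.0])  # stand-in for the model call
```

Because the key covers model parameters as well as text, changing the model or pooling strategy naturally misses the cache rather than returning stale vectors.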
The service also exposes endpoints for direct LLM completions, making it a unified interface for both embedding and generation tasks. This is particularly valuable for RAG (Retrieval-Augmented Generation) workflows where you want to embed documents, search them, and generate responses—all using the same local model without bouncing between services.
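The glue for that RAG flow is short. In the sketch below, `search` and `generate` stand in for calls to the service’s search and completion endpoints (only `/search/` is shown earlier; the completion endpoint name is not assumed here), and the prompt template is illustrative:

```python
def rag_answer(question, search, generate, top_k=3):
    """Retrieve, assemble a context-grounded prompt, then generate.
    `search` and `generate` are stand-ins for the service's search and
    completion endpoints; the prompt wording is an assumption."""
    hits = search(question)[:top_k]
    context = "\n\n".join(hit["text"] for hit in hits)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)
```

Swapping the stubs for `requests.post` calls against the two endpoints keeps the whole loop on one local model, with no network hop between retrieval and generation.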
Gotcha
The installation process is the first major friction point. Swiss Army Llama has a dependency tree that reads like a DevOps nightmare: textract (which itself requires system packages for various document formats), Tesseract for OCR, ffmpeg and sox for audio processing, poppler-utils for PDF handling, and Redis for caching. On macOS, you’ll spend 30 minutes with Homebrew. On Linux, the dependency installation varies wildly by distribution. Docker deployment is strongly recommended, but even the Dockerfile is substantial. Budget at least an hour for first-time setup, more if you hit version conflicts.
Performance is strictly bounded by your hardware. With llama.cpp, you’re running models locally, which means a 13B parameter model with 4-bit quantization still needs ~8GB of RAM just for the model weights. Add the embedding cache, FAISS indices, and the application overhead, and you’re looking at 16GB+ for comfortable operation. On CPU-only systems, embedding a 1000-word document can take 5-10 seconds with larger models. GPU acceleration helps dramatically but requires CUDA setup and compatible hardware.
The SQLite-based architecture creates scaling challenges. While perfect for single-user scenarios or internal tools with modest concurrency, SQLite’s write serialization means concurrent embedding requests will queue. Under heavy load, you’ll hit lock contention. The project doesn’t implement connection pooling or async database operations comprehensively, so adding these would require significant refactoring. For production scenarios serving multiple users simultaneously, you’d need to architect around this—perhaps with a job queue system like Celery and PostgreSQL instead of SQLite.
The advanced similarity measures, while powerful, come with computational costs. Computing Hoeffding’s D or distance correlation for hundreds of candidate vectors adds latency. The two-phase approach mitigates this by filtering with FAISS first, but for real-time applications requiring sub-100ms responses, you might need to stick with pure FAISS cosine similarity and skip the advanced measures entirely.
Verdict
Use Swiss Army Llama if:
- You have compliance or privacy requirements that prohibit sending documents to external APIs.
- You want to experiment with different open-source embedding models without API lock-in.
- You’re building internal tools where setup complexity is a one-time cost and hardware resources are available.
- You need the full document processing pipeline (OCR, audio transcription) with semantic search in one package.
- You’re researching embedding similarity measures and want to explore beyond cosine distance.

Skip it if:
- You need production-grade horizontal scalability with high concurrent load; managed services or txtai with proper database backends are better suited.
- You’re working on resource-constrained hardware like low-memory cloud instances or edge devices.
- You want zero-setup simplicity and are comfortable with API costs and data leaving your infrastructure.
- Your use case is straightforward semantic search where sentence-transformers with ChromaDB would suffice without the operational overhead.