
Building RAG Systems From First Principles: A Workshop Teardown


Hook

Most RAG tutorials hide the complexity behind LangChain abstractions. This 90-minute workshop strips everything down to raw vector operations and HTTP calls—and that’s precisely why it’s worth studying.

Context

Retrieval-Augmented Generation has become the default pattern for building LLM applications that need to reference specific knowledge bases. But the ecosystem has fragmented into competing frameworks, each adding layers of abstraction that obscure the fundamental operations: embed documents, store vectors, retrieve similar chunks, inject context into prompts.

The llmsnippet repository emerged as workshop material designed to teach RAG concepts in 90 minutes using only the essential components. Instead of reaching for production frameworks like LangChain or LlamaIndex, it wires together Qdrant vector database and Llama.cpp server directly. This constraint forces learners to understand each step explicitly—chunking strategies, embedding dimensionality, similarity search mechanics, and prompt construction. The result is a minimalist implementation that serves as educational scaffolding rather than production code.

Technical Insight

System architecture (auto-generated diagram): during ingestion, documents are chunked, each chunk is embedded, and the resulting vectors are stored in Qdrant. At query time, the user's question is embedded, a semantic search returns the most relevant chunks, and the context plus query are sent to the Llama.cpp server to produce the generated answer. The conf.py config supplies the endpoints that connect the Python RAG app to both services.

The architecture centers on three discrete services: your Python application, a Qdrant vector database container, and a Llama.cpp inference server container. The configuration file (conf.py) acts as the coordination layer, defining endpoints for both services. This separation immediately reveals a key insight about RAG systems: they’re fundamentally integration projects connecting specialized services.
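The repository doesn't publish the contents of conf.py, but given the architecture it plausibly amounts to a handful of endpoint constants. The names below are illustrative, not taken from the repo:

```python
# conf.py — hypothetical sketch; actual names in the repository may differ
QDRANT_URL = "http://localhost:6333"      # Qdrant REST endpoint (default port)
LLAMA_ENDPOINT = "http://localhost:8080"  # Llama.cpp server endpoint
COLLECTION_NAME = "workshop_docs"         # Qdrant collection used by the demo
TOP_K = 3                                 # number of chunks retrieved per query
```

Centralizing the endpoints in one module is what lets the rest of the code treat Qdrant and Llama.cpp as swappable services.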

The document ingestion pipeline demonstrates the typical RAG preparation phase. Documents get chunked into semantically meaningful segments—the workshop likely uses naive splitting by sentence or paragraph count rather than sophisticated recursive chunking. Each chunk generates an embedding vector through the LLM’s encoding capabilities. These vectors flow into Qdrant collections, where they’re indexed for fast similarity search:

# Simplified ingestion pattern based on architecture
from qdrant_client import QdrantClient
import requests

QDRANT_URL = "http://localhost:6333"
LLAMA_ENDPOINT = "http://localhost:8080"  # Llama.cpp server

client = QdrantClient(url=QDRANT_URL)

def embed_text(text, llm_endpoint):
    response = requests.post(
        f"{llm_endpoint}/embedding",
        json={"content": text}
    )
    response.raise_for_status()
    return response.json()["embedding"]

def ingest_document(doc_text, collection_name):
    # Naive chunking by fixed character count, no overlap
    chunks = [doc_text[i:i+500] for i in range(0, len(doc_text), 500)]

    for idx, chunk in enumerate(chunks):
        vector = embed_text(chunk, LLAMA_ENDPOINT)
        client.upsert(
            # Assumes the collection already exists with matching dimensions
            collection_name=collection_name,
            points=[{
                "id": idx,
                "vector": vector,
                "payload": {"text": chunk}
            }]
        )

The query-time flow reveals where RAG actually happens. User questions get embedded using the same model that encoded documents—this consistency is critical for meaningful similarity scores. Qdrant performs the vector search, returning the top-k most similar chunks based on cosine similarity. These chunks get concatenated into a context string that’s injected into the LLM prompt:

def query_rag(question, collection_name, top_k=3):
    # Embed the question
    question_vector = embed_text(question, LLAMA_ENDPOINT)
    
    # Search for similar document chunks
    results = client.search(
        collection_name=collection_name,
        query_vector=question_vector,
        limit=top_k
    )
    
    # Extract context from results
    context = "\n".join([hit.payload["text"] for hit in results])
    
    # Construct augmented prompt
    prompt = f"""Context information:
{context}

Question: {question}

Answer based on the context provided:"""
    
    # Generate response
    response = requests.post(
        f"{LLAMA_ENDPOINT}/completion",
        json={
            "prompt": prompt,
            "max_tokens": 256
        }
    )
    
    return response.json()["content"]
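The similarity score behind that top-k search is plain cosine similarity between the query vector and each stored chunk vector. A self-contained sketch of the arithmetic Qdrant performs internally:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [2.0, 0.0])  # → 1.0 (same direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # → 0.0 (orthogonal)
```

This is also why the question must be embedded by the same model that embedded the documents: cosine similarity is only meaningful when both vectors live in the same embedding space.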

The local-first architecture choice—running both Qdrant and Llama.cpp in Docker containers—has pedagogical advantages. Workshop participants see exactly how vector databases expose their APIs and how local LLMs consume prompts. There’s no OpenAI API key abstracting away the embedding process. No managed vector database hiding indexing strategies. Every operation is observable and modifiable.

This transparency comes at the cost of production concerns. There’s no connection pooling, no retry logic, no validation that embeddings match expected dimensions, no handling of Qdrant collection initialization. The workshop format assumes a controlled environment where these edge cases don’t surface during the 90-minute session. The code teaches the happy path—documents exist, services respond, vectors match—which is exactly appropriate for introductory material.
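As one illustration of the missing hardening, a minimal retry wrapper around the HTTP calls might look like the sketch below. This is hypothetical, not code from the repo:

```python
import time

def with_retries(fn, attempts=3, backoff=0.5):
    # Retry a callable with exponential backoff; re-raise on final failure.
    # Hypothetical hardening helper, not part of the workshop code.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# Usage sketch: with_retries(lambda: embed_text(chunk, LLAMA_ENDPOINT))
```

Even a wrapper this small changes the failure mode from a cryptic stack trace mid-ingestion to a bounded, observable recovery attempt.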

One subtle but important detail: using Llama.cpp instead of cloud APIs teaches participants about the latency-accuracy tradeoff in LLM selection. Local models run slower but cost nothing and keep data private. This forces discussions about when RAG makes sense versus fine-tuning, when local deployment is feasible, and how model size impacts retrieval strategies.

Gotcha

The repository’s biggest limitation is its documentation vacuum. The README provides high-level steps—“install Qdrant, run Llama.cpp, configure endpoints”—but omits critical details. Which Llama model should you download? What collection schema does Qdrant expect? What Python dependencies beyond qdrant-client and requests? The 4-star count and minimal community engagement suggest these gaps haven’t been filled through issues or community contributions.

This creates a chicken-and-egg problem: the code is simple enough to understand once running, but getting it running requires supplementing missing information. Developers hoping to use this as a learning resource will spend more time on environment setup than studying RAG mechanics. The workshop format likely included verbal instructions and a prepared environment that made these details obvious in person but invisible in the repository. Anyone approaching this cold will hit immediate friction around Docker networking, model compatibility with Llama.cpp, and Qdrant collection configuration. Production concerns are completely absent—no error boundaries, no input validation, no conversation memory, no evaluation metrics. This is intentional for a 90-minute workshop but makes the code unsuitable as a foundation for real applications without substantial hardening.

Verdict

Use if: You’re specifically attending this workshop and need the reference implementation, or you want the absolute minimal skeleton for understanding RAG mechanics without framework abstractions. This is valuable for experienced developers who want to see raw vector operations and can fill in the infrastructure gaps themselves. It’s also useful if you’re designing your own RAG workshop and need a template for what can be covered in 90 minutes.

Skip if: You need production-ready code, comprehensive documentation, or a learning resource you can follow independently. The sparse documentation makes self-directed learning frustrating. If you’re building actual applications, reach for LlamaIndex or LangChain, which handle edge cases and provide extensive examples. If you’re learning RAG concepts, consider txtai or well-documented tutorials that explain the ‘why’ alongside the ‘how.’ This repository assumes too much context to serve as standalone educational material despite its pedagogical architecture.
