Building a RAG System from Scratch: A 90-Minute Workshop Walkthrough

Hook

Many developers build RAG pipelines with LangChain without ever seeing what happens under the hood. This 90-minute workshop strips away the abstractions to show the actual mechanics of retrieval-augmented generation.

Context

The explosion of large language models created a curious problem: these models are powerful but know nothing about your specific data. They can't access your documentation, your company's knowledge base, or any information that wasn't in their training set. The solution—Retrieval-Augmented Generation—sounds complex but boils down to a simple idea: before asking the LLM a question, first retrieve relevant context from your own documents and include it in the prompt.

Most production implementations hide this simplicity behind layers of frameworks, cloud services, and abstractions. LangChain, LlamaIndex, and similar tools are excellent for production but terrible for learning. The llmsnippet repository takes the opposite approach: it's a minimalist teaching tool designed to demonstrate RAG fundamentals in a workshop setting without any framework overhead. By using only Docker containers for Qdrant (vector database) and llama.cpp (local LLM inference) plus a handful of Python scripts, it reveals the core pattern behind everything from document-aware chat assistants to enterprise knowledge systems.

Technical Insight

The architecture follows the classic RAG pipeline: document ingestion with embedding generation, vector storage and similarity search, and context-augmented LLM prompting. What makes this implementation educational rather than production-ready is its deliberate simplicity—every step is visible and modifiable.

The setup uses Docker Compose to orchestrate two services: a Qdrant vector database container and a llama.cpp server running a quantized model. This local-first approach means no API keys, no rate limits, and full visibility into how embeddings and inference actually work. Here's the core pattern for document ingestion:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Initialize embedding model and vector DB client
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(host="localhost", port=6333)

# Create the collection once; the vector size must match the model output
# (384 dimensions for all-MiniLM-L6-v2)
if not client.collection_exists("documents"):
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(
            size=encoder.get_sentence_embedding_dimension(),
            distance=Distance.COSINE,
        ),
    )

# Sample documents; each is short enough to embed whole, so no chunking here
documents = [
    "RAG combines retrieval with generation for better context",
    "Vector databases store embeddings for similarity search",
    "Llama.cpp enables local LLM inference without GPUs"
]

# Embed all documents in one batch and store each vector with its source text
vectors = encoder.encode(documents).tolist()
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=idx, vector=vector, payload={"text": doc})
        for idx, (doc, vector) in enumerate(zip(documents, vectors))
    ],
)

The retrieval step performs cosine similarity search against the stored embeddings. When a user asks a question, you embed that question with the same model, search for the most similar document vectors, and retrieve the original text. This is where the "magic" happens—you're finding semantically related content without keyword matching:

def retrieve_context(query: str, top_k: int = 3) -> str:
    # Embed the query using the same model used at ingestion time
    query_vector = encoder.encode(query).tolist()

    # Search for similar documents (query_points supersedes the older
    # client.search call, which recent qdrant-client releases deprecate)
    results = client.query_points(
        collection_name="documents",
        query=query_vector,
        limit=top_k,
    ).points

    # Extract text from results
    context = "\n".join(hit.payload["text"] for hit in results)
    return context
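
To make the similarity search less of a black box, here is a minimal pure-Python sketch of what a cosine search computes. It is not how Qdrant does it internally (Qdrant uses an approximate HNSW index rather than scanning every vector), and the function names and the three-dimensional toy vectors are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, stored, top_k=3):
    """Score every stored (text, vector) pair and keep the top_k best."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hand-made toy vectors, not real embeddings.
stored = [
    ("doc about retrieval", [1.0, 0.0, 0.0]),
    ("doc about databases", [0.0, 1.0, 0.0]),
    ("doc about inference", [0.0, 0.0, 1.0]),
]

results = brute_force_search([0.9, 0.1, 0.0], stored, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the first axis, so the "retrieval" document ranks first. Real embeddings behave the same way, only in 384 dimensions instead of three.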

The final step sends a prompt to the llama.cpp server with the retrieved context injected before the user's question. The LLM sees both the question and the relevant background information, allowing it to generate informed responses:

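import requests

def generate_response(query: str) -> str:
    context = retrieve_context(query)

    # Inject the retrieved context ahead of the user's question
    prompt = f"""Use the following context to answer the question.

Context:
{context}

Question: {query}

Answer:"""

    response = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": prompt,
            "temperature": 0.7,
            "n_predict": 256,
        },
        timeout=120,  # local inference can be slow; avoid hanging forever
    )
    response.raise_for_status()  # fail loudly on HTTP errors

    return response.json()["content"]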

What this minimal implementation teaches is that RAG isn't magic—it's just vector similarity search plus prompt engineering. The embedding model converts text into numerical representations where semantic similarity translates to geometric proximity. The vector database performs efficient nearest-neighbor search. The LLM is just a function that takes the augmented prompt and generates text. Understanding these three components separately makes production frameworks far less mysterious.
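
The three-component view can be sketched as a single function that takes each component as a plain callable. This is an illustrative sketch, not code from the repository: `answer_with_rag` and the `fake_*` stand-ins are hypothetical names, and the stand-ins exist only so the example runs without a model or a database.

```python
from typing import Callable, List, Sequence

def answer_with_rag(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run the three RAG steps in order: embed, retrieve, generate."""
    query_vector = embed(query)             # 1. text -> vector
    passages = search(query_vector, top_k)  # 2. vector -> nearest texts
    context = "\n".join(passages)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
    return generate(prompt)                 # 3. augmented prompt -> text


# Trivial stand-ins (assumptions for illustration only).
def fake_embed(text: str) -> List[float]:
    return [float(len(text))]

def fake_search(vector: Sequence[float], top_k: int) -> List[str]:
    corpus = ["Qdrant stores vectors.", "llama.cpp runs models locally."]
    return corpus[:top_k]

def fake_generate(prompt: str) -> str:
    return f"(model received {len(prompt)} characters)"

print(answer_with_rag("What stores vectors?", fake_embed, fake_search, fake_generate))
```

Swapping `fake_embed` for a real encoder, `fake_search` for a vector database query, and `fake_generate` for an HTTP call to an inference server gives the full pipeline without changing `answer_with_rag` itself.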

The workshop format likely walks through each component step-by-step: first showing how embeddings capture meaning, then demonstrating similarity search with concrete examples, and finally revealing how context injection improves LLM outputs. By keeping each piece isolated and inspectable, learners can modify parameters, swap models, and experiment with chunking strategies to see immediate effects.

Gotcha

The biggest limitation is that this is explicitly a learning skeleton, not a foundation for real applications. There's no error handling, no consideration for embedding model mismatches between ingestion and retrieval, no chunking strategy for long documents, and no evaluation metrics to measure retrieval quality. The repository appears to lack comprehensive documentation beyond basic setup—you won't find explanations of why specific embedding dimensions were chosen, how to tune similarity thresholds, or what to do when retrieval returns irrelevant results.
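
As one example of what an evaluation step could look like, the sketch below computes recall@k and mean reciprocal rank over a small hand-labeled set. Nothing here comes from the repository: the function names and the tiny evaluation set are assumptions, and a real evaluation would use labeled queries drawn from your own documents.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the first k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant hit, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled data: for each query, the ids the retriever returned
# (best first) and the ids a human marked as relevant.
eval_set = [
    {"retrieved": [2, 0, 1], "relevant": [0]},
    {"retrieved": [1, 2, 0], "relevant": [1, 0]},
]

mean_recall = sum(
    recall_at_k(row["retrieved"], row["relevant"], k=2) for row in eval_set
) / len(eval_set)
mrr = sum(
    reciprocal_rank(row["retrieved"], row["relevant"]) for row in eval_set
) / len(eval_set)

print(f"recall@2 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Tracking these two numbers while changing the embedding model, `top_k`, or the chunk size turns "retrieval feels worse" into something measurable.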

Production RAG systems need sophisticated chunking algorithms (semantic splitting, overlapping windows), hybrid search combining vector and keyword approaches, reranking retrieved results, prompt caching, and robust error handling. This implementation has none of that. The 90-minute timeframe means cutting corners everywhere: no discussion of embedding model selection tradeoffs, no explanation of when cosine versus dot product distance matters, no guidance on collection management or vector indexing strategies. It's a "hello world" that introduces concepts but leaves you stranded when building anything real. The extremely low star count suggests limited community validation—you're largely on your own if you hit issues or want to extend the examples.
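
For a sense of what the simplest chunking strategy involves, here is a minimal sketch of overlapping windows over whitespace-separated words. It is an assumption-laden illustration, not the repository's code: `chunk_words` is a hypothetical helper, and production systems typically split on tokens or sentence boundaries rather than raw words.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows; consecutive windows share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk would then be embedded and stored as its own point, with the overlap reducing the chance that a relevant sentence is cut in half at a boundary.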

Verdict

Use if: You're teaching or learning RAG fundamentals and want to understand what actually happens beneath framework abstractions. This is perfect for workshops, study groups, or personal experimentation where the goal is conceptual clarity rather than production deployment. It's also valuable if you're evaluating whether to build RAG capabilities in-house and need to understand the underlying complexity before committing to a framework.

Skip if: You need production-ready code, comprehensive documentation, or a foundation to build upon. For any real application, start with LangChain or LlamaIndex, which already ship tooling for chunking, evaluation, and error handling. Also skip if you're looking for a quick RAG solution; the learning curve here is intentional, and you'll move faster with higher-level abstractions once you understand the basics.
