R2R: The Production RAG System That Actually Ships With User Management

Hook

Most RAG tutorials end with 'now add user management, authentication, and multi-tenancy as an exercise for the reader.' R2R is what happens when someone actually does that exercise.

Context

The RAG landscape has a painful gap. On one side, you have flexible frameworks like LangChain and LlamaIndex that are perfect for prototyping—you can have a semantic search demo running in 50 lines of code. On the other side, you have production requirements: user authentication, document-level permissions, collection management, hybrid search, API rate limiting, and observability. The journey from prototype to production typically involves months of infrastructure work, bolting together authentication systems, building REST APIs, implementing proper database schemas, and handling all the edge cases that emerge when real users touch your system.

R2R (a name the project has expanded as 'RAG to Riches') positions itself in this gap as a 'batteries-included' RAG system. It is not a toolkit for assembling RAG pipelines; it is a complete RAG application with a RESTful API that you can deploy and scale. Think of it as the difference between React (a library you build with) and Next.js (a framework with opinions and infrastructure). R2R makes architectural decisions for you: PostgreSQL for metadata, pgvector for embeddings, FastAPI for the REST layer, and a pipeline architecture for document processing. This opinionated approach means less flexibility but a much shorter path to production for teams building standard RAG applications.

Technical Insight

R2R's architecture revolves around three core abstractions: pipelines, providers, and collections. Pipelines handle document ingestion and search operations as composable stages. Providers are pluggable implementations for embeddings, LLMs, vector databases, and knowledge graphs. Collections group documents with shared access controls, enabling multi-tenancy out of the box.
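
To make the collection idea concrete, here is a minimal sketch of collection-scoped visibility. It is not R2R's implementation: the `Document` and `User` classes and the `visible_documents` helper are hypothetical names introduced only for illustration, and the rule shown (a user sees a document when they share at least one collection) is an assumption about how such access control typically works.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    # Hypothetical record: a document may belong to several collections.
    doc_id: str
    collection_ids: frozenset[str]


@dataclass
class User:
    # Hypothetical record: the collections this user has been granted.
    email: str
    collection_ids: set[str] = field(default_factory=set)


def visible_documents(user: User, documents: list[Document]) -> list[Document]:
    """Return the documents that share at least one collection with the user."""
    return [d for d in documents if d.collection_ids & user.collection_ids]


docs = [
    Document("paper-1", frozenset({"ai_research"})),
    Document("memo-7", frozenset({"finance"})),
    Document("shared-3", frozenset({"ai_research", "finance"})),
]
alice = User("alice@example.com", {"ai_research"})
print([d.doc_id for d in visible_documents(alice, docs)])
```

In a multi-tenant deployment the same filter would be applied inside the database query rather than in application code, so that chunks a user cannot see never reach the retrieval step.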

The ingestion pipeline demonstrates this design. When you upload a document, it flows through parsing (extracting text from PDFs, DOCX, audio), chunking (breaking content into retrievable segments), embedding (vectorizing chunks), and storage (persisting to PostgreSQL + pgvector). Here's what a basic document ingestion looks like:

from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Authenticate
client.login("user@example.com", "password")

# Ingest with metadata and collection assignment
result = client.ingest_files(
    file_paths=["./research_paper.pdf"],
    metadatas=[{
        "title": "Attention Is All You Need",
        "year": 2017,
        "department": "research"
    }],
    document_ids=["doc_transformer_paper"],
    collection_ids=["ai_research_collection"]
)

What's noteworthy here is what you don't see: no manual chunking strategy, no embedding model selection, no vector database connection management. R2R handles these with sensible defaults while remaining configurable through its provider system.
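
For readers who want a feel for what those hidden stages do, the following is a toy sketch of chunk, embed, and store. It is an illustration under stated assumptions, not R2R's pipeline: parsing is skipped, `chunk_text`, `toy_embed`, and `ingest` are hypothetical helpers, the "embedding" is a hashed bag-of-words stand-in for a real model, and a Python list stands in for PostgreSQL + pgvector.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def toy_embed(chunk: str, dim: int = 8) -> list[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in chunk.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(text: str, store: list[dict]) -> int:
    """Run chunk -> embed -> store and return the number of chunks written."""
    chunks = chunk_text(text, chunk_size=40, overlap=10)
    for position, chunk in enumerate(chunks):
        store.append({"position": position, "text": chunk, "embedding": toy_embed(chunk)})
    return len(chunks)


store: list[dict] = []
count = ingest("Transformers replace recurrence with self-attention. " * 3, store)
print(count, len(store[0]["embedding"]))
```

The overlap exists so that a sentence cut by one window boundary still appears whole in a neighbouring chunk; chunk size and overlap are exactly the kind of defaults a batteries-included system picks for you.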

The hybrid search implementation is where R2R differentiates itself from simple vector similarity solutions. It combines semantic search (dense embeddings) with keyword search (BM25) using reciprocal rank fusion to merge results. This matters because pure semantic search can miss exact matches: searching for 'GPT-4' might return documents about 'language models' when you want literal mentions. The search API exposes this:

# Hybrid search (semantic + full-text) restricted by a metadata filter
results = client.search(
    query="What are the architectural differences between transformers and RNNs?",
    search_settings={
        "use_hybrid_search": True,
        "use_semantic_search": True,
        "use_fulltext_search": True,
        "semantic_search_weight": 0.7,
        "fulltext_search_weight": 0.3,
        "filters": {"department": "research"},
        "limit": 10
    }
)
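
Reciprocal rank fusion itself is simple enough to sketch in a few lines. The version below is a generic weighted variant, not R2R's code: the function name, the document IDs, and the choice to apply the 0.7/0.3 weights as per-list multipliers are assumptions made for illustration, and `k = 60` is the smoothing constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds weight / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights.get(name, 1.0) / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the two retrievers, best match first.
rankings = {
    "semantic": ["doc_llm_survey", "doc_gpt4_report", "doc_rnn_intro"],
    "fulltext": ["doc_gpt4_report", "doc_release_notes"],
}
weights = {"semantic": 0.7, "fulltext": 0.3}

for doc_id, score in reciprocal_rank_fusion(rankings, weights):
    print(f"{doc_id}: {score:.5f}")
```

Because the fusion works on ranks rather than raw scores, it does not matter that cosine similarities and BM25 scores live on different scales. In this example `doc_gpt4_report` rises to the top because both retrievers return it, even though neither the semantic list nor the weighting alone would put it first.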

The agentic layer—what R2R calls 'Deep Research'—is the most ambitious component. Rather than simple retrieve-and-generate RAG, it orchestrates multi-step reasoning. Give it a complex query like 'Compare the carbon footprint of training GPT-3 versus running inference at scale for a year,' and it breaks this into sub-questions, retrieves relevant information for each, potentially searches the internet for missing data, and synthesizes a comprehensive answer. This uses a ReAct-style agent loop with tool access:

# Agentic research with extended thinking
research = client.agent(
    messages=[
        {"role": "user", "content": "Analyze the security implications of using RAG systems with proprietary data"}
    ],
    rag_generation_config={
        "model": "anthropic/claude-3-5-sonnet-20241022",
        "temperature": 0.7,
        "thinking": {
            "type": "extended",
            "budget_tokens": 10000
        }
    },
    search_settings={
        "use_hybrid_search": True
    }
)
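
The control flow of a ReAct-style loop can be sketched without any model at all. The example below is a toy under stated assumptions, not R2R's agent: `scripted_policy` is a hard-coded stand-in for the LLM's reasoning step, `search_documents` is a hypothetical tool backed by a two-entry dictionary, and the step budget is an arbitrary safeguard.

```python
from typing import Callable


def search_documents(query: str) -> str:
    """Hypothetical retrieval tool; a real one would call a search backend."""
    corpus = {
        "prompt injection": "Retrieved text can carry instructions that hijack the model.",
        "access control": "Chunks must be filtered by the caller's permissions before generation.",
    }
    return corpus.get(query, "no results")


TOOLS: dict[str, Callable[[str], str]] = {"search_documents": search_documents}


def scripted_policy(question: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: choose the next (action, argument) from what is known so far."""
    plan = ["prompt injection", "access control"]  # sub-questions to look up
    if len(observations) < len(plan):
        return "search_documents", plan[len(observations)]
    return "finish", " ".join(observations)


def react_loop(question: str, max_steps: int = 5) -> str:
    """Alternate between choosing an action and recording the tool's observation."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, argument = scripted_policy(question, observations)
        if action == "finish":
            return argument
        observations.append(TOOLS[action](argument))
    return "step budget exhausted"


answer = react_loop("What are the security implications of RAG over proprietary data?")
print(answer)
```

In a real agent the policy is a model call that sees the question and all prior observations, and the final step is a synthesis rather than a string join. The step budget matters in practice because each iteration costs a model call and a retrieval round trip.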

The knowledge graph integration adds another retrieval dimension. During ingestion, R2R can extract entities and relationships, storing them in a graph structure. This enables relationship-aware retrieval—finding documents not just by content similarity but by entity connections. For a query about a specific researcher, it can traverse the graph to find co-authors, cited papers, and related institutions.
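
Relationship-aware retrieval of this kind reduces to a bounded graph traversal. The sketch below assumes an extraction step has already produced (subject, relation, object) triples; the entity names are placeholders, `build_adjacency` and `neighbors_within` are hypothetical helpers, and R2R's actual graph storage and query path may look quite different.

```python
from collections import deque

# Placeholder triples, shaped like the output of an entity-extraction step.
TRIPLES = [
    ("Researcher A", "authored", "Paper X"),
    ("Researcher B", "authored", "Paper X"),
    ("Researcher A", "affiliated_with", "Institute Q"),
    ("Paper X", "cites", "Paper Y"),
]


def build_adjacency(triples: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Treat every relation as an undirected edge between two entities."""
    adjacency: dict[str, set[str]] = {}
    for subject, _relation, obj in triples:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    return adjacency


def neighbors_within(adjacency: dict[str, set[str]], start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first search returning {entity: hop distance} for entities within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


graph = build_adjacency(TRIPLES)
print(neighbors_within(graph, "Researcher A", max_hops=2))
```

Starting from one researcher, a two-hop limit reaches their paper and institution at distance one, then the co-author and the cited paper at distance two. Documents attached to those entities can then be added to the candidate set alongside the similarity hits.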

Deployment comes in two flavors. 'Light mode' runs entirely in Python with minimal dependencies, using in-memory or file-based storage—useful for development or small-scale deployments. 'Full mode' is Docker Compose-based with PostgreSQL, Redis for caching, and Hatchet for orchestration. The full stack handles user management with email verification, collection-based access control, and API key authentication. This infrastructure depth is both R2R's strength and its burden—you get production features immediately, but you're also committing to operating this stack.

Gotcha

R2R's comprehensiveness cuts both ways. The system makes strong architectural assumptions—PostgreSQL for metadata, pgvector for embeddings, FastAPI for the API layer. If your organization has standardized on different infrastructure (say, MongoDB and Elasticsearch), you're fighting against R2R's design rather than working with it. The provider abstraction theoretically allows swapping components, but in practice, much of the system assumes PostgreSQL's capabilities.

The agentic features and knowledge graph extraction introduce significant computational overhead. Running entity extraction on every ingested document and maintaining graph structures means higher processing costs and slower ingestion compared to simple chunk-and-embed pipelines. For straightforward Q&A use cases where users ask direct questions about known documents, you're paying for sophistication you don't need. The documentation acknowledges this by offering light mode, but then you lose the production features that justify choosing R2R in the first place.

API stability is a legitimate concern. The project shows signs of rapid development—the 'Deep Research' API and extended thinking support are recent additions. The GitHub issues reveal breaking changes between versions, and the documentation sometimes lags behind the codebase. Teams building critical systems should pin versions carefully and budget time for migration work. This is the tax of adopting a fast-moving open-source project versus a mature, stable platform.

Verdict

Use if: You're building a production RAG application that needs enterprise features (user management, multi-tenancy, access controls) and you're willing to adopt R2R's PostgreSQL-based stack. It's especially compelling if you need agentic reasoning capabilities or knowledge graph features and don't want to build them yourself. Teams with 2-10 engineers who want to ship RAG features in weeks rather than months will find the most value: you're trading architectural control for velocity.

Skip if: You're prototyping and need maximum flexibility to experiment with different chunking strategies, embedding models, or retrieval approaches; LangChain or LlamaIndex will serve you better. Also skip if you have strong infrastructure preferences that conflict with R2R's stack, if you're building simple semantic search that doesn't need user management complexity, or if API stability is non-negotiable and you can't tolerate breaking changes. For teams already committed to managed services, Vectara offers similar capabilities without the operational overhead of running the stack yourself.
