Onyx: Building ChatGPT-Grade AI Search with Hybrid RAG and MCP Agents
Hook
Most RAG systems fail at research tasks because they rely solely on vector similarity. Onyx claims the top leaderboard spot for deep research by combining hybrid search with agentic workflows—and it runs entirely on your infrastructure.
Context
The explosion of LLM-powered chat interfaces created a paradox for enterprises: ChatGPT and Claude deliver exceptional user experiences, but they require sending sensitive data to third parties. Self-hosted alternatives like Open-WebUI emerged, but they primarily wrapped model APIs without solving the core knowledge retrieval problem. Organizations need their AI to answer questions using internal documents, databases, and proprietary systems—not just the model's training data.
This is where Retrieval-Augmented Generation (RAG) became the industry standard. But early RAG implementations hit a ceiling. Pure vector search works well for semantic similarity but struggles with precise queries, dates, or factual lookups. Keyword search handles exact matches but misses conceptual relationships. Onyx addresses this fundamental limitation with a hybrid architecture that combines both approaches, adds 50+ native connectors to enterprise data sources, and introduces agentic capabilities through the Model Context Protocol (MCP). The result is a platform that transforms any LLM into a knowledgeable assistant over your actual data, with deployment options ranging from sub-1GB Docker containers to distributed production clusters.
Technical Insight
Onyx's architecture makes a critical separation between deployment modes that most platforms miss. The 'Lite' mode strips away infrastructure dependencies to deliver a pure chat interface in under 1GB of memory—no vector database, no background workers, no blob storage. This isn't a demo; it's a fully functional chat UI that connects to any LLM provider. But the real engineering depth appears in 'Standard' mode, where Onyx becomes a complete RAG platform.
The hybrid search implementation is the technical centerpiece. Traditional RAG systems force you to choose between vector embeddings (semantic search) or keyword indexing (exact match). Onyx runs both in parallel and uses a reranking model to merge results. When you query for 'Q3 revenue projections,' the keyword index catches the exact quarter reference while the vector store finds semantically related budget documents. The reranker then orders results by relevance to the actual query. This architecture explains their claimed leaderboard position for research tasks—hybrid search catches what pure vector similarity misses.
Here's what a basic connector configuration looks like when integrating a data source:
# Example: Custom connector configuration for Onyx
from onyx.connectors.base import BaseConnector
from onyx.connectors.models import Document, Section
class CustomDatabaseConnector(BaseConnector):
def __init__(self, connection_string: str, **kwargs):
super().__init__(**kwargs)
self.conn_str = connection_string
def load_credentials(self, credentials: dict) -> None:
# Handle OAuth, API keys, or service account auth
self.credentials = credentials
def poll_source(self, start_time: datetime) -> list[Document]:
# Fetch documents modified since last sync
records = self.fetch_records(start_time)
documents = []
for record in records:
doc = Document(
id=record.id,
sections=[Section(
text=record.content,
link=record.url
)],
source=self.source_type,
metadata={
"updated_at": record.updated_at,
"department": record.department
}
)
documents.append(doc)
return documents
This connector pattern is how Onyx ingests data from 50+ sources. Background workers poll these connectors on schedules, chunk documents, generate embeddings via the model server, and populate both the vector database and keyword index. The separation of concerns—connector logic, embedding generation, and indexing—means you can swap vector databases (Qdrant, Weaviate, Pinecone) without touching connector code.
The Model Context Protocol integration elevates Onyx from a search tool to an agentic platform. MCP allows LLMs to interact with external tools through a standardized interface. In practice, this means your AI can execute code in sandboxes, crawl websites for current information, query APIs, or trigger workflows—not just retrieve documents. The architecture uses a tool-calling loop: the LLM decides which MCP tools to invoke, Onyx executes them in isolated environments, and results feed back into the context window. This is how you build AI that doesn't just answer questions but performs multi-step research tasks.
The deployment architecture reveals production-readiness considerations. Standard mode runs six distinct services: a Next.js frontend, FastAPI backend, background job workers (Celery), model inference server, Redis for caching and task queues, and MinIO for blob storage of large files. Vector databases run separately. This microservices approach allows horizontal scaling—you can run multiple worker instances to handle ingestion spikes or scale inference servers independently. But it also means operational complexity: you're managing service discovery, inter-service authentication, and distributed tracing across half a dozen containers.
The RBAC implementation deserves attention because most open-source AI tools ignore access control. Onyx models permissions at the document level: users only retrieve chunks from sources they're authorized to access. When you connect Confluence or Google Drive, the connector preserves original permissions. An HR document marked private stays private, even if its embeddings sit in a shared vector database. This requires maintaining a permission graph alongside your vector index—non-trivial engineering that explains why it's an Enterprise Edition feature.
Gotcha
The architectural split between Lite and Standard modes sounds elegant until you realize there's no graceful upgrade path. Lite mode doesn't use a database for persistent storage—it's chat-only. If you start with Lite and want to add document search later, you're essentially redeploying from scratch with the full infrastructure stack. For teams evaluating Onyx, this means you can't truly 'start small and scale up' without migration work. Plan your deployment mode from day one.
Enterprise Edition gating is more aggressive than the README suggests. Features like SSO, RBAC, usage analytics, custom code execution (for PII scrubbing or query filtering), and whitelabeling all require commercial licensing. The open-source version gives you RAG and connectors, but production-critical capabilities for real organizations sit behind a paywall. If your compliance team requires SSO and audit logs—and whose doesn't—you're in a sales conversation, not evaluating open source. The code is available, but the governance features aren't. Budget accordingly, or be prepared to build access control yourself.
Verdict
Use if: You're an organization that needs ChatGPT-grade AI chat over proprietary data with data sovereignty requirements, you have the infrastructure expertise to run multi-container deployments with vector databases and message queues, or you specifically need hybrid search capabilities for research-heavy workloads where pure vector similarity fails. Onyx shines when you need 10+ data source connectors, agentic capabilities with external tool execution, and the flexibility to swap LLM providers without vendor lock-in. Skip if: You want simple LLM chat without document retrieval (Open-WebUI is half the complexity), you're a solo developer or small team without ops resources to manage six+ services, you need enterprise features like SSO/RBAC but can't justify commercial licensing (just use ChatGPT Enterprise), or you're building a custom AI application from code (LangChain gives you more control). Onyx targets the middle ground: teams with proprietary data, infrastructure capacity, and requirements that commercial platforms can't meet.