Honcho: The Peer Memory Graph That Replaces RAG for Long-Running Agents
Hook
Your chatbot doesn't have amnesia—it has a RAG problem. Every conversation forces the LLM to re-synthesize facts from embedded chunks, burning tokens to re-learn what it already knew yesterday.
Context
Most agent memory systems treat conversations like search problems: embed messages, retrieve relevant chunks at query time, and hope the LLM synthesizes coherent answers. This retrieval-augmented generation (RAG) approach works for single-session Q&A but breaks down in long-running relationships. Your tutoring bot forgets a student's learning style between sessions. Your support agent re-asks questions it asked last week. The fundamental issue is that RAG systems retrieve raw messages and make the LLM reason from scratch every time, rather than incrementally building structured knowledge about users.
Honcho, an open-source memory library from Plastic Labs, takes a different approach: it models memory as a graph of peer relationships and pre-computes structured facts through background reasoning pipelines. Instead of storing messages in a generic vector database and retrieving at query time, Honcho treats humans and AI agents as "peers" who observe each other across "sessions," running async jobs to extract "conclusions," which are structured beliefs stored separately from raw conversation history. The architecture trades freshness after a write (messages trigger background reasoning that takes time to finish) for consistent read performance on high-level queries like "what does Alice prefer?" that no longer require re-reading entire conversation histories. With 4,700+ GitHub stars and native Model Context Protocol (MCP) support for coding agents like Claude Code and Cursor, Honcho targets teams building customer-facing agents where relationships evolve over weeks, not just within a single chat.
Technical Insight
Honcho's core innovation is its peer-centric data model, which treats memory as a graph of observations rather than a document store. The system uses a four-level hierarchy: Workspaces (tenant isolation) contain Peers (humans or AI agents), which participate in Sessions (conversation contexts with many-to-many peer relationships), which contain Messages. The crucial design choice: internally, Honcho maintains Collections of vector-embedded Documents keyed by (observer, observed) peer tuples. This means the system doesn't just store "what was said"—it tracks "what peer X knows about peer Y," enabling cross-peer modeling that handles multi-agent scenarios RAG systems struggle with.
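To make the (observer, observed) keying concrete, here is a minimal in-memory sketch of the hierarchy described above. It is not Honcho's schema or SDK: the class names, the `remember` and `knows_about` helpers, and the use of plain strings as documents are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Message:
    peer_id: str   # which peer said it
    content: str


@dataclass
class Session:
    session_id: str
    peer_ids: set[str] = field(default_factory=set)        # many-to-many with peers
    messages: list[Message] = field(default_factory=list)


@dataclass
class Workspace:
    name: str                                               # tenant boundary
    peers: set[str] = field(default_factory=set)
    sessions: dict[str, Session] = field(default_factory=dict)
    # Hypothetical stand-in for Collections: documents grouped by
    # (observer, observed), i.e. "what the observer knows about the observed".
    collections: dict[tuple[str, str], list[str]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def remember(self, observer: str, observed: str, fact: str) -> None:
        self.collections[(observer, observed)].append(fact)

    def knows_about(self, observer: str, observed: str) -> list[str]:
        # .get avoids creating an empty entry for pairs that were never written
        return list(self.collections.get((observer, observed), []))


ws = Workspace(name="tutoring_app", peers={"alice", "tutor_bot"})
ws.sessions["s1"] = Session("s1", peer_ids={"alice", "tutor_bot"})
ws.sessions["s1"].messages.append(Message("alice", "I learn best with visual examples"))

# The same pair of peers yields two distinct collections, one per direction.
ws.remember("tutor_bot", "alice", "prefers visual examples")
ws.remember("alice", "tutor_bot", "explains with diagrams")
print(ws.knows_about("tutor_bot", "alice"))
```

Because the key is directional, what the tutor believes about Alice never mixes with what Alice believes about the tutor, which is the property the multi-agent claim rests on.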
Here's how you initialize a session and add messages that trigger background reasoning:
from honcho import Honcho

client = Honcho(api_key="your_key")

# Create workspace and peers
workspace_id = client.workspaces.create(name="tutoring_app").id
student = client.peers.create(workspace_id=workspace_id, name="alice")
tutor = client.peers.create(workspace_id=workspace_id, name="tutor_bot")

# Start a session with both peers
session = client.sessions.create(
    workspace_id=workspace_id,
    peer_ids=[student.id, tutor.id]
)

# Add messages - these trigger async reasoning
client.messages.create(
    session_id=session.id,
    content="I learn best with visual examples",
    is_user=True,
    peer_id=student.id  # Message from Alice
)
client.messages.create(
    session_id=session.id,
    content="Got it, I'll use diagrams when explaining",
    is_user=False,
    peer_id=tutor.id  # Tutor's response
)
When you create messages, Honcho doesn't just store them—it queues background jobs that run three reasoning pipelines: deriver (extracts atomic facts from messages), summary (maintains rolling summaries of session themes), and dialectic (reconciles contradictory beliefs about peers). These pipelines write to the Conclusions table, a structured store of extracted facts separate from raw messages. The key architectural tradeoff: you can't immediately query conclusions after writing messages (eventual consistency), but reading peer representations becomes a fast lookup rather than an LLM call over conversation history.
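The eventual-consistency behavior described above can be reproduced with a toy queue. This is a sketch under stated assumptions, not Honcho's worker code: the `deriver` function here is a hypothetical stand-in that appends a string instead of calling an LLM, and the conclusions store is a plain list.

```python
import asyncio


async def deriver(queue: asyncio.Queue, conclusions: list[str]) -> None:
    """Hypothetical background worker: turns queued messages into 'facts'."""
    while True:
        message = await queue.get()
        await asyncio.sleep(0.01)               # pretend the reasoning takes time
        conclusions.append(f"fact: {message}")
        queue.task_done()


async def demo() -> tuple[int, int]:
    queue: asyncio.Queue = asyncio.Queue()
    conclusions: list[str] = []
    worker = asyncio.create_task(deriver(queue, conclusions))

    # The "write" only enqueues work and returns immediately.
    queue.put_nowait("I learn best with visual examples")
    seen_immediately = len(conclusions)         # worker has not run yet

    await queue.join()                          # wait for the backlog to drain
    seen_after_drain = len(conclusions)

    worker.cancel()
    return seen_immediately, seen_after_drain


before, after = asyncio.run(demo())
print(before, after)  # prints "0 1"
```

A read issued right after the write sees nothing; the same read after the queue drains sees the derived fact. That gap is the staleness window the tradeoff refers to.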
Querying memory exposes two surfaces with different latency/intelligence tradeoffs:
# Fast static read: pre-computed peer representation
card = client.peers.get_card(
    workspace_id=workspace_id,
    peer_id=student.id,
    observer_id=tutor.id  # What does tutor know about Alice?
)
print(card.content)  # Returns cached summary of Alice's preferences

# Slower dynamic read: LLM-powered Q&A over conclusions
response = client.peers.chat(
    workspace_id=workspace_id,
    peer_id=student.id,
    observer_id=tutor.id,
    query="What learning styles has Alice mentioned?"
)
print(response.content)  # LLM synthesizes answer from conclusions
The get_card endpoint returns cached peer summaries updated by background jobs—sub-100ms latency but potentially stale. The chat endpoint runs a fresh LLM query over the conclusions table—higher latency but answers nuanced questions the static card can't address. This dual API acknowledges that "memory" isn't one operation: sometimes you want fast cached context (prefilling a chatbot's system prompt), sometimes you want intelligent querying ("has Alice ever mentioned deadlines?").
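One way an application might choose between the two surfaces is sketched below. Everything here is hypothetical: `CachedCard`, `read_memory`, the `max_age_s` threshold, and the `ask` callback are illustration names, and the callback stands in for whatever slow LLM-backed call your client exposes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedCard:
    content: str
    updated_at: float   # seconds since epoch when the card was last rebuilt


def read_memory(
    card: CachedCard,
    ask: Callable[[str], str],
    query: Optional[str] = None,
    max_age_s: float = 300.0,
    now: Optional[float] = None,
) -> tuple[str, str]:
    """Return (source, text), preferring the cached card when it suffices."""
    now = time.time() if now is None else now
    if query is None and now - card.updated_at <= max_age_s:
        return "card", card.content          # fast path: prefill a system prompt
    question = query or "Summarize what is known about this peer."
    return "chat", ask(question)             # slow path: fresh LLM-backed answer


def fake_ask(question: str) -> str:
    # Stand-in for the slow endpoint; a real call would hit an LLM.
    return f"(answer to: {question})"


card = CachedCard("Prefers visual examples", updated_at=1000.0)
print(read_memory(card, fake_ask, now=1060.0))
print(read_memory(card, fake_ask, query="Has Alice mentioned deadlines?", now=1060.0))
```

The rule encoded here is simple: a specific question, or a card older than the threshold, justifies paying for the slower call.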
Honcho's MCP (Model Context Protocol) integration demonstrates clever positioning as infrastructure for coding agents. Instead of building per-editor plugins, Honcho's FastAPI server speaks HTTP-based MCP natively, exposing memory as a "resource" any MCP-compatible tool can access:
# MCP server config (honcho runs this internally)
from mcp.server import MCPServer

server = MCPServer("honcho-memory")


@server.resource("memory://peer/{peer_id}")
async def get_peer_memory(peer_id: str):
    # MCP clients (Claude Code, Cursor) fetch this URL
    card = await honcho.peers.get_card(peer_id=peer_id)
    return {"content": card.content, "mime_type": "text/plain"}
When you connect Claude Code to Honcho via MCP, the editor fetches memory://peer/alice URLs to inject user context into code generation prompts, with no client-side plugin required. This architecture extends to any MCP-compatible tool without per-editor integration work, a significant deployment advantage over memory systems that need a custom wrapper for each framework or editor.
The multi-model orchestration is pragmatic: Honcho uses different LLM providers for different reasoning tasks (Gemini for cheap summarization, Anthropic for complex dialectic inference, OpenAI for embeddings), exposing configurable routing rather than locking into a single provider. The system's configuration allows swapping models per reasoning stage:
# Configurable in deployment (docker-compose environment)
DERIVER_MODEL=gemini-1.5-flash # Cheap fact extraction
DIALECTIC_MODEL=claude-3-5-sonnet # Complex belief reconciliation
EMBEDDING_MODEL=text-embedding-3-small # Vector search
This flexibility matters for cost optimization: the deriver runs on every message (high volume, simple task → cheap model), while dialectic runs periodically to resolve contradictions (low volume, complex reasoning → expensive model). Teams can tune the cost/intelligence tradeoff per pipeline stage rather than being forced into a single provider's pricing.
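The cost argument is easy to check with arithmetic. The sketch below reuses the environment variable names from the config listing above, but how Honcho itself parses them is not shown here, and `model_for` and `monthly_cost` are hypothetical helpers. The call volumes, token counts, and per-token prices are made-up placeholders, not real provider pricing.

```python
import os
from typing import Mapping, Optional

# Defaults mirror the values in the config listing above.
DEFAULTS = {
    "DERIVER_MODEL": "gemini-1.5-flash",
    "DIALECTIC_MODEL": "claude-3-5-sonnet",
    "EMBEDDING_MODEL": "text-embedding-3-small",
}


def model_for(stage: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the model for a stage, letting the environment override defaults."""
    env = os.environ if env is None else env
    key = f"{stage.upper()}_MODEL"
    if key not in DEFAULTS:
        raise KeyError(f"unknown stage: {stage}")
    return env.get(key, DEFAULTS[key])


def monthly_cost(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Placeholder numbers only: high-volume cheap stage vs. low-volume expensive stage.
deriver_cost = monthly_cost(calls=100_000, tokens_per_call=500, usd_per_million_tokens=0.10)
dialectic_cost = monthly_cost(calls=1_000, tokens_per_call=4_000, usd_per_million_tokens=3.00)
# Same deriver volume if it were forced onto the expensive model instead.
deriver_on_expensive = monthly_cost(calls=100_000, tokens_per_call=500, usd_per_million_tokens=3.00)

print(model_for("deriver", env={}), round(deriver_cost, 2))
print(model_for("dialectic", env={}), round(dialectic_cost, 2))
print("deriver on the expensive model:", round(deriver_on_expensive, 2))
```

With these placeholder inputs, running the high-volume stage on the expensive model costs 30 times more than on the cheap one, which is why per-stage routing matters more than the choice of any single provider.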
Gotcha
Honcho's background reasoning introduces eventual consistency that breaks real-time use cases. When you create a message, conclusions don't update immediately—they're queued for async processing. If your application needs sub-second write-to-read latency (a user asks a question, expects their just-sent message to influence the answer immediately), Honcho will return stale results. The SDK provides no built-in polling mechanism for queue status, forcing developers to either accept staleness or implement custom retry logic. This disqualifies Honcho for chatbots where memory must reflect the current turn, favoring systems like Zep that prioritize session-scoped memory with synchronous updates.
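If you accept the custom-retry route, the logic is a generic poll with exponential backoff. The helper below is a sketch that is independent of any Honcho API: `poll_until`, `fetch`, and `is_ready` are hypothetical names, and `fetch` would wrap whichever read call your application uses.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout_s: float = 5.0,
    initial_delay_s: float = 0.05,
    max_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() until is_ready(result) holds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        value = fetch()
        if is_ready(value):
            return value
        if time.monotonic() + delay > deadline:
            return None                      # give up; caller decides how to degrade
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap


# Simulated reads: stale twice, then the background job has caught up.
responses = iter(["", "", "Prefers visual examples"])
result = poll_until(lambda: next(responses), bool, sleep=lambda _: None)
print(result)
```

Returning `None` on timeout, rather than raising, lets the caller fall back to answering from the raw current turn while the conclusions catch up.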
The AGPL-3.0 license creates significant commercial friction. If you modify Honcho and offer it to users over a network, as a SaaS product does, the license requires you to publish the source of your modified version, including any custom reasoning pipelines, access control layers, or integrations you built into it. For proprietary agent platforms, this is often a non-starter. Mem0 (Apache 2.0) and LangMem (MIT) offer comparable agent-memory capabilities with more permissive licensing, sacrificing Honcho's MCP integration and managed service convenience. The lack of built-in access control beyond workspace-level isolation compounds this: multi-tenancy requires external auth and workspace-per-customer provisioning, adding operational overhead compared to vector databases with native namespaces (Pinecone) or RBAC (Weaviate). You'll spend engineering time building tenant isolation that Honcho doesn't provide out of the box.
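Workspace-per-customer provisioning usually comes down to two pieces you own: a deterministic tenant-to-workspace mapping and an authorization check in front of every memory call. The sketch below assumes your own auth layer has already identified the tenant; `workspace_name` and `authorize` are hypothetical helpers, not part of Honcho.

```python
import hashlib


def workspace_name(tenant_id: str, prefix: str = "tenant") -> str:
    """Derive a stable workspace name that does not leak the raw tenant id."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"


def authorize(authenticated_tenant: str, requested_workspace: str) -> None:
    """Reject any request that targets a workspace other than the caller's own."""
    if requested_workspace != workspace_name(authenticated_tenant):
        raise PermissionError(
            f"tenant {authenticated_tenant!r} may not access {requested_workspace!r}"
        )


acme_ws = workspace_name("acme")
authorize("acme", acme_ws)             # passes silently
try:
    authorize("globex", acme_ws)       # cross-tenant access is refused
except PermissionError as err:
    print("denied:", err)
```

Keeping the mapping deterministic means there is no lookup table to keep in sync, though a real deployment would also need to handle workspace creation, deletion, and tenant renames.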
The reasoning pipeline configuration is opaque: you can swap the model behind each stage, but Honcho's deriver, summary, and dialectic stages use hardcoded prompts, and there is no API for custom extractors. If your domain needs specialized fact extraction (medical terminology, legal reasoning), you can't inject custom prompts or chain additional reasoning steps without forking the codebase. This rigidity contrasts with LangChain's memory modules, where you control every summarization and extraction strategy through composable chains. Honcho optimizes for "memory that just works" at the cost of configurability, making it poorly suited for teams with domain-specific reasoning requirements.
Verdict
Use if: You're building customer-facing agents (tutors, support bots, personal assistants) where relationships evolve over weeks and you need structured memory without building reasoning pipelines from scratch. Honcho's peer-centric model and background jobs handle long-conversation memory with far less custom work than Pinecone plus manual summarization, and the MCP integration gives coding agents like Claude Code persistent memory with zero per-editor plugin work. The managed service (with $100 free credits) eliminates pgvector ops overhead, making this the fastest path to production for product engineers who want memory to "just work."
Skip if: You need sub-second write-to-read latency (the async reasoning queue disqualifies real-time retrieval), if AGPL-3.0 conflicts with your SaaS product (you'll need permissively licensed alternatives like Mem0), if you're building single-session chatbots where stateless RAG suffices, or if you need granular control over vector indexing and reranking that Weaviate/Qdrant expose. Also skip if your domain requires custom reasoning logic: Honcho's hardcoded prompts don't support plug-in extractors without forking the repo. This is infrastructure for shipping memory-enabled products quickly, not for ML engineers optimizing retrieval research.