Building AI Agent Memory That Survives the Session: Inside Redis Agent Memory Server

Hook

Your AI agent forgets everything between sessions because you're using the LLM's context window as a database. Redis Agent Memory Server treats memory like the architectural challenge it actually is.

Context

Most AI agents today suffer from goldfish memory syndrome. They're brilliant during a conversation, then completely forget you exist the moment the session ends. Developers typically hack around this by stuffing conversation history into the LLM's context window, which works until you hit token limits, or by dumping everything into a vector database without considering what actually needs to be remembered versus what should be forgotten.

The real problem isn't storage; it's memory management. Human memory doesn't work by recording everything verbatim. We extract meaning, recognize patterns, and file things under useful categories. Redis Agent Memory Server, a project from Redis, is aimed at this architectural gap. It's built specifically for agents that need to maintain context across sessions, extract meaningful information from conversations, and retrieve memories based on semantic similarity rather than keyword matching. With Anthropic's Model Context Protocol (MCP), which Claude Desktop supports, gaining traction and agents becoming more autonomous, the timing makes sense: we need infrastructure that treats agent memory as a first-class architectural concern.

Technical Insight

The architecture splits memory into two tiers that mirror human cognition: working memory for the current session and long-term memory for persistence. Working memory holds raw, unprocessed conversation data with a configurable TTL. Long-term memory stores extracted, structured information—topics, entities, summaries—that survives beyond the session. This separation lets you optimize each layer differently: working memory prioritizes speed and ephemeral storage, while long-term memory focuses on searchability and compression.
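To make the two tiers concrete, here is a minimal in-memory sketch. It is not the server's implementation: the class names, the `ttl_seconds` parameter, and the record fields are all hypothetical, chosen only to show raw messages expiring while extracted records persist.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkingMemory:
    """Raw session messages that expire after ttl_seconds (hypothetical model)."""
    ttl_seconds: float
    _items: list = field(default_factory=list)  # (timestamp, message) pairs

    def add(self, message: dict, now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        self._items.append((stamp, message))

    def live(self, now: Optional[float] = None) -> list:
        current = time.monotonic() if now is None else now
        # Drop anything older than the TTL, then return what is left.
        self._items = [(t, m) for t, m in self._items if current - t < self.ttl_seconds]
        return [m for _, m in self._items]


@dataclass
class LongTermMemory:
    """Structured records that outlive the session (hypothetical model)."""
    records: list = field(default_factory=list)

    def store(self, text: str, topics: list, entities: list) -> None:
        self.records.append({"text": text, "topics": topics, "entities": entities})


def demo_two_tiers() -> tuple:
    working = WorkingMemory(ttl_seconds=60)
    long_term = LongTermMemory()
    working.add({"role": "user", "content": "I prefer PostgreSQL 15"}, now=0.0)
    long_term.store("User prefers PostgreSQL 15", ["databases"], ["PostgreSQL"])
    # At t=120 the raw message has expired, but the extracted record remains.
    return len(working.live(now=120.0)), len(long_term.records)
```

The `now` argument exists only so the expiry can be exercised without sleeping. In the real server the TTL would presumably be enforced by Redis key expiry rather than by filtering a list.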

The extraction pipeline is where things get interesting. When working memory crosses a threshold, extraction strategies determine what moves to long-term storage. The discrete strategy extracts individual facts and entities. The summary strategy condenses conversations into dense paragraphs. The preferences strategy specifically captures user preferences and behavioral patterns. You can also implement custom strategies by extending the base class:

from agent_memory_server import MemoryExtractor, Memory

class CustomExtractor(MemoryExtractor):
    """Custom strategy that stores action items as long-term memories."""

    async def extract(self, messages: list[dict]) -> list[Memory]:
        # Use LiteLLM to call any LLM for extraction
        response = await self.llm.complete(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Extract action items from this conversation."},
                {"role": "user", "content": str(messages)}
            ]
        )

        # _parse_response and _classify_priority are helpers you write yourself
        # (not shown): one splits the LLM output into items, the other labels each item.
        return [
            Memory(
                content=item,
                metadata={"type": "action_item", "priority": self._classify_priority(item)},
                embedding=await self.embed(item)
            )
            for item in self._parse_response(response)
        ]

The dual-interface design is pragmatic. The REST API gives you standard HTTP endpoints for any client, while MCP integration makes it a native memory provider for Claude Desktop. This means you can use the same memory backend whether you're building a custom web app or extending Claude with persistent memory. The MCP tools expose store_memory, search_memory, and get_session operations that Claude can invoke directly during conversations.

LiteLLM integration deserves attention because it solves the provider lock-in problem elegantly. The server doesn't care whether you're using OpenAI, Anthropic, Bedrock, Ollama, or any of 100+ supported providers. You configure credentials via environment variables and switch models by changing a string:

# Configuration in .env
LLM_MODEL=anthropic/claude-3-5-sonnet-20241022
LLM_EMBEDDING_MODEL=openai/text-embedding-3-small

# Or use local models (uncomment these to replace the values above)
# LLM_MODEL=ollama/llama3.1
# LLM_EMBEDDING_MODEL=ollama/nomic-embed-text
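The "change a string" idea is easy to see in code. The sketch below reads the two variables named above and splits each `provider/model` string into its parts. The helper names `split_model` and `load_model_config` are hypothetical and are not part of the server or of LiteLLM.

```python
import os
from typing import Mapping, Optional


def split_model(name: str) -> tuple:
    """Split 'provider/model' into (provider, model); a bare name has no provider."""
    provider, sep, model = name.partition("/")
    return (provider, model) if sep else ("", name)


def load_model_config(env: Optional[Mapping] = None) -> dict:
    """Read the generation and embedding model strings from the environment."""
    source = os.environ if env is None else env
    # A missing variable raises KeyError, which surfaces misconfiguration early.
    return {
        "generation": split_model(source["LLM_MODEL"]),
        "embedding": split_model(source["LLM_EMBEDDING_MODEL"]),
    }
```

Nothing else in the calling code needs to know which provider is behind the string, which is the point of routing every call through one library. Note that the generation and embedding models can come from different providers, as in the first configuration above.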

Under the hood, the system uses a backend factory pattern that abstracts vector database operations. Redis is the default, accessed through RedisVL, a Python client for Redis's search and query engine (RediSearch), which handles both vector similarity search and full-text indexing. This hybrid approach lets you run semantic queries ("find memories about database optimization"), keyword searches ("exact match: PostgreSQL 15"), or combine both with metadata filters ("semantic search in memories from last week with type=technical_decision").
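The three query styles can be illustrated with a brute-force, in-memory sketch. This is not RedisVL or the server's query code: the `hybrid_search` function, the record layout, and the toy two-dimensional embeddings are all assumptions made for the example. Filters narrow the candidates first, then cosine similarity ranks what remains.

```python
import math
from datetime import datetime, timedelta


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(memories, query_vec, keyword=None, filters=None, since=None, top_k=3):
    """Filter by metadata, date, and keyword, then rank by cosine similarity."""
    scored = []
    for mem in memories:
        if filters and any(mem["metadata"].get(k) != v for k, v in filters.items()):
            continue
        if since and mem["created_at"] < since:
            continue
        if keyword and keyword.lower() not in mem["text"].lower():
            continue
        scored.append((cosine(query_vec, mem["embedding"]), mem["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Hypothetical sample data with hand-made embeddings.
NOW = datetime(2025, 1, 15)
MEMORIES = [
    {"text": "Chose PostgreSQL 15 for the orders service", "embedding": [0.9, 0.1],
     "metadata": {"type": "technical_decision"}, "created_at": NOW - timedelta(days=2)},
    {"text": "Added an index to speed up slow queries", "embedding": [0.8, 0.3],
     "metadata": {"type": "technical_decision"}, "created_at": NOW - timedelta(days=30)},
    {"text": "User likes dark mode", "embedding": [0.1, 0.9],
     "metadata": {"type": "preference"}, "created_at": NOW - timedelta(days=1)},
]
```

Calling `hybrid_search(MEMORIES, [1.0, 0.0], filters={"type": "technical_decision"}, since=NOW - timedelta(days=7))` corresponds to the "last week, type=technical_decision" query in the text. A real index would avoid the linear scan, but the order of operations is the same idea.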

The deployment model splits between development and production cleanly. In development, use the asyncio backend that runs everything in a single process—API server, extraction workers, and memory management all together. For production, switch to the Docket backend, which pushes extraction tasks to a distributed queue processed by separate worker containers. This separation means your API server stays responsive even when processing expensive LLM extraction calls:

# docker-compose.yml production setup
services:
  api:
    image: redis/agent-memory-server
    environment:
      - BACKEND=docket  # Use distributed queue
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
      - worker
  
  worker:
    image: redis/agent-memory-server
    command: python -m agent_memory_server.worker
    environment:
      - REDIS_URL=redis://redis:6379
      - LLM_MODEL=gpt-4-turbo
    deploy:
      replicas: 3  # Scale workers independently
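Why a queue keeps the API responsive is easier to see in a small model. The sketch below uses an in-process `asyncio.Queue` as a stand-in for the distributed queue. It does not use Docket's API, and the `extract`, `worker`, and `main` names are hypothetical. The producer side only enqueues, while a pool of workers absorbs the slow calls.

```python
import asyncio


async def extract(job: str) -> str:
    # Stand-in for an expensive LLM extraction call.
    await asyncio.sleep(0.01)
    return f"extracted:{job}"


async def worker(queue: asyncio.Queue, done: list) -> None:
    while True:
        job = await queue.get()
        try:
            done.append(await extract(job))
        finally:
            queue.task_done()


async def main(jobs: list, replicas: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    done: list = []
    workers = [asyncio.create_task(worker(queue, done)) for _ in range(replicas)]
    # The "API" side only enqueues, so it never waits on extraction.
    for job in jobs:
        queue.put_nowait(job)
    await queue.join()  # wait for the workers to drain the queue
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(done)
```

Running `asyncio.run(main(["a", "b", "c", "d"]))` processes four jobs with three workers. The `replicas` argument plays the same role as the `replicas: 3` setting in the compose file, except that real worker containers scale across processes and machines rather than within one event loop.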

The memory lifecycle is event-driven. When you store a message via the API, it lands in working memory immediately. A background task monitors working memory size. When it exceeds your configured threshold, the extraction strategy kicks off asynchronously. Extracted memories get embeddings generated, then stored in long-term memory with full-text indexes. The original working memory either gets pruned or archived based on your retention policy. Query time combines vector similarity search with keyword matching and metadata filtering to return the most relevant memories ranked by a configurable score.
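The store, check threshold, extract, prune sequence can be sketched as a small class. Several simplifications are assumptions of this example rather than facts about the server: the threshold counts messages instead of tokens, extraction runs synchronously instead of in a background task, and the `MemoryLifecycle` class and `keep_last` parameter are hypothetical.

```python
from typing import Callable


class MemoryLifecycle:
    """Toy model of the store -> threshold -> extract -> prune flow."""

    def __init__(self, threshold: int, extractor: Callable, keep_last: int = 2):
        self.threshold = threshold
        self.extractor = extractor  # takes a list of messages, returns a list of memories
        self.keep_last = keep_last
        self.working: list = []
        self.long_term: list = []

    def store(self, message: str) -> bool:
        """Append to working memory; return True if extraction ran."""
        self.working.append(message)
        if len(self.working) <= self.threshold:
            return False
        # Extract from everything except the most recent messages, then prune.
        cut = len(self.working) - self.keep_last
        to_extract, self.working = self.working[:cut], self.working[cut:]
        self.long_term.extend(self.extractor(to_extract))
        return True
```

With `threshold=3`, the fourth call to `store` triggers extraction, moves the oldest messages into long-term memory as whatever the extractor returns, and leaves the two most recent messages in working memory. Swapping the extractor callable is the toy equivalent of choosing a different extraction strategy.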

Gotcha

Authentication is disabled by default in the documentation examples, which is fine for local development but dangerous if you deploy this anywhere accessible. The DISABLE_AUTH flag exists to simplify getting started, but there's no detailed guidance on production authentication patterns. You'll need to implement your own auth layer or put the server behind an API gateway with proper authentication before exposing it to the internet.
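As one illustration of what "your own auth layer" might start from, here is a minimal bearer-token check. It is a sketch under assumptions: the `is_authorized` function is hypothetical, it takes the headers as a plain dict with an exact `Authorization` key, and a shared static token is far simpler than what a production gateway would use.

```python
import hmac


def is_authorized(headers: dict, expected_token: str) -> bool:
    """Return True only for 'Authorization: Bearer <expected_token>'."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

A check like this would sit in front of every request handler, or in a reverse proxy, and reject anything that fails before it reaches the memory server.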

The extraction quality depends entirely on your LLM choice and prompt engineering. The built-in strategies use reasonable prompts, but they're generic. For domain-specific applications such as medical agents, legal research, or financial analysis, you'll spend significant time tuning extraction prompts and potentially building custom strategies. The framework gives you the plumbing, but memory quality is still your responsibility.

Running extraction on every conversation also has real cost implications at scale. With GPT-4, a lengthy conversation can cost dollars per session. The discrete strategy, which extracts individual facts, can generate 5-10x more LLM calls than the summary strategy. Budget accordingly and monitor your extraction costs, especially during early development when you're experimenting with strategies.
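A back-of-the-envelope estimator helps when comparing strategies. Everything in the sketch below is an input you supply: the function name is hypothetical, and the token counts and per-1k prices in the usage note are placeholders, not quoted rates for any provider.

```python
def extraction_cost(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Estimated USD cost of running `calls` extraction requests."""
    per_call = (
        (input_tokens_per_call / 1000) * usd_per_1k_input
        + (output_tokens_per_call / 1000) * usd_per_1k_output
    )
    return calls * per_call
```

For example, `extraction_cost(1, 4000, 500, 0.03, 0.06)` models a single summary call, and changing the first argument to `8` models a discrete strategy that makes eight calls over the same conversation. Because cost scales linearly with the call count, the 5-10x call multiplier mentioned above becomes a 5-10x cost multiplier when per-call token usage stays the same.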

Verdict

Use if: You're building agents that need memory across multiple sessions, you already run Redis infrastructure (or are willing to), you need Claude Desktop integration via MCP, or you want provider flexibility through LiteLLM. The dual-mode architecture makes it viable from prototype through production, and the pluggable extraction strategies save you from building memory management from scratch.

Skip if: Your agent is single-session only (just use the context window), you're building a simple chatbot without complex memory requirements, you want zero-infrastructure solutions with embedded databases, or the 246-star count makes you nervous about production stability. For simple use cases, LangChain's built-in memory or even a basic vector database will suffice without the operational overhead.