Cortex: Composable Recurrent Architectures for Agent Memory Systems

Hook

While the AI world obsesses over transformers, a quiet reality persists: agents acting in real-time environments can’t afford quadratic attention costs on unbounded sequential contexts—they need stateful memory that updates in constant time.

Context

The transformer revolution solved many sequence modeling problems, but it introduced a fundamental constraint for agent systems: every inference step requires attending to the entire context window. For an agent processing a continuous stream of observations—a robot navigating a warehouse, a trading system monitoring market feeds, or a dialogue system in an hours-long conversation—this quadratic cost becomes prohibitive. You either truncate context aggressively, losing long-term coherence, or watch your inference budget explode.
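The gap is easy to see with back-of-envelope arithmetic. The constants below are illustrative, not measurements of any particular model:

```python
# Back-of-envelope: per-step cost of attending to a growing context versus a
# constant-time recurrent state update. Illustrative FLOP counts only.

def attention_step_cost(t: int, d: int = 512) -> int:
    """Rough FLOPs to attend one new token against t cached tokens."""
    return 2 * t * d  # one query dotted with t keys, then t value rows mixed

def recurrent_step_cost(d: int = 512) -> int:
    """Rough FLOPs for one fixed-size state update (e.g., a GRU-like cell)."""
    return 6 * d * d  # a handful of d x d matmuls, independent of history

# After an hour at 10 observations/sec, an agent has seen 36,000 steps.
# Each new attention step now costs ~37M FLOPs and keeps growing linearly,
# while the recurrent update stays at ~1.6M FLOPs forever.
print(attention_step_cost(36_000))
print(recurrent_step_cost())
```

The per-step attention cost grows linearly with history, so total cost over a stream is quadratic; the recurrent update never changes.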

Recurrent architectures offer constant-time updates by compressing history into a fixed-size state vector, but the ecosystem fragmented. PyTorch provides basic LSTM and GRU cells, but composing them with modern techniques like mixture-of-experts routing, layer normalization variants, or novel memory mechanisms like mLSTM requires reinventing projection layers, state management, and dimension tracking for each experiment. Cortex attempts to solve this architectural plumbing problem by providing a uniform interface across all abstraction levels—from individual recurrent cells to deep stacks with expert routing—so researchers can prototype agent memory systems without rebuilding infrastructure.

Technical Insight

[Figure: System architecture (auto-generated). Cells (LSTM/GRU/xLSTM) are wrapped by Blocks, which add projection layers, normalization + residual connections, and stateful recurrence. Blocks are instantiated per expert inside Columns, where a gating network produces routing weights and mixes expert outputs (MoE). Columns are instantiated per layer in a four-layer Stack — Layer 1: Axon Column, Layer 2: TransformerXL Column, Layer 3: mLSTM Column, Layer 4: sLSTM Column, each with 2 MoE experts — mapping input IDs + state + resets to logits + new state.]

Cortex’s core insight is radical interface uniformity. Every component, whether a single LSTM cell or a 32-layer stack with mixture-of-experts routing, exposes the same signature: accepts (input, state, optional_resets) and returns (output, new_state). This sounds simple, but the implications are profound—you can substitute a single cell with an entire multi-layer stack without changing calling code.
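That contract is easy to state in code. The classes below are illustrative stand-ins for the idea, not Cortex's actual class hierarchy:

```python
# Hedged sketch of the (input, state, resets) -> (output, new_state) contract.
# ToyGRU and ToyStack are invented names for illustration, not Cortex classes.
from typing import Optional
import torch

class ToyGRU:
    """A single cell obeying the uniform interface."""
    def __init__(self, d_model: int):
        self.cell = torch.nn.GRUCell(d_model, d_model)

    def __call__(self, x, state, resets: Optional[torch.Tensor] = None):
        (h,) = state
        if resets is not None:
            h = h * (~resets).to(h.dtype).unsqueeze(-1)  # zero reset rows
        h_new = self.cell(x, h)
        return h_new, (h_new,)

class ToyStack:
    """A deep composition exposing the *same* signature as a single cell,
    so calling code cannot tell them apart."""
    def __init__(self, d_model: int, depth: int):
        self.layers = [ToyGRU(d_model) for _ in range(depth)]

    def __call__(self, x, states, resets=None):
        new_states = []
        for layer, s in zip(self.layers, states):
            x, s = layer(x, s, resets)
            new_states.append(s)
        return x, new_states
```

Because both classes share one signature, a training loop written against `ToyGRU` runs unchanged against `ToyStack` — the substitution property the paragraph above describes.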

The library organizes around four abstractions stacked like Russian dolls. At the bottom, Cells implement stateless recurrent computation (LSTM, GRU, the newer mLSTM and sLSTM variants from xLSTM). One level up, Blocks wrap cells with projection layers, normalization, and residual connections—the architectural scaffolding that cells themselves shouldn’t care about. Columns introduce mixture-of-experts routing, allowing multiple blocks to process each token with learned gating. Finally, Stacks compose multiple columns into deep architectures.

Here’s how you build a heterogeneous recurrent stack mixing different memory mechanisms:

from cortex import Stack

# Build a 4-layer stack: Axon -> TransformerXL -> mLSTM -> sLSTM
stack = Stack(
    d_model=512,
    d_hidden=2048,
    vocab_size=50257,
    layers='AXMS',  # DSL pattern for layer types
    blocks_per_layer=2,  # MoE with 2 experts per position
    num_heads=8,
    device='cuda'
)

# Forward pass looks identical to a single cell
logits, new_state = stack(input_ids, state)

# Reset state for episode boundaries (agent applications)
logits, new_state = stack(input_ids, state, resets=episode_done_mask)

The layers='AXMS' DSL is Cortex’s architectural shorthand: each character specifies an expert type per layer. Under the hood, Stack handles dimension inference top-down—you specify d_hidden once, and projection dimensions cascade through all blocks automatically. This eliminates the manual bookkeeping that plagues hand-composed PyTorch modules, where you track input and output dimensions across every layer boundary.
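To make the DSL concrete, here is a hypothetical sketch of what expanding a pattern like 'AXMS' could look like. The character-to-type mapping follows the four layer types named in the architecture diagram; the parser itself is invented for illustration and is not Cortex's code:

```python
# Hypothetical expansion of a layer-pattern DSL. The letter mapping mirrors
# the four layer types in the architecture figure; the parser is a sketch.
LAYER_TYPES = {
    'A': 'Axon',
    'X': 'TransformerXL',
    'M': 'mLSTM',
    'S': 'sLSTM',
}

def parse_layer_dsl(pattern: str, d_model: int, d_hidden: int):
    """Expand a pattern string into per-layer configs, cascading dimensions
    so the caller specifies d_model and d_hidden exactly once."""
    configs = []
    for char in pattern:
        if char not in LAYER_TYPES:
            raise ValueError(f"unknown layer code: {char!r}")
        configs.append({
            'type': LAYER_TYPES[char],
            'd_in': d_model,     # every layer reads the residual stream...
            'd_hidden': d_hidden,
            'd_out': d_model,    # ...and projects back to it, so layers compose
        })
    return configs

configs = parse_layer_dsl('AXMS', d_model=512, d_hidden=2048)
```

Keeping `d_in == d_out == d_model` for every layer is what makes the cascade trivial: any pattern string yields a stack whose layers are dimensionally compatible by construction.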

The real performance story lives in Cortex’s Triton kernels. Recurrent computations suffer from sequential dependencies that prevent naive parallelization, but fused implementations can eliminate memory round-trips. Cortex ships Triton kernels for LSTM, mLSTM (extended LSTM with exponential gating), and sLSTM (scalar memory variant) that fuse each cell’s computation into a single kernel:

from cortex.nn.cells import mLSTMCell

# Uses Triton kernel on CUDA, PyTorch fallback on CPU
cell = mLSTMCell(d_model=512, device='cuda')

# Single fused operation instead of 10+ separate tensor ops
output, (h_new, c_new, n_new, m_new) = cell(
    x,  # (batch, d_model)
    (h, c, n, m),  # previous state tuple
)

The mLSTM cell maintains four state tensors (h, c, n, m) for its extended gating mechanism. In pure PyTorch, this would require dozens of separate matrix multiplications, pointwise operations, and memory allocations. The Triton kernel fuses these into a single GPU dispatch, reducing memory traffic by roughly 3x according to the library’s benchmarks.
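For intuition about what the kernel fuses, here is an unfused PyTorch sketch of stabilized exponential gating, written in the scalar form from the xLSTM paper for clarity. Cortex's actual state layout and kernel may differ:

```python
# Unfused reference for one step of stabilized exponential gating (after the
# xLSTM paper, scalar form). In eager PyTorch, each line below is a separate
# kernel launch with its own memory round-trip; a fused Triton kernel does
# the whole update in one dispatch. This is a sketch, not Cortex's code.
import torch

def exp_gated_step(z, i_tilde, f_tilde, o, state):
    h, c, n, m = state
    m_new = torch.maximum(f_tilde + m, i_tilde)   # log-space stabilizer
    i_gate = torch.exp(i_tilde - m_new)           # stabilized input gate
    f_gate = torch.exp(f_tilde + m - m_new)       # stabilized forget gate
    c_new = f_gate * c + i_gate * z               # cell state
    n_new = f_gate * n + i_gate                   # normalizer state
    h_new = o * (c_new / n_new)                   # normalized hidden state
    return h_new, (h_new, c_new, n_new, m_new)
```

The max/exp stabilizer keeps the exponential gates in a numerically safe range, which is also why a fused kernel must carry `m` alongside `c` and `n` rather than recomputing it.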

The Column abstraction enables mixture-of-experts routing at the layer level. Instead of a single block processing each token, multiple expert blocks compete with learned routing weights:

from cortex import Column

column = Column(
    d_model=512,
    block_types=['mLSTM', 'sLSTM', 'Axon'],  # Three expert types
    router='TopK',  # Route to top-K experts
    k=2,  # Use best 2 experts per token
    d_hidden=2048,
)

# Router learns which experts handle which tokens
output, new_states = column(x, states)

Cortex stabilizes MoE training with E-axis normalization (normalize across experts before routing) and ReZero initialization (scale expert contributions by learnable scalars starting near zero). These are research-grade features rarely found in general-purpose recurrent libraries.
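ReZero's effect is easiest to see in a sketch. The module below is illustrative, not Cortex's Column internals: each expert's contribution is scaled by a learnable scalar initialized at zero, so the layer starts out as an identity mapping and experts fade in as training progresses.

```python
# Sketch of ReZero-scaled expert mixing (illustrative, not Cortex's code).
import torch

class ReZeroMix(torch.nn.Module):
    def __init__(self, num_experts: int):
        super().__init__()
        # One learnable scalar per expert, initialized at zero (ReZero).
        self.alpha = torch.nn.Parameter(torch.zeros(num_experts))

    def forward(self, x, expert_outputs, routing_weights):
        # expert_outputs: (num_experts, batch, d_model)
        # routing_weights: (batch, num_experts), e.g. softmax of router logits
        scaled = self.alpha.view(-1, 1, 1) * expert_outputs
        mixed = torch.einsum('be,ebd->bd', routing_weights, scaled)
        return x + mixed  # residual path: exactly x at initialization
```

Because the residual path dominates at initialization, gradients flow cleanly from the first step, which is the standard motivation for ReZero in deep or routed architectures.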

The architectural flexibility shines for agent systems with episodic resets. Many RL environments need to clear memory between episodes without reconstructing models:

# resets is a boolean tensor: True = reset state, False = preserve
resets = torch.tensor([False, True, False])  # Reset middle sequence

output, new_state = stack(
    obs_batch,
    state,
    resets=resets  # Automatically handled at every layer
)

Every component respects the reset signal, zeroing state for marked sequences while preserving others—crucial for batched environments where episodes terminate asynchronously.
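The masking itself is a one-liner per state tensor; here is a minimal sketch of the idea, not Cortex's internal implementation:

```python
# Minimal sketch of per-sequence state resets: zero the rows whose reset
# flag is True, preserve the rest. Cortex applies this at every layer;
# this is the single-tensor version for illustration.
import torch

def apply_resets(state: torch.Tensor, resets: torch.Tensor) -> torch.Tensor:
    keep = (~resets).to(state.dtype).unsqueeze(-1)  # (batch, 1) keep-mask
    return state * keep

state = torch.ones(3, 4)
resets = torch.tensor([False, True, False])
state = apply_resets(state, resets)  # row 1 zeroed, rows 0 and 2 untouched
```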

Gotcha

Cortex’s biggest liability is its maturity—or lack thereof. With four GitHub stars and truncated documentation (the README literally cuts off mid-example at the 8000-character mark), you’re adopting a library with near-zero production validation. Expect breaking API changes, incomplete edge case handling, and the very real possibility of debugging library internals when things break. There’s no community to ask for help, no Stack Overflow answers, no battle-tested deployment patterns.

The Triton kernel acceleration, while impressive on paper, locks you into NVIDIA CUDA. The library falls back to PyTorch implementations on CPU, but performance characteristics change dramatically—what runs fast in development on your RTX 4090 might crawl on an AWS CPU instance during integration testing. Metal (Apple Silicon) and ROCm (AMD) support is unclear from the documentation. More fundamentally, Triton kernels add a compilation step and another dependency that can break in opaque ways when CUDA versions mismatch or driver updates introduce subtle numerical differences. The library claims numerical parity with PyTorch reference implementations, but verifying this across all edge cases (varying batch sizes, sequence lengths, dtype combinations) requires trust in a four-star repository.

Verdict

Use Cortex if you’re in the research phase of building agent memory systems and need to rapidly prototype combinations of modern recurrent mechanisms—mixing mLSTM’s exponential gating with mixture-of-experts routing, or comparing different memory cell architectures without rewriting projection logic each time. The compositional design pays off when you’re exploring architectural spaces rather than optimizing a known solution. It’s ideal for academic projects, internal R&D experiments, or proof-of-concept agent prototypes where cutting-edge recurrent techniques matter more than ecosystem maturity.

Skip it for any production deployment where reliability, maintainability, and community support matter. The low adoption means you’re on your own when things break, and the incomplete documentation will cost you hours reconstructing intended usage from source code. Also skip if you need CPU-only or non-CUDA deployment—the performance story depends entirely on those Triton kernels.

For production agent systems, stick with battle-tested options like Hugging Face Transformers (even if you manually stack LSTMs) or proven state-space models like Mamba that have active communities and deployment track records.
