SRMT: Teaching Robots to Share Their Thoughts Through Memory

Hook

What if your warehouse robots could coordinate perfectly without ever sending a single message to each other? The Shared Recurrent Memory Transformer achieves multi-agent coordination through a counterintuitive approach: agents share thoughts, not commands.

Context

Multi-agent pathfinding is deceptively hard. Put ten robots in a warehouse and ask them to navigate to different destinations, and you'll quickly discover the coordination nightmare: agents block each other, create deadlocks, and waste time waiting. Traditional approaches solve this with explicit communication protocols—agents broadcast their intentions, negotiate right-of-way, and coordinate through message passing.

But communication is expensive and brittle. It requires bandwidth, adds latency, and fails catastrophically when messages get dropped. SRMT takes a radically different approach inspired by how humans navigate crowded spaces: we don't verbally coordinate with strangers on a busy sidewalk, yet we rarely collide. Instead, we maintain a shared mental model of the environment and predict others' intentions. SRMT implements this through a shared memory substrate that all agents can read from and write to, creating implicit coordination without explicit messaging. This research implementation, building on the "Learn to Follow" paper, explores whether attention-based architectures can outperform traditional RNNs when agents share their internal representations.

Technical Insight

The architectural innovation in SRMT lies in how it decouples observation processing from memory sharing. Each agent has its own observation encoder and policy head, but they all read from and write to a common memory tensor. This shared memory acts like a blackboard system where agents implicitly broadcast their intentions and read others' plans.

The core architecture uses a PPO-based reinforcement learning setup with three key components working in concert. First, the attention core (when attn_core=True) replaces traditional RNN cells with multi-head self-attention, allowing agents to selectively focus on relevant parts of the shared memory. Second, the core memory mechanism maintains a fixed-size tensor that persists across timesteps—think of it as a shared working memory that all agents can access. Third, reward shaping strategies guide learning, with five distinct variants ranging from sparse rewards (only at goal) to dense directional rewards that provide fine-grained feedback.

Here's how you'd train agents with the shared memory mechanism:

# Train with attention-based core and shared memory
python train.py --alg_name=ppo \
  --attn_core=True \
  --use_rnn=False \
  --core_memory=True \
  --reward_type=dense \
  --num_agents=8 \
  --grid_size=20 \
  --episode_length=128

The core_memory flag enables the shared substrate. When active, each agent's hidden state gets concatenated into a global memory tensor at every timestep. The attention mechanism then operates over this concatenated representation, letting agents implicitly "see" what others are thinking. This is fundamentally different from value decomposition methods like QMIX or MAPF algorithms like CBS—those coordinate actions or plans, while SRMT coordinates internal representations.
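To make the read/write cycle concrete, here is a minimal sketch in plain Python. It is not the repository's code: the names `shared_memory_step` and `read_shared_memory` are hypothetical, and it assumes a single attention head with no learned projections, so each hidden state serves as query, key, and value at once.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def read_shared_memory(query: Vector, memory: List[Vector]) -> Vector:
    """One agent attends over every memory slot (scaled dot-product attention)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, slot) / scale for slot in memory])
    dim = len(memory[0])
    # Weighted average of the slots: what this agent "reads" from the others
    return [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]


def shared_memory_step(hidden_states: List[Vector]) -> List[Vector]:
    """Write every agent's hidden state into the memory, then let every agent read it."""
    # Write phase (assumption: one memory slot per agent, copied verbatim)
    memory = [list(h) for h in hidden_states]
    # Read phase: each agent attends over all slots, including its own
    return [read_shared_memory(h, memory) for h in hidden_states]


# Two agents with orthogonal hidden states
reads = shared_memory_step([[1.0, 0.0], [0.0, 1.0]])
```

Each agent's read is a blend of all hidden states, weighted toward the slots most similar to its own. No agent addresses a message to another; the information flows only through the pooled tensor.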

The reward shaping variants expose how sensitive multi-agent learning is to the feedback signal. The sparse reward (reward_type=sparse) gives +1 only when an agent reaches its goal, forcing agents to discover coordination through trial and error. Dense rewards (dense) provide incremental feedback based on the change in distance to the goal, which speeds up learning but can create local optima. The directional variant (directional) adds bonuses for moving toward the goal, while moving negative (moving_neg) penalizes standing still. A fifth variant blends several of these signals. The dense case is the easiest to read in isolation:

# Dense reward calculation (simplified from source)
def manhattan_distance(a, b):
    """Grid distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def calculate_dense_reward(agent_pos, goal_pos, prev_distance):
    current_distance = manhattan_distance(agent_pos, goal_pos)
    reward = prev_distance - current_distance  # Progress reward: positive when closer

    if current_distance == 0:
        reward += 1.0  # Goal bonus

    return reward
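For comparison, the other named variants might be sketched side by side as below. This is an illustration, not the repository's implementation: the function name `shaped_reward`, the `moved` flag, and the `step_bonus` and `idle_penalty` magnitudes are all assumptions chosen to match the prose descriptions.

```python
def shaped_reward(reward_type, prev_distance, current_distance, moved,
                  step_bonus=0.1, idle_penalty=0.05):
    """Hypothetical dispatcher over the reward variants described in the text.

    step_bonus and idle_penalty are illustrative values, not taken from the source.
    """
    goal_bonus = 1.0 if current_distance == 0 else 0.0
    progress = prev_distance - current_distance

    if reward_type == "sparse":
        # Feedback only on arrival
        return goal_bonus
    if reward_type == "dense":
        # Change in distance to goal, plus the arrival bonus
        return progress + goal_bonus
    if reward_type == "directional":
        # Fixed bonus whenever the step reduced the distance
        return goal_bonus + (step_bonus if progress > 0 else 0.0)
    if reward_type == "moving_neg":
        # Small penalty for standing still
        return goal_bonus - (0.0 if moved else idle_penalty)
    raise ValueError(f"unknown reward_type: {reward_type}")
```

Putting the variants behind one function makes ablations cheap: the training loop stays fixed and only the `reward_type` string changes between runs.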

The lifelong aspect distinguishes this from episodic pathfinding: when an agent reaches its goal, it immediately receives a new destination rather than waiting for episode termination. This creates a continuous stream of pathfinding problems where agents must maintain coordination indefinitely. The shared memory becomes crucial here—agents can encode not just their current goals but also their trajectory history, letting others predict future positions and avoid conflicts proactively.
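The goal-reassignment loop is simple to sketch. The `Agent` dataclass, `sample_goal`, and `lifelong_update` below are hypothetical names, and uniform sampling of the next goal is an assumption; the repository may pick goals differently (for example, avoiding obstacles).

```python
import random
from dataclasses import dataclass
from typing import Tuple

Cell = Tuple[int, int]


@dataclass
class Agent:
    pos: Cell
    goal: Cell
    goals_reached: int = 0


def sample_goal(rng: random.Random, grid_size: int, exclude: Cell) -> Cell:
    """Draw a random cell different from `exclude` (requires grid_size >= 2)."""
    while True:
        cell = (rng.randrange(grid_size), rng.randrange(grid_size))
        if cell != exclude:
            return cell


def lifelong_update(agent: Agent, rng: random.Random, grid_size: int) -> bool:
    """If the agent is on its goal, hand it a new one instead of ending the episode."""
    if agent.pos != agent.goal:
        return False
    agent.goals_reached += 1
    agent.goal = sample_goal(rng, grid_size, exclude=agent.pos)
    return True


rng = random.Random(0)
arrived = Agent(pos=(2, 3), goal=(2, 3))
en_route = Agent(pos=(0, 0), goal=(5, 5))
arrived_got_new_goal = lifelong_update(arrived, rng, grid_size=20)
en_route_got_new_goal = lifelong_update(en_route, rng, grid_size=20)
```

Because the episode never resets on arrival, a natural metric in this setting is throughput (goals reached per step) rather than success rate.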

What makes the attention core a good fit is how it handles variable numbers of agents. Many multi-agent architectures fix the agent count at design time, but self-attention operates on sets of any length: its learned projections are applied per slot, so the same weights work whether the memory holds 4 entries or 40. The memory tensor grows or shrinks with the number of active agents, and the attention weights are simply recomputed over whatever slots are present. Without positional encodings, self-attention is also indifferent to the order in which agents are listed, a property that recurrent cells consuming inputs sequentially don't share.
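A small experiment shows why agent count does not need to be baked into the weights. In this sketch (the names `attend` and `random_matrix` are hypothetical; the random matrices stand in for learned projections), one fixed set of query, key, and value matrices is reused across memories of different sizes.

```python
import math
import random

DIM = 4  # hidden size per agent (illustrative)


def random_matrix(rng, dim):
    """Stand-in for a learned dim x dim projection."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(state, memory, w_q, w_k, w_v):
    """Single-head attention of one agent state over a memory of any length."""
    query = matvec(w_q, state)
    keys = [matvec(w_k, slot) for slot in memory]
    values = [matvec(w_v, slot) for slot in memory]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(query))]


rng = random.Random(0)
w_q, w_k, w_v = (random_matrix(rng, DIM) for _ in range(3))

for num_agents in (3, 8):
    memory = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(num_agents)]
    out = attend(memory[0], memory, w_q, w_k, w_v)
    print(num_agents, len(out))  # output length is DIM for either agent count
```

The parameter shapes depend only on `DIM`. Reversing the order of `memory` leaves the result unchanged up to floating-point rounding, since the weighted sum does not care about slot order.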

The codebase implements this through a clean separation of concerns: algo/ contains the PPO training logic, core/ houses the attention and memory mechanisms, and env/ provides the grid-world pathfinding environment. The environment uses simple integer coordinates but supports complex scenarios like narrow corridors and random obstacle placement—classic pathfinding nightmares that expose coordination failures.

Gotcha

The biggest limitation isn't technical—it's documentation. The repository is essentially a research artifact frozen in time. There's no explanation of how to adapt the environment to different scenarios, no visualization tools to debug why agents are colliding, and no pre-trained models to analyze successful coordination strategies. You'll spend more time reading source code than documentation.

The grid-world environment is hardcoded in ways that make extension painful. Want to test in continuous spaces? You'll rewrite substantial environment logic. Need to integrate with standard multi-agent benchmarks like SMAC or MPE? Prepare for significant adapter code. The reward functions, while well-implemented for pathfinding, don't generalize to other multi-agent tasks like formation control or resource gathering. The shared memory mechanism is architecturally elegant but computationally expensive: memory size scales linearly with agent count, and the cost of attention over that memory scales quadratically when every agent attends to every slot. Beyond roughly 20-30 agents, expect performance walls that require careful optimization or architectural changes the codebase doesn't support out of the box.
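The scaling argument is easy to check with back-of-the-envelope arithmetic. The helper below is a rough counting sketch, not a profiler: `attention_cost` is a hypothetical name, and it assumes one memory slot per agent, every agent attending to every slot, and a single head.

```python
def attention_cost(num_agents: int, hidden_dim: int) -> dict:
    """Rough per-timestep counts for all-to-all attention over the shared memory."""
    return {
        # One slot of hidden_dim floats per agent: linear in agent count
        "memory_floats": num_agents * hidden_dim,
        # One score per (reader, slot) pair: quadratic in agent count
        "score_entries": num_agents * num_agents,
        # Each score is a hidden_dim-length dot product
        "score_multiply_adds": num_agents * num_agents * hidden_dim,
    }


small = attention_cost(num_agents=8, hidden_dim=64)
large = attention_cost(num_agents=32, hidden_dim=64)
memory_growth = large["memory_floats"] // small["memory_floats"]   # 4
score_growth = large["score_entries"] // small["score_entries"]    # 16
```

Quadrupling the agent count quadruples the memory but multiplies the attention scores by sixteen, which is why attention, not storage, is the first thing to bite.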

Verdict

Use SRMT if you're researching novel multi-agent coordination mechanisms, need a strong baseline for comparing shared-memory approaches against communication-based methods, or want to understand how attention architectures apply to multi-agent RL beyond typical benchmarks. The reward shaping ablations alone provide valuable experimental templates. Skip it if you need production-ready multi-agent pathfinding (look at dedicated MAPF solvers like EECBS instead), want comprehensive documentation and community support (try RLlib or EPyMARL), or need to scale beyond small agent counts in grid worlds. This is a reference implementation for reproducing academic results, not a library for building products. Clone it to learn from the architectural ideas, but expect to significantly modify or reimplement core components for real applications.
