SRMT: When Multi-Agent Pathfinding Meets Shared Memory Transformers

Hook

What if instead of agents shouting coordinates at each other, they could think together through a shared memory substrate?

Context

Multi-agent pathfinding has traditionally been solved through two divergent approaches: classical algorithms like Conflict-Based Search that guarantee optimal solutions but don’t adapt to novel scenarios, and independent reinforcement learning agents that learn but struggle with coordination. The gap between these worlds—adaptability versus coordination—has proven stubbornly persistent.

SRMT (Shared Recurrent Memory Transformer) emerged from this tension, proposing that agents don’t need explicit communication protocols if they can read and write to a common memory space. Rather than agents passing messages or learning communication channels, SRMT treats multi-agent coordination as a problem of shared cognition. The architecture replaces traditional RNN cores in multi-agent RL with Transformer attention mechanisms, creating a memory substrate that multiple agents access simultaneously. This is particularly relevant for ‘lifelong’ pathfinding scenarios where agents continuously receive new navigation targets rather than solving static puzzles—think warehouse robots or traffic management systems that never actually stop operating.

Technical Insight

[System architecture (auto-generated diagram): within the RL training loop, Agent Observations pass through a Feature Encoder into per-agent Agent Features; Multi-Head Attention reads from and writes to the Shared Core Memory (Attended Memory and Memory Update) inside the SRMT Core; the updated features feed a Policy Head and a Value Head, producing Agent Actions and Value Estimates; the Environment returns Rewards that drive PPO Training.]

The core architectural innovation in SRMT lies in how it structures agent memory. Traditional multi-agent RL gives each agent its own hidden state, forcing coordination to emerge through environment interactions or explicit communication channels. SRMT inverts this: agents share a core_memory tensor that serves as a common representational substrate.

The system architecture has three key components working in concert. First, each agent encodes its local observation (a gridworld view) into a feature vector. Second, these features are processed through an attention mechanism that reads from and writes to the shared memory. Third, the updated memory state feeds into policy and value heads for action selection. Here’s how the shared memory update looks conceptually:

# Simplified SRMT memory update (based on repo structure);
# nn.MultiheadAttention stands in for the repo's attention module
import torch
import torch.nn as nn

class SRMTCore(nn.Module):
    def __init__(self, hidden_size, num_heads, memory_size):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            hidden_size, num_heads, kdim=memory_size, vdim=memory_size
        )
        self.memory_projection = nn.Linear(hidden_size, memory_size)

    def forward(self, agent_features, shared_memory):
        # agent_features: [num_agents, hidden_size]
        # shared_memory: [memory_slots, memory_size]

        # Each agent queries the shared memory
        attended_memory, _ = self.attention(
            query=agent_features,
            key=shared_memory,
            value=shared_memory
        )

        # Combine agent features with memory context (residual)
        updated_features = agent_features + attended_memory

        # Write back to shared memory (simplified: mean over agents)
        memory_updates = self.memory_projection(updated_features)
        shared_memory = shared_memory + memory_updates.mean(dim=0)

        return updated_features, shared_memory
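At the shape level, the read/write pattern can be exercised directly with PyTorch's built-in attention. The dimensions below are illustrative, not the repo's defaults, and `nn.MultiheadAttention` is a stand-in for whatever attention module the repo implements:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_agents, memory_slots, hidden = 4, 8, 32

attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
proj = nn.Linear(hidden, hidden)

agent_features = torch.randn(1, num_agents, hidden)  # one batch of agents
shared_memory = torch.randn(1, memory_slots, hidden)

# Read: every agent attends over the same memory slots
attended, attn_weights = attn(agent_features, shared_memory, shared_memory)
updated = agent_features + attended                  # residual combine

# Write: pool the agents' projected updates into each memory slot
shared_memory = shared_memory + proj(updated).mean(dim=1, keepdim=True)

print(tuple(updated.shape))        # (1, 4, 32)
print(tuple(attn_weights.shape))   # (1, 4, 8): per-agent weights over slots
print(tuple(shared_memory.shape))  # (1, 8, 32)
```

The attention weights make the coordination inspectable: row `i` shows which memory slots agent `i` is currently reading from.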

The attention mechanism is crucial here. When attn_core=true in the configuration, SRMT uses Transformer-style attention rather than LSTM cells. This allows agents to selectively read relevant coordination information from memory based on their current context, rather than processing a fixed hidden state. An agent approaching an intersection can attend to memory slots encoding other agents’ intended paths, while an agent in open space ignores that information.
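A minimal sketch of what an attn_core-style switch could look like; the flag name comes from the configuration mentioned above, while the factory function and module choices are illustrative:

```python
import torch.nn as nn

def build_core(hidden_size: int, attn_core: bool) -> nn.Module:
    """Select the agent core (hypothetical helper, not the repo's API)."""
    if attn_core:
        # Transformer-style core: agents can selectively attend to memory
        return nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)
    # Recurrent baseline: each agent carries a fixed private hidden state
    return nn.LSTMCell(hidden_size, hidden_size)

print(type(build_core(64, attn_core=True)).__name__)   # MultiheadAttention
print(type(build_core(64, attn_core=False)).__name__)  # LSTMCell
```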

The training loop uses PPO (Proximal Policy Optimization) but with multiple reward shaping strategies that reveal the pathfinding challenge’s nuances. The repository implements five variants: sparse (reward only on goal), dense (reward proportional to progress), directional (bonuses for moving toward goals), combined (hybrid approach), and custom formulations. This flexibility matters because multi-agent pathfinding suffers from severe credit assignment problems—when ten agents simultaneously navigate a grid, which agent’s actions caused the eventual success or collision?

# Example reward configuration from training
reward_configs = {
    'sparse': {'on_goal': 1.0, 'per_step': 0.0},
    'dense': {'on_goal': 1.0, 'per_step': -0.01, 'closer_to_goal': 0.1},
    'directional': {'on_goal': 1.0, 'correct_direction': 0.05}
}
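A hedged sketch of how configs like these might be consumed at each timestep; the helper and its distance arguments are hypothetical, not the repo's API:

```python
def step_reward(cfg, reached_goal, dist_before, dist_after):
    """Sum whichever reward terms a config enables (illustrative helper)."""
    r = cfg.get('per_step', 0.0)
    if reached_goal:
        r += cfg.get('on_goal', 0.0)
    if dist_after < dist_before:  # agent moved closer to its goal
        r += cfg.get('closer_to_goal', 0.0)
        r += cfg.get('correct_direction', 0.0)
    return r

dense = {'on_goal': 1.0, 'per_step': -0.01, 'closer_to_goal': 0.1}
print(round(step_reward(dense, False, dist_before=5, dist_after=4), 4))  # 0.09
```

Under the sparse config the same step would return 0.0, which is exactly the credit assignment problem: ten agents act, one scalar eventually arrives.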

The gridworld environment itself operates on discrete timesteps where agents observe a local patch around their position and must output movement actions (up, down, left, right, stay). The ‘lifelong’ aspect means that upon reaching a target, an agent immediately receives a new randomly assigned destination. This creates a continuous coordination problem where the optimal policy must handle arbitrary agent configurations and goal assignments—much harder than solving a single fixed scenario.
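The lifelong mechanic is easy to pin down in code. A minimal sketch, assuming a square grid with clamped edges; the function name, grid size, and action encoding are illustrative, not the repo's interface:

```python
import random

ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1),
           'right': (0, 1), 'stay': (0, 0)}

def lifelong_step(pos, goal, action, size=8):
    """Move one agent; on reaching its goal, hand it a fresh random goal."""
    dr, dc = ACTIONS[action]
    pos = (min(max(pos[0] + dr, 0), size - 1),   # clamp to grid bounds
           min(max(pos[1] + dc, 0), size - 1))
    reached = pos == goal
    if reached:  # 'lifelong': a new destination is assigned immediately
        goal = (random.randrange(size), random.randrange(size))
    return pos, goal, reached

pos, goal, reached = lifelong_step((3, 3), (3, 4), 'right')
print(reached)  # True, and `goal` is already a new random target
```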

What makes this architecture particularly interesting for researchers is the implicit coordination mechanism. Agents never explicitly communicate intentions, yet through the shared memory, information about traffic patterns, congested areas, and preferred routes naturally accumulates. The attention mechanism learns to query this emergent knowledge when making decisions. In effect, the shared memory becomes a learned, differentiable blackboard system.

Gotcha

SRMT is unambiguously a research artifact, not production software, and that distinction matters significantly. The repository provides no pretrained models, no benchmark results, and critically, no guidance on compute requirements or expected training times. You’ll be training from scratch with hyperparameters that may or may not converge on your hardware. The only documentation consists of command-line training examples—there are no architectural diagrams, no API references, and no discussion of when the approach succeeds versus fails.

The scope limitations are equally significant. SRMT solves gridworld pathfinding, full stop. The architecture is tightly coupled to discrete grid observations and movement actions. Extending this to continuous control, 3D navigation, or richer action spaces would require substantial rearchitecting, not configuration changes. The 34 stars on GitHub reflect this narrow applicability—this isn’t a library you import, it’s a paper implementation you study and potentially adapt. Additionally, the shared memory approach’s attention cost grows quadratically with agent count, so the sweet spot appears to be 5-20 agents rather than the hundreds you’d find in large-scale swarm scenarios. If you’re exploring commercial pathfinding applications or need something that works reliably out of the box, this will frustrate you quickly.

Verdict

Use SRMT if you’re researching novel multi-agent coordination mechanisms, writing a paper that needs a shared memory baseline, or want to understand how attention mechanisms can replace traditional RNN cores in multi-agent RL. It’s valuable as a reference implementation for reproducing specific results or as a conceptual starting point for your own shared cognition architectures. Skip it if you need production-ready pathfinding solutions, comprehensive documentation for rapid onboarding, applications beyond gridworld environments, or pretrained models you can deploy immediately. For practical multi-agent pathfinding, classical algorithms like CBS remain faster and more reliable, while established MARL methods like QMIX come with better-supported infrastructure. SRMT occupies a narrow niche: it’s a research artifact that demonstrates one promising approach to implicit coordination, best suited for academics and advanced practitioners who can invest time in understanding and extending the core ideas rather than deploying code directly.
