SRMT: Teaching Robot Swarms to Navigate Using Shared Attention Instead of Radios
Hook
What if dozens of warehouse robots could coordinate without ever sending a single message to each other? SRMT explores whether shared memory and attention mechanisms can enable coordination in multi-agent navigation, offering an alternative to traditional communication protocols.
Context
The classic multi-agent pathfinding problem—getting multiple robots from point A to point B without collisions—has traditionally been solved in one of two ways: classical optimization algorithms that guarantee optimal paths but can’t adapt to dynamic environments, or reinforcement learning approaches in which agents explicitly communicate through message passing. Both have fundamental limitations.
SRMT (Shared Recurrent Memory Transformer) takes a different approach inspired by transformer architectures. Instead of agents broadcasting messages or a central planner computing routes, each agent writes to and reads from a shared memory space using attention mechanisms. This collaborative work by AIRI, DeepPavlov.ai, and the London Institute for Mathematical Sciences targets ‘lifelong’ pathfinding scenarios where agents continuously receive new navigation goals—think warehouse environments where robots perpetually ferry items, not single-shot puzzle games. The core idea is that implicit coordination through shared attention may offer advantages in scalability and flexibility, particularly in dense multi-agent environments.
Technical Insight
At its core, SRMT replaces traditional RNN-based agent controllers with attention-based cores that operate over shared memory. Each agent maintains its own policy network, but instead of isolated recurrent hidden states, all agents read from and write to a common memory bank.
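To make the mechanism concrete, here is a minimal sketch of a single attention read over a shared memory bank. This is an illustration of the general idea, not the repository's actual implementation—the function and parameter names (`read_shared_memory`, `w_q`, `w_k`, `w_v`) are hypothetical, and in SRMT the projections would be learned weights inside each agent's policy network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def read_shared_memory(agent_state, memory, w_q, w_k, w_v):
    """One attention read over the shared memory bank (illustrative).

    agent_state: (d,) hidden state of a single agent
    memory:      (num_slots, d) memory bank written to by all agents
    w_q, w_k, w_v: (d, d) projection matrices (learned in practice)
    """
    q = agent_state @ w_q                     # query from the agent's own state
    k = memory @ w_k                          # keys from memory slots
    v = memory @ w_v                          # values from memory slots
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v                         # context blending all slots

# Toy usage: 4 memory slots, 8-dim states, random (untrained) projections.
rng = np.random.default_rng(0)
d, slots = 8, 4
memory = rng.normal(size=(slots, d))
state = rng.normal(size=d)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
context = read_shared_memory(state, memory, w_q, w_k, w_v)
assert context.shape == (d,)
```

The key property is that each agent's query is computed from its own state, but the keys and values come from memory slots that every agent can write to—so one agent's observations can influence another agent's policy without any explicit message.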
The training setup uses PPO (Proximal Policy Optimization) with grid-based pathfinding environments. What makes SRMT particularly instructive is its comprehensive exploration of reward shaping strategies. The repository provides five distinct reward formulations, each exploring different aspects of multi-agent learning dynamics. The sparse reward function is the simplest, granting rewards only when agents reach goals with no intermediate feedback. You’d train it like this:
python train.py --experiment=sparse_run \
--attn_core=true \
--use_rnn=false \
--core_memory=true \
--const_reward=true \
--intrinsic_target_reward=0 \
--seed=42
Notice the flags: --attn_core=true enables the transformer-style attention mechanism, --use_rnn=false explicitly disables traditional recurrent cores, and --core_memory=true activates the shared memory pool. In contrast, the dense reward function provides continuous feedback based on proximity to goals.
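The sparse and dense formulations can be sketched in a few lines. This is a hedged illustration of the two reward shapes described above, not the repository's code—the bonus magnitude, scale, and use of Manhattan distance are assumptions for a grid world.

```python
def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def sparse_reward(pos, goal, bonus=1.0):
    """Sparse: reward only when the agent sits on its goal cell."""
    return bonus if pos == goal else 0.0

def dense_reward(prev_pos, pos, goal, scale=0.01):
    """Dense: continuous feedback proportional to progress toward the goal."""
    return scale * (manhattan(prev_pos, goal) - manhattan(pos, goal))
```

The contrast is visible immediately: `sparse_reward` gives the learner no gradient until an agent stumbles onto its goal, while `dense_reward` rewards every step that shrinks the distance and penalizes every step that grows it.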
The directional reward variants offer additional options. The standard directional approach (--target_reward=true --positive_reward=true --intrinsic_target_reward=0.005) rewards agents for each move that brings them closer to their goals, providing shaped guidance. The directional negative variant (--target_reward=true --reversed_reward=true) inverts this, penalizing agents for moving away from their targets. The moving negative reward (--any_move_reward=true) penalizes any motion at all, potentially encouraging agents to wait in place when paths are blocked.
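The three directional variants can be sketched as follows. Again, this is an illustrative reading of the flags rather than the repository's implementation: the 0.005 magnitude comes from the --intrinsic_target_reward flag shown above, but the Manhattan-distance formulation and function names are assumptions.

```python
def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def directional_reward(prev_pos, new_pos, goal, step_bonus=0.005):
    """Standard directional: reward steps that reduce distance to the goal."""
    return step_bonus if manhattan(new_pos, goal) < manhattan(prev_pos, goal) else 0.0

def directional_negative_reward(prev_pos, new_pos, goal, penalty=-0.005):
    """Reversed variant: penalize steps that increase distance to the goal."""
    return penalty if manhattan(new_pos, goal) > manhattan(prev_pos, goal) else 0.0

def moving_negative_reward(prev_pos, new_pos, penalty=-0.005):
    """Moving negative: penalize any motion, nudging agents to wait when blocked."""
    return penalty if new_pos != prev_pos else 0.0
```

Note how `moving_negative_reward` ignores the goal entirely: standing still is the only cost-free action, which is exactly the incentive you want when a corridor is temporarily occupied by another agent.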
The attention mechanism operates differently from that of standard language transformers. Rather than attending over a sequence of tokens, each agent’s policy network attends over memory slots that encode spatial and temporal information about the environment and other agents’ states. This shared memory acts as an implicit communication channel: when Agent A updates memory after observing an obstacle, Agent B’s attention mechanism can incorporate that information when planning its next move, even though no direct message was passed.
The codebase structure follows a clean separation between environment logic, model architecture, and training loops. The training_config.py file exposes numerous hyperparameters as command-line arguments, making ablation studies straightforward. Checkpoints are saved per experiment, and the example.py script can generate animations of trained agents navigating environments—useful for debugging unexpected behaviors.
For performance optimization, the README recommends setting environment variables to restrict NumPy CPU threads:
export OMP_NUM_THREADS="1"
export MKL_NUM_THREADS="1"
export OPENBLAS_NUM_THREADS="1"
Gotcha
SRMT is fundamentally a research implementation, and the repository reflects this orientation. Documentation is minimal beyond the basic training commands shown in the README. The README provides no guidance on hyperparameter tuning, convergence expectations, or typical training times—critical information if you’re trying to reproduce results or adapt the approach.
More significantly, there are no pre-trained models included. Every experiment requires training from scratch, which for multi-agent RL can require significant compute time depending on environment complexity and agent counts. The repository also lacks detailed documentation of the environments or datasets used, making it more challenging to quickly evaluate whether the approach suits your needs.
The implementation is designed for grid-based environments with discrete action spaces. The shared memory concept is theoretically general, but adapting the actual implementation to different problem domains will require code modifications. The repository acknowledges inspiration from the ‘Learn to Follow’ repository but provides limited architectural documentation beyond what’s in the associated paper.
Evaluation is done via the eval.py script, but the README provides no details about what metrics are computed or how to interpret results.
Verdict
Use SRMT if you’re a researcher exploring coordination mechanisms for multi-agent systems and want a concrete implementation of shared memory attention that goes beyond toy examples. The five reward shaping variants provide an excellent case study in how reward design affects multi-agent learning, making this valuable educational material even if you don’t adopt the exact architecture. It’s particularly relevant if you’re working on multi-agent pathfinding research, robotics coordination, or game AI where implicit coordination through shared memory might offer interesting properties.
Skip it if you need a production-ready pathfinding solution with extensive documentation and pre-trained models. Also skip if your timeline requires quick prototyping—the minimal documentation means you’ll need to invest time reading the code and associated paper to understand architectural details. The implementation is specifically designed for grid-based discrete environments, so applications in continuous control or 3D spaces will require substantial modifications. For production multi-agent navigation or rapid prototyping, more established frameworks with comprehensive tooling and documentation will likely serve you better, even if they take different architectural approaches.