CoGames: Building and Benchmarking Cooperative AI Agents for the Alignment League

Hook

While most AI benchmarks measure raw performance, the Alignment League Benchmark asks a harder question: can your agent cooperate with strangers it’s never seen before?

Context

Reinforcement learning research has produced agents that master Chess, Go, and StarCraft. But these agents typically play in isolation or against copies of themselves. The real world demands something messier: coordination with unknown partners who may have different training, different objectives, and different behavioral patterns. This is the alignment problem in miniature—can AI systems work productively with entities they weren’t explicitly trained alongside?

CoGames is the game environment framework for the Alignment League Benchmark (ALB), a competitive leaderboard specifically designed to measure cross-play performance. Instead of evaluating agents solely against their training partners, ALB tournaments pit your submitted policy against agents from other researchers, testing generalization to novel coordination partners. The framework provides three core components: a game engine for multi-agent scenarios, training infrastructure, and CLI tooling for authentication and leaderboard submission. Currently, it ships with one game—Cogs vs Clips, a cooperative territory-control game—though the architecture is designed to support additional games as the benchmark evolves.

Technical Insight

CoGames follows a clean separation between game mechanics, agent policies, and tournament infrastructure. The core abstraction is the policy interface, which your agent implements to participate in games. Policies receive observations and return actions.
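As a purely illustrative sketch of that contract, here is a minimal policy in Python. The class, method, and action names below are assumptions for illustration, not CoGames' actual API; only the observation-in, action-out shape comes from the framework's description.

```python
# Hypothetical sketch of a policy: receives an observation, returns an action.
# All names here are illustrative, not CoGames' real interface.
from dataclasses import dataclass
import random

@dataclass
class Observation:
    """Minimal stand-in for a per-agent observation."""
    agent_id: int
    visible_junctions: list

class RandomPolicy:
    """A trivially simple policy: pick a random action each step."""

    ACTIONS = ["move_north", "move_south", "move_east", "move_west", "capture"]

    def act(self, obs: Observation) -> str:
        # A real policy would condition on the observation (learned
        # network, heuristic, etc.); sampling uniformly just shows
        # the shape of the interface.
        return random.choice(self.ACTIONS)

policy = RandomPolicy()
action = policy.act(Observation(agent_id=0, visible_junctions=[]))
```

Whatever the concrete signatures turn out to be, this observation-to-action mapping is the unit the tournament system evaluates.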

The framework includes a CLI tool (cogames) that handles the entire competition workflow. After installation via pip, you authenticate with the Softmax platform, develop your policy locally, and submit it for evaluation. The README provides clear installation paths for both standard pip and uv (the fast Python package installer), plus Docker support for containerized development. Here’s the basic installation:

pip install cogames

For researchers using uv for faster dependency resolution:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment
uv venv .venv
source .venv/bin/activate

# Install cogames
uv pip install cogames

The framework’s architecture reflects its dual purpose as both a research tool and a competition platform. Locally, you can iterate on agent policies using standard RL training loops. When you’re ready to benchmark your agent’s cross-play capabilities, the CLI tooling abstracts away the submission mechanics—authentication, policy packaging, and leaderboard integration happen through simple command-line operations.
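The local iteration loop can be sketched as follows. Everything here—the toy environment, its reset/step signature, and the lambda policies—is a hypothetical stand-in to show the shape of local development; CoGames' real environments and training utilities will differ.

```python
import random

class ToyEnv:
    """Hypothetical two-agent environment with a fixed horizon.

    Rewards the team when agents pick *different* actions each step,
    a crude proxy for role specialization in a cooperative game.
    """
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return {0: "obs", 1: "obs"}  # one observation per agent

    def step(self, actions):
        self.t += 1
        reward = 1.0 if len(set(actions.values())) > 1 else 0.0
        done = self.t >= self.horizon
        return {0: "obs", 1: "obs"}, reward, done

def evaluate(policies, env, episodes=5):
    """Average episode return for a team of per-agent policies."""
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            actions = {i: p(obs[i]) for i, p in policies.items()}
            obs, reward, done = env.step(actions)
            total += reward
    return total / episodes

# One agent always defends, the other improvises -- cross-play in miniature.
team = {0: lambda obs: random.choice(["gather", "defend"]),
        1: lambda obs: "defend"}
score = evaluate(team, ToyEnv(), episodes=3)
```

The point of the loop is that local evaluation and tournament submission share the same policy object, so whatever you iterate on locally is what gets packaged and shipped to the leaderboard.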

Cogs vs Clips, the flagship game, demonstrates the framework’s focus on coordination challenges. According to the README, it’s “a cooperative territory-control game” where “teams of AI agents (‘Cogs’) work together to capture and defend junctions against automated opponents (‘Clips’)” through resource gathering and role acquisition. The game mechanics require coordination rather than just individual skill, making it a test of whether agents can adapt to unknown partners.

The tournament system is where CoGames differentiates itself from standard multi-agent RL libraries. Rather than just providing environments, it creates a competitive evaluation context. Your submitted policy gets matched against other researchers’ agents in tournament play, generating rankings that measure cross-play performance. This shifts the optimization target: you’re not just training an agent that plays well with its training partners, but one that generalizes to coordination with arbitrary policies.
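To make that shift concrete, cross-play evaluation can be sketched as a round-robin over submitted policies, ranking each one by its average return across all partners. This illustrates the idea only; ALB's actual matchmaking and scoring are internal to the platform.

```python
# Sketch of cross-play ranking: pair every policy with every other
# policy and rank by mean score across partners. Illustrative only.
from itertools import combinations
from collections import defaultdict

def crossplay_rankings(policies, play_match):
    """Rank policies by mean score over all pairings with other policies."""
    scores = defaultdict(list)
    for a, b in combinations(policies, 2):
        score_a, score_b = play_match(policies[a], policies[b])
        scores[a].append(score_a)
        scores[b].append(score_b)
    means = {name: sum(s) / len(s) for name, s in scores.items()}
    return sorted(means, key=means.get, reverse=True)

# Toy match: each "policy" is a number, and pairings score higher
# when the two behaviors are closer (i.e. easier to coordinate with).
def toy_match(pa, pb):
    shared = 1.0 - abs(pa - pb)
    return shared, shared

ranking = crossplay_rankings({"a": 0.9, "b": 0.5, "c": 0.1}, toy_match)
# "b" ranks first: it coordinates best on average with both extremes.
```

Under this scoring, a policy that overfits to one partner loses to one that is merely decent with everyone, which is exactly the incentive the benchmark is designed to create.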

Gotcha

The most significant limitation is scope: CoGames is described as “a collection of multi-agent cooperative and competitive environments,” but it currently ships with exactly one game, Cogs vs Clips. If you’re looking for diverse multi-agent scenarios to test different coordination patterns, you’ll find the environment catalog severely limited. The README states this explicitly: “There’s one ALB game right now: Cogs vs Clips.”

The framework is also tightly coupled to the Softmax Alignment League Benchmark platform. While this integration enables the tournament and leaderboard features, it means you can’t easily use CoGames as a standalone multi-agent RL library. If you want to run large-scale experiments on your own infrastructure without the competition context, or if you need offline evaluation without submitting to a third-party platform, you’ll encounter friction. The 31 GitHub stars suggest this is a young project with limited community adoption—expect sparse external resources and fewer third-party tutorials compared to established alternatives. The documentation appears primarily in the README and linked tutorials rather than a comprehensive external docs site, which may slow onboarding for complex use cases.

Verdict

Use CoGames if you’re specifically participating in the Alignment League Benchmark competition or researching cross-play generalization in cooperative AI. The framework delivers exactly what it promises: a frictionless path from local development to competitive evaluation with integrated leaderboard submission. The focus on coordination with unknown partners addresses a genuine gap in standard RL benchmarks, and the tournament structure creates interesting research incentives around generalization rather than overfitting to training partners.

Skip it if you need diverse multi-agent environments for general research, require a mature ecosystem with extensive community resources and documentation, or want flexibility to run experiments independent of a competition platform. This is a purpose-built tool for a specific benchmark, not a general-purpose multi-agent RL library. Evaluate whether the Alignment League’s research questions align with your goals before investing in the ecosystem.
