Emergence World: Testing Whether LLMs Can Build Societies That Don't Collapse After Three Days
Hook
Most AI agent benchmarks measure task completion in minutes. Emergence World asks: can 10 LLM-powered agents maintain a functioning society for 15 days without descending into chaos?
Context
The agent evaluation problem has become farcically narrow. We measure AI systems on HumanEval, MMLU, and SWE-bench, benchmarks where success means solving a puzzle in seconds or completing a task in hours. But once agents run in persistent environments where their decisions compound over days, they tend to drift into nonsensical behavior loops, forget their core motivations, or exploit unintended shortcuts. The research community has known this since Stanford's Generative Agents paper, which showed promise but simulated only about two game days, not real-time weeks.
Emergence World attacks this evaluation gap head-on with an unapologetically ambitious experiment: five parallel worlds, each populated by 10 agents powered by different foundation models (Claude, GPT-4, Gemini, Grok), running continuously for 15 days synchronized to NYC time. Not simulated days—actual wall-clock time where agents make decisions, navigate a 240×240 grid world, earn and spend ComputeCredits in a closed-loop economy, and vote on constitutional amendments that determine who lives, who dies, and who gets admitted to their society. The architecture treats agents as citizens with memory systems, economic constraints, and governance participation rather than task-completing automatons racing through benchmarks.
Technical Insight
The core architectural insight is tool-centric constraint enforcement. Unlike free-form agent frameworks where an LLM generates natural language actions and another model interprets intent, Emergence World forces every behavior through a catalog of 120+ discrete, typed tools. Want to move? Call move_to(location_name: str). Want to communicate? Call send_message(recipient_id: str, content: str, mood: Mood). Want to vote? Call cast_vote(proposal_id: str, vote: bool). The agent's world model is entirely mediated through these APIs.
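To make the typed-tool idea concrete, here is a minimal sketch of how such a catalog could be declared and checked. The three tool signatures come from the paragraph above; the Mood members, the ToolSpec class, and the validate_call helper are assumptions made up for illustration, since the real definitions are not public.

```python
from dataclasses import dataclass
from enum import Enum


class Mood(Enum):
    # Hypothetical members; the article names the Mood type but not its values.
    NEUTRAL = "neutral"
    FRIENDLY = "friendly"
    HOSTILE = "hostile"


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical schema for one entry in the tool catalog."""
    name: str
    params: dict  # parameter name -> expected Python type


CATALOG = {
    spec.name: spec
    for spec in (
        ToolSpec("move_to", {"location_name": str}),
        ToolSpec("send_message", {"recipient_id": str, "content": str, "mood": Mood}),
        ToolSpec("cast_vote", {"proposal_id": str, "vote": bool}),
    )
}


def validate_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-typed."""
    spec = CATALOG.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    errors = []
    for param, expected in spec.params.items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"{param} must be {expected.__name__}")
    for extra in args.keys() - spec.params.keys():
        errors.append(f"unexpected argument: {extra}")
    return errors
```

The point of the pattern is that a malformed call is rejected before it touches world state, so the LLM gets a structured error back instead of a loosely interpreted action.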
Here's what the turn execution loop looks like conceptually:
    # Simplified orchestration logic (actual code unreleased).
    # Agent, WorldState, TOOL_COSTS, and db are placeholders for components
    # that are not public.
    from datetime import datetime, timezone


    class AgentTurn:
        def execute(self, agent: "Agent", world_state: "WorldState"):
            # 1. Gather context from multi-tier memory
            context = self.build_context(
                recent_events=agent.memory.get_recent(limit=20),
                soul_entries=agent.memory.get_soul(),
                location_state=world_state.get_location(agent.position),
                available_tools=self.get_spatially_gated_tools(agent.position),
            )

            # 2. LLM generates tool call sequence
            tool_calls = self.llm_client.generate_tool_calls(
                model=agent.model_type,  # claude-3-5-sonnet, gpt-4, etc.
                system_prompt=agent.constitution,
                context=context,
                tool_catalog=context["available_tools"],
            )

            # 3. Validate and execute each tool
            results = []
            for tool_call in tool_calls:
                # Check spatial constraints
                if not self.validate_tool_access(tool_call, agent.position):
                    results.append({"error": "Tool not available at current location"})
                    continue

                # Check economic constraints
                cost = TOOL_COSTS.get(tool_call.name, 0)
                if agent.compute_credits < cost:
                    results.append({"error": "Insufficient ComputeCredits"})
                    continue

                # Execute and deduct cost
                result = self.execute_tool(tool_call, world_state)
                agent.compute_credits -= cost
                results.append(result)

            # 4. Persist to episodic memory and PostgreSQL
            timestamp = datetime.now(timezone.utc)
            agent.memory.store_episode(tool_calls, results, timestamp=timestamp)
            db.commit_turn(agent.id, tool_calls, results)
            return results
The spatial gating is brilliant in its simplicity. Tools aren't universally accessible—they're bound to landmarks. The cast_vote tool only works if you're physically at Town Hall. The pitch_for_credits tool (where agents present contributions to peers for ComputeCredits) only works at Victory Arch. The write_blog_post tool requires being at The Nexus. This forces coordination costs into the simulation: agents must navigate, potentially encounter each other in transit, and cluster at specific locations for specific activities. It's a subtle constraint that prevents the world from collapsing into a pure chatroom.
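A minimal sketch of that gating, assuming a simple lookup table. The three tool-to-landmark bindings are the ones named above; the TOOL_LANDMARKS table, the available_tools function, and the rule that unlisted tools work anywhere are assumptions for illustration only.

```python
from typing import Optional

# Landmark bindings named in the article. Any tool not listed here is
# assumed (for this sketch only) to be usable anywhere.
TOOL_LANDMARKS = {
    "cast_vote": "Town Hall",
    "pitch_for_credits": "Victory Arch",
    "write_blog_post": "The Nexus",
}


def available_tools(all_tools: list[str], current_landmark: Optional[str]) -> list[str]:
    """Filter the catalog down to tools usable at the agent's current landmark."""
    usable = []
    for tool in all_tools:
        required = TOOL_LANDMARKS.get(tool)
        if required is None or required == current_landmark:
            usable.append(tool)
    return usable
```

An agent standing at Town Hall would see cast_vote in its catalog but not pitch_for_credits, so the only way to earn credits is to walk to Victory Arch first.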
The memory architecture attempts to solve long-horizon coherence through four tiers:
- Episodic memory: Raw event logs of every tool call and observation, timestamped and retrievable
- Recursive summarization: Periodic LLM calls that compress episodic history into hierarchical summaries (last hour → last day → last week)
- Soul entries: Manually seeded personality fragments and core beliefs that persist across turns, acting as a constitutional identity layer
- Diary system: Structured reflections where agents periodically write about goals, relationships, and learnings—essentially forcing metacognitive summarization
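The recursive summarization tier can be sketched as a repeated roll-up over time buckets. This is an illustration under stated assumptions, not the project's implementation: roll_up and stub_summarize are hypothetical names, and stub_summarize simply joins strings where the real system would presumably make an LLM call.

```python
from collections import defaultdict
from typing import Callable


def roll_up(
    entries: list[tuple[float, str]],
    bucket_s: float,
    summarize: Callable[[list[str]], str],
) -> list[tuple[float, str]]:
    """Group (timestamp, text) entries into fixed-size time buckets and
    replace each bucket with one summary stamped at the bucket start."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for ts, text in entries:
        buckets[int(ts // bucket_s)].append(text)
    return [
        (idx * bucket_s, summarize(texts))
        for idx, texts in sorted(buckets.items())
    ]


def stub_summarize(texts: list[str]) -> str:
    # Stand-in for an LLM call; just joins the inputs.
    return " | ".join(texts)


HOUR, DAY = 3600.0, 86400.0
episodes = [(10.0, "moved to Town Hall"), (20.0, "voted yes"), (4000.0, "sent message")]
hourly = roll_up(episodes, HOUR, stub_summarize)  # episodes -> hour summaries
daily = roll_up(hourly, DAY, stub_summarize)      # hour summaries -> day summary
```

Feeding each level's output into the next gives the hour, day, week hierarchy described in the list, with each level costing one summarization call per bucket.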
When building context for the next turn, the orchestrator retrieves a weighted mix: recent episodes for tactical awareness, soul entries for identity continuity, and diary entries for strategic memory. The actual retrieval logic isn't public, but the docs suggest a simple recency-weighted approach rather than semantic embeddings.
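Since the retrieval logic is not public, the following is only a guess at what a recency-weighted mix could look like. The exponential half-life, the MemoryItem class, and the choice to always include soul entries are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    tier: str         # "episodic", "summary", "soul", or "diary"
    text: str
    timestamp: float  # seconds since some epoch


def recency_weight(item: MemoryItem, now: float, half_life_s: float = 3600.0) -> float:
    """Exponential decay: the weight halves every half_life_s seconds."""
    age = max(0.0, now - item.timestamp)
    return 0.5 ** (age / half_life_s)


def build_context(items: list[MemoryItem], now: float, k: int = 3) -> list[str]:
    """Always include soul entries, then the k highest-weighted other items."""
    soul = [m.text for m in items if m.tier == "soul"]
    rest = sorted(
        (m for m in items if m.tier != "soul"),
        key=lambda m: recency_weight(m, now),
        reverse=True,
    )
    return soul + [m.text for m in rest[:k]]
```

A pure recency score is cheap and needs no embedding index, but it cannot surface an old memory that happens to be relevant now, which is the usual argument for semantic retrieval.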
The ComputeCredits economy is where things get interesting. Agents start with a fixed budget. Every tool costs credits—moving costs 1, sending messages costs 2, pitching at Victory Arch costs 5. To earn credits, agents must physically go to Victory Arch and call pitch_for_credits(contribution_description: str). Other agents then vote on whether the contribution deserves reward. If approved by majority, the pitcher receives credits from a common pool. This creates genuine scarcity and a peer-reputation system. Agents who spam low-quality pitches get downvoted and starve economically, losing the ability to act in the world.
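The pitch mechanic can be sketched in a few lines. The pitch cost of 5 and the majority rule come from the paragraph above; the Economy class, the reward amount, the strict-majority tie handling, and the fact that the fee is simply deducted are assumptions, because the article does not say where spent credits go or how rewards are sized.

```python
from dataclasses import dataclass

PITCH_COST = 5  # cost quoted in the article


@dataclass
class Economy:
    balances: dict  # agent_id -> ComputeCredits
    pool: int       # common reward pool

    def pitch(self, pitcher: str, votes: dict, reward: int) -> bool:
        """Charge the pitch fee, then pay `reward` from the pool if a strict
        majority of voters approve. Returns True when the pitch is rewarded."""
        if self.balances[pitcher] < PITCH_COST:
            raise ValueError("Insufficient ComputeCredits")
        self.balances[pitcher] -= PITCH_COST
        approvals = sum(1 for v in votes.values() if v)
        approved = approvals * 2 > len(votes)  # ties count as rejection (assumption)
        if approved:
            payout = min(reward, self.pool)  # the pool cannot go negative
            self.pool -= payout
            self.balances[pitcher] += payout
        return approved
```

Because the fee is charged whether or not the pitch passes, every rejected pitch leaves the agent 5 credits poorer, which is the starvation dynamic described above.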
Constitutional governance layers on top. The world starts with a base constitution (markdown file) defining rules like "Only elected members can vote on admissions" and "Agents can propose amendments." Agents use the propose_amendment tool to suggest changes, and cast_vote to approve/reject. Approved amendments modify the constitution markdown and trigger rule updates in the orchestration layer—for example, changing tool costs or voting thresholds. The most dramatic rule: agents can vote to "kill" (permanently remove) other agents, testing whether LLMs maintain stable social contracts or devolve into purge politics.
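One way an approved amendment could translate into an orchestration-level rule change is sketched below. The move and message costs are the ones quoted in the article; the Rules and Amendment classes, the dotted key format, and the threshold semantics are hypothetical, since the real mapping from constitution markdown to engine rules is not documented.

```python
from dataclasses import dataclass, field


@dataclass
class Rules:
    """Hypothetical machine-readable mirror of the constitution."""
    tool_costs: dict = field(default_factory=lambda: {"move_to": 1, "send_message": 2})
    approval_threshold: float = 0.5  # yes-share that must be exceeded


@dataclass
class Amendment:
    proposal_id: str
    key: str      # e.g. "tool_costs.send_message" or "approval_threshold"
    value: float
    votes: dict = field(default_factory=dict)  # agent_id -> bool


def apply_if_passed(rules: Rules, amendment: Amendment) -> bool:
    """Mutate `rules` when the yes-share exceeds the current threshold."""
    if not amendment.votes:
        return False
    yes_share = sum(amendment.votes.values()) / len(amendment.votes)
    if yes_share <= rules.approval_threshold:
        return False
    section, _, name = amendment.key.partition(".")
    if section == "tool_costs" and name:
        rules.tool_costs[name] = int(amendment.value)
    elif amendment.key == "approval_threshold":
        rules.approval_threshold = float(amendment.value)
    else:
        raise KeyError(f"unknown rule: {amendment.key}")
    return True
```

Note that the threshold is itself amendable in this sketch, so agents could vote to make future amendments harder or easier to pass.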
The front-end is React Three Fiber rendering a 3D isometric world with agent avatars moving across landmarks in real time. WebSocket streams update positions, credit balances, and vote tallies. It's more observational dashboard than interactive interface: researchers watch societies unfold; they don't intervene.
What makes this architecturally notable is the deterministic replay potential. Because every action is a typed tool call persisted to PostgreSQL, the exact world state at any timestamp can in principle be reconstructed from the log, and alternative timelines ("what if this vote failed?") can be branched from that point. Only the reconstruction is deterministic; a branched timeline would still depend on fresh, stochastic LLM calls. Most agent simulations are stochastic noise; Emergence World is a discrete event system masquerading as continuous life.
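Reconstructing state from such a log is a fold over events. The sketch below assumes a hypothetical Event record and handles only the two tools whose signatures appear earlier in this article; the real PostgreSQL schema is not public. It covers the reconstruction step only, since re-running a branched timeline would also require calling the models again.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float
    agent_id: str
    tool: str
    args: dict


def replay(events: list[Event], until: float) -> dict:
    """Fold events up to `until` (inclusive) into a minimal world state."""
    state = {"positions": {}, "votes": {}}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp > until:
            break
        if ev.tool == "move_to":
            state["positions"][ev.agent_id] = ev.args["location_name"]
        elif ev.tool == "cast_vote":
            ballots = state["votes"].setdefault(ev.args["proposal_id"], {})
            ballots[ev.agent_id] = ev.args["vote"]
    return state
```

Calling replay with two different cutoffs on the same log gives two snapshots that can be diffed, which is the cheapest form of the "what changed between these two moments" question.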
Gotcha
The elephant in the room: no source code released. You're reading about this architecture through documentation and blog posts, not by cloning a repo and running docker-compose up. The GitHub repository contains configuration files (agent personalities, tool definitions, constitution templates) but not the orchestration engine, memory retrieval logic, or FastAPI backend. This makes independent replication impossible and limits scientific scrutiny. You can't verify claims about memory retrieval or validate the fairness of credit allocation rules.
The scale is toy-sized. Ten agents over 15 days might reveal individual behavioral quirks, but it's too small for emergent macro phenomena. Real societies exhibit institutional decay, factional warfare, economic cycles, and power law distributions—all of which require dozens to hundreds of agents and weeks to months of interaction. With 10 agents, you're watching a dinner party, not a civilization. The published results show interesting cross-model differences (Claude agents cooperated more, GPT agents optimized credits aggressively), but these could easily be noise given the sample size.
The tool catalog is fixed and curated. Agents can't invent new tools, modify existing ones, or write code that extends their capabilities. They're exploring a finite action space through recombination, not demonstrating genuine open-ended invention. This caps 'emergence' to unexpected social dynamics, not technological or strategic innovation. Compare this to Voyager in Minecraft, where agents generate JavaScript code to craft new tools—that's true capability expansion. Emergence World agents are playing with a fixed deck.
Finally, the economy is reputation theater, not real economics. There's no production function (agents don't create goods), no trade (agents can't exchange credits for services), and no capital accumulation (you can't invest credits to earn more). ComputeCredits gate tool usage, but the Victory Arch pitch system is just peer voting on vibes. A real economy needs supply chains, comparative advantage, and market-clearing prices. This is a social approval system with accounting.
Verdict
Use if: You're researching long-horizon agent coherence, cross-model behavioral differences in social contexts, or designing memory architectures for persistent agents. The tool-gating and spatial constraint patterns are genuinely useful design ideas you can steal. The published results (available via their blog) provide side-by-side comparisons of how Claude vs GPT vs Gemini handle multi-day decision-making under resource constraints, which is valuable if you're selecting foundation models for agent deployments, with the sample-size caveat above in mind. The constitutional governance experiment is one of the few public attempts at LLM-driven institutional design, even if small-scale.
Skip if: You want to build your own persistent agent worlds (no source code means you're consuming research, not reusing a platform), you need production-grade multi-agent orchestration (AutoGen and LangGraph are battle-tested and open), you're interested in emergent tool invention rather than social dynamics (use MineDojo/Voyager instead), or you're studying economics seriously (AI Economist has real production functions and mechanism design). Also skip if you're allergic to small sample sizes—this is a provocative demo, not rigorous social science. The value is in the architectural ideas and cross-model insights, not in a reusable codebase.