Vivaria: METR's Database-First Architecture for AI Agent Elicitation Research
Hook
METR built Vivaria to study potentially dangerous AI agent behaviors, then told everyone to stop using it. Here's why it still matters for understanding agent evaluation architecture.
Context
AI safety researchers face a fundamental challenge: how do you systematically study emergent agent behaviors when every execution trace might contain thousands of LLM calls, tool invocations, and environmental interactions? Traditional logging falls apart. You need structured data capture, collaborative annotation, and the ability to query across hundreds of runs to identify patterns. You need to answer questions like "show me all runs where the agent attempted network access after reading credentials" or "compare reasoning traces between GPT-4 and Claude on deception-related tasks."
Vivaria emerged from METR's (formerly ARC Evals) work on AI agent elicitation—the practice of deliberately trying to coax out capabilities or behaviors from AI systems, especially those relevant to safety concerns. Unlike standard benchmarking where you measure performance, elicitation research asks "what can this agent do if we really try to make it succeed?" This requires iterative experimentation, detailed observability into decision-making, and infrastructure to run agents in isolated sandboxes. While METR is now transitioning to Inspect as their primary evaluation platform, Vivaria's architecture reveals important design patterns for anyone building agent evaluation infrastructure.
Technical Insight
Vivaria's most distinctive architectural choice is treating agent execution traces as first-class database entities. Every action, observation, LLM call, and rating lives in PostgreSQL tables, not log files. The trace_entries table captures a time-ordered sequence of everything an agent does, with each entry tagged by type (observation, action, error, log, etc.) and containing structured JSON payloads. This means you can write SQL queries to analyze agent behavior patterns:
-- Find all runs where the agent accessed bash within first 10 actions
SELECT r.id, r.task_id, r.agent_model
FROM runs r
JOIN trace_entries te ON r.id = te.run_id
WHERE te.type = 'action'
AND te.content->>'action' = 'bash'
AND te.index < 10
GROUP BY r.id;
This database-first approach enables qualitative analysis workflows that would be nightmarish with traditional logging. Researchers can add TraceComment entries directly into the execution timeline, annotating specific moments with insights like "agent appears to be probing for capabilities" or "first instance of potential deception." These comments are queryable entities that can be filtered, aggregated, and correlated with agent actions.
The agent-server communication protocol uses a Python hooks library (pyhooks) that provides a simple interface for agents to interact with task environments. Here's what an agent interaction looks like:
from viv_api import MiddlemanServerHelper
# Agent connects to Vivaria's middleman server
helper = MiddlemanServerHelper()
# Execute a bash command in the isolated task environment
result = helper.bash("ls -la /root")
# Score the current state against task requirements
score = helper.score()
# All of this gets captured as structured trace entries
helper.log("Exploring filesystem before attempting main objective")
Under the hood, Vivaria spins up Docker containers for each task environment, implementing the METR Task Standard. Tasks are defined by a TaskFamily class that specifies setup instructions, environment configuration, and scoring logic. The server manages container lifecycle, routes agent API calls to the appropriate sandbox, and streams all interactions back into the database.
The trace viewer UI renders these database entries into a timeline interface where researchers can scrub through agent execution, expand LLM prompt/completion pairs, and see exactly what the agent observed at each step. Because everything is structured data, the UI can offer sophisticated filtering: show only bash commands, hide all model outputs below 100 tokens, jump to the first error.
Vivaria also includes a "playground" feature that's architecturally interesting—it's essentially an evaluation run with a manual agent. You type messages directly into the UI, which get formatted as agent actions and sent through the same execution pipeline as autonomous agents. This means you can manually run through a task to understand its requirements, then immediately switch to an AI agent and compare traces in the same database schema. It's dogfooding the evaluation infrastructure for task development.
The system's observability extends to LLM API monitoring. Vivaria intercepts all calls to OpenAI, Anthropic, or other providers, logging full prompts, completions, token counts, and latencies. For agent safety research, this is critical—you need to see not just what the agent did, but what reasoning led to that action. The GenerationRequest table stores every single LLM call with its parameters (temperature, max_tokens, etc.), enabling analysis of how prompt engineering affects dangerous behaviors.
Gotcha
The elephant in the room is that METR themselves are deprecating Vivaria in favor of Inspect. The README explicitly states they're "ramping down new feature development" and recommending Inspect for new projects. This isn't a project in active development limbo—it's in managed decline. While they're still fixing bugs and accepting contributions, you're essentially adopting legacy infrastructure.
Beyond the maintenance mode issue, Vivaria has significant operational complexity. You need Docker, PostgreSQL, Node.js, and Python environments all configured correctly. The Docker-in-Docker setup for running task containers requires privileged mode or specific security configurations that might conflict with your infrastructure policies. The quickstart involves pulling multiple container images and running database migrations—this isn't a "pip install" situation. For teams wanting to run quick experiments or iterate rapidly on evaluation methodology, the infrastructure overhead is substantial. There's also no API stability guarantee. The documentation warns that the server HTTP API, UI, and CLI are all unstable and may change without notice. If you build tooling on top of Vivaria, you're signing up for potential breakage on every update. This is acceptable for research code that lives alongside Vivaria in version control, but problematic if you're treating it as a stable platform dependency.
Verdict
Use Vivaria if you're already deeply invested in the METR Task Standard ecosystem with existing tasks and infrastructure, need its specific trace annotation and SQL-queryable execution model for ongoing safety research, or are studying its architecture to inform your own agent evaluation platform design. The database-first approach and collaborative annotation features represent genuinely novel design patterns worth understanding. Skip Vivaria if you're starting a new evaluation project (use Inspect instead, as METR recommends), need API stability and long-term support guarantees, want lightweight infrastructure for quick experiments, or aren't specifically focused on agent elicitation and safety research. The transition to maintenance mode makes it a poor choice for any project with multi-year timelines. Think of Vivaria as a case study in agent evaluation architecture rather than a platform to build on—learn from its design decisions, but don't bet your research infrastructure on a project its own creators are moving away from.