
Vivaria: Inside METR's Agent Evaluation Platform Before the Migration to Inspect AI

Hook

METR built one of the most sophisticated agent evaluation platforms in the AI safety space, then publicly announced they’re abandoning it. Here’s what made Vivaria special—and why understanding it still matters.

Context

As AI agents grow more capable, evaluating their behavior becomes exponentially harder. Unlike traditional software testing where you verify deterministic outputs, agent evaluation requires understanding decision-making processes, tracking multi-step reasoning chains, and identifying failure modes that might only appear in complex scenarios. You need to capture every LLM call, every action taken, every observation received—then make that haystack of data actually analyzable.

Vivaria emerged from METR’s (formerly ARC Evals) work on AI safety evaluations, where researchers needed to study not just whether agents could complete tasks, but how they approached them. Could an agent autonomously replicate itself? Could it deceive researchers? These questions require more than pass/fail metrics—they demand deep visibility into agent cognition. Vivaria was built to provide that visibility through containerized task environments, comprehensive trace capture, and collaborative annotation workflows. The platform represents a complete vertical solution: from defining tasks to running agents to analyzing their behavior with a research team.

Technical Insight

System architecture (summarized from the auto-generated diagram): a React web UI (for viewing traces and annotating) and a Python CLI (for spawning tasks and runs) talk to a Node.js server over HTTP. The server logs all actions to a PostgreSQL database, manages the lifecycle of Docker task containers through the Task Standard interface, and routes agent LLM calls to external model APIs, returning responses back down the chain. Agent code inside each container uses the pyhooks library to issue bash/read_file/generate requests and receive results.

Vivaria’s architecture centers on the METR Task Standard, a Docker-based abstraction that treats every evaluation as an isolated, reproducible environment. Each task is a containerized workspace with a standardized interface—agents interact through a Python hooks library (pyhooks) that communicates with the central server over HTTP.

Here is a simplified sketch of a task definition and agent loop (illustrative of the pyhooks style, not a verbatim API):

# Task definition: the Task Standard expects instructions and a scorer
class MyTask:
    def start(self) -> str:
        return "You are a research assistant. Complete the following task: ..."

    def score(self, submission: str) -> float:
        # Custom scoring logic; verify_solution is task-specific
        return 1.0 if self.verify_solution(submission) else 0.0

# Agent code using pyhooks
from pyhooks import Agent

agent = Agent()
initial_prompt = agent.get_task_instructions()

# Agent makes LLM calls - all intercepted and logged by the server
response = agent.generate(prompt=initial_prompt)

# Agent can take actions in the environment
agent.bash("ls -la")
file_contents = agent.read_file("data.txt")

# Submit final answer
agent.submit(response)

The pyhooks library is deceptively simple but architecturally significant. Every method call—bash(), read_file(), generate()—hits the Vivaria server, which logs it to PostgreSQL before executing. This creates a complete audit trail without agents needing evaluation-specific instrumentation. The server manages Docker containers for isolation, routes LLM requests through configured providers (OpenAI, Anthropic, etc.), and captures token-level details of every API call.
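The log-before-execute pattern at the heart of this design can be sketched in a few lines of Python. Everything here (the `TraceLogger` class, `logged_action`, and the fake `bash` action) is an invented illustration, not the actual pyhooks implementation, which round-trips each call through the Vivaria server:

```python
import time
from typing import Any, Callable

class TraceLogger:
    """Toy stand-in for the server-side trace log (illustrative only)."""

    def __init__(self) -> None:
        self.entries: list[dict[str, Any]] = []

    def record(self, action_type: str, payload: Any) -> None:
        self.entries.append({
            "type": action_type,
            "input": payload,
            "timestamp": time.time(),
        })

def logged_action(logger: TraceLogger, action_type: str,
                  fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap an agent action so the invocation is logged before it executes."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        logger.record(action_type, {"args": args, "kwargs": kwargs})
        return fn(*args, **kwargs)
    return wrapper

# A fake 'bash' action: in Vivaria this would execute inside the container.
logger = TraceLogger()
bash = logged_action(logger, "bash", lambda cmd: f"ran: {cmd}")
result = bash("ls -la")      # the action still behaves normally...
print(result)
print(len(logger.entries))   # ...but its invocation was captured first
```

The key property is that the agent code calls `bash(...)` as if it were a plain function; the audit trail is a side effect of the plumbing, not something the agent opts into.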

On the backend, Vivaria uses a TypeScript/Node.js server with a PostgreSQL schema designed specifically for trace storage. The database models runs as first-class entities with relationships to task environments, agent containers, and execution traces. Each trace entry captures timestamps, token counts, model parameters, prompts, and completions. This structured approach enables queries like “show me all runs where the agent used more than 10,000 tokens” or “find instances where GPT-4 refused a request but Claude continued.”
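The kind of query that paragraph describes is easy to illustrate with a toy schema. This sketch uses an in-memory SQLite database with invented table and column names; Vivaria's actual PostgreSQL schema is richer and differs in detail:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (id INTEGER PRIMARY KEY, task TEXT);
    CREATE TABLE traces (
        run_id     INTEGER REFERENCES runs(id),
        model      TEXT,
        prompt     TEXT,
        completion TEXT,
        tokens     INTEGER
    );
""")
conn.executemany("INSERT INTO runs VALUES (?, ?)",
                 [(1, "replicate"), (2, "summarize")])
conn.executemany("INSERT INTO traces VALUES (?, ?, ?, ?, ?)", [
    (1, "gpt-4",  "p1", "c1", 8000),
    (1, "gpt-4",  "p2", "c2", 4000),   # run 1 totals 12,000 tokens
    (2, "claude", "p3", "c3", 3000),   # run 2 totals 3,000 tokens
])

# "Show me all runs where the agent used more than 10,000 tokens"
heavy_runs = conn.execute("""
    SELECT run_id, SUM(tokens) AS total
    FROM traces
    GROUP BY run_id
    HAVING total > 10000
""").fetchall()
print(heavy_runs)  # [(1, 12000)]
```

Because each trace row carries model, token, and timing metadata, questions about agent behavior become ordinary SQL rather than log-file archaeology.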

The React-based UI provides the collaboration layer that distinguishes Vivaria from simpler eval frameworks. Researchers can tag specific trace entries, add comments to agent actions, and create annotations that link to sections of execution logs. This turns raw trace data into collaborative artifacts—teams can discuss whether a particular agent behavior represents deception, goal drift, or simple failure.
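One plausible shape for such annotations, sketched as Python dataclasses: the field names and structure here are assumptions for illustration, not Vivaria's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    run_id: str
    index: int      # position within the run's execution log
    content: str

@dataclass
class Annotation:
    author: str
    text: str
    tags: list[str] = field(default_factory=list)

@dataclass
class AnnotatedEntry:
    entry: TraceEntry
    annotations: list[Annotation] = field(default_factory=list)

# A researcher flags a specific action in a run's trace for discussion
entry = AnnotatedEntry(TraceEntry("run-42", 7, "agent deleted its own logs"))
entry.annotations.append(
    Annotation("researcher-a", "possible deception?", tags=["deception"]))
print(entry.annotations[0].tags)
```

The essential design point is that annotations attach to individual trace entries, not whole runs, so a discussion thread can point at the exact action it concerns.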

The built-in playground demonstrates Vivaria’s focus on iteration speed. Rather than running full evaluations to test prompt variations, researchers can experiment in an interactive REPL-like environment that still captures all the trace data. You modify a prompt, see immediate results, and those experiments become part of your research record automatically.

// Simplified server-side route handling agent actions
router.post('/agent/:runId/bash', async (req, res) => {
  const { runId } = req.params;
  const { command } = req.body;

  // Log the action to the trace table before executing it
  const trace = await db.traces.insert({
    runId,
    type: 'bash',
    input: command,
    timestamp: Date.now()
  });

  // Execute in the run's container
  const container = await getAgentContainer(runId);
  const output = await container.exec(command);

  // Record the output against the same trace entry
  await db.traces.update(trace.id, {
    output,
    completed: Date.now()
  });

  return res.json({ output });
});

This trace-everything architecture has implications beyond logging. Because every agent interaction passes through the server, Vivaria can enforce safety boundaries, rate limit API calls, or inject monitoring without touching agent code. It’s instrumentation by infrastructure rather than by convention.
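Because every call funnels through one chokepoint, policies like rate limiting become a server-side concern rather than an agent-side one. A minimal token-bucket sketch (the `TokenBucket` class is invented for illustration, not Vivaria code):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` calls, refilling at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# No refill: a hard cap of 2 calls, as a server might impose per run
bucket = TokenBucket(capacity=2, refill_per_sec=0.0)
results = [bucket.allow() for _ in range(3)]
print(results)  # first two calls pass, the third is rejected
```

Dropping this check into the request path throttles every agent uniformly, with no change to agent code, which is exactly the leverage the chokepoint design buys.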

Gotcha

The elephant in the room: METR explicitly states they’re transitioning away from Vivaria to Inspect AI and ramping down feature development. This isn’t a subtle deprecation buried in release notes—it’s front-and-center in the README. For production evaluation pipelines or long-term research infrastructure, this makes Vivaria a non-starter. You’d be building on a foundation its creators have abandoned.

Beyond the deprecation, Vivaria’s infrastructure requirements are substantial. You need Docker (with permissions to create containers), PostgreSQL with specific schema setup, Auth0 configuration for multi-user access, and significant operational knowledge to debug when containers misbehave or traces don’t appear as expected. The documentation acknowledges APIs are unstable with no versioning guarantees—breaking changes can land in any commit. For quick experiments or solo research, this operational overhead outweighs the benefits. The platform was clearly designed for sustained team-based research programs, not lightweight evaluation needs. If you’re comparing approaches for a single paper or a small study, simpler frameworks will get you to results faster.

Verdict

Use if: You’re already running Vivaria in production and the migration cost to Inspect AI exceeds your remaining research timeline, or you’re studying Vivaria specifically as a reference implementation for building agent evaluation infrastructure. The codebase demonstrates sophisticated approaches to trace capture and collaborative analysis that remain instructive even as the platform sunsets.

Skip if: You’re starting a new project; METR’s own recommendation to use Inspect AI should be definitive. Also skip if you need stable APIs, minimal operational complexity, or long-term support. The combination of deprecated status, unstable interfaces, and heavy infrastructure requirements makes Vivaria unsuitable for most modern evaluation needs. Learn from its architecture, but build on actively maintained alternatives.
