Inside METR's Task Standard: How AI Safety Researchers Benchmark Dangerous Autonomous Capabilities

Hook

What happens when you need to measure whether an AI can autonomously develop a computer worm or replicate a novel ML research paper? You can't just use HumanEval.

Context

Traditional code generation benchmarks like HumanEval and MBPP test whether AI models can write functions that pass unit tests. But as AI agents become more autonomous—chaining together API calls, executing commands, and persisting across sessions—we need evaluation frameworks that match this complexity. The question isn't just "can it write a sorting algorithm," but "can it independently build a multi-component system, debug failures, and validate correctness without human intervention?"

METR (formerly ARC Evals) created the public-tasks repository to address this gap in AI safety research. Their focus is specifically on measuring what they call "dangerous autonomous capabilities": tasks that, if an AI could complete them independently, would represent significant safety concerns. This includes challenges like developing exploits, building self-replicating systems, or conducting complex research without oversight. The 31 publicly available tasks (from a total suite of 186) represent families of challenges spanning software engineering, machine learning research, security testing, and complex system design. Unlike traditional benchmarks that evaluate isolated code snippets, these tasks simulate realistic engineering work with all its messiness: incomplete specifications, multi-step debugging, and integration challenges.

Technical Insight

The METR Task Standard architecture centers on containerization with an unusual twist: tasks aren't just Docker images; they're OCI-compliant artifacts stored in container registries. Each task consists of a TaskFamily directory containing scoring logic, instructions, and asset references managed through DVC (Data Version Control).

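Here's a sketch of what a basic task execution looks like when integrated with the Inspect framework. Treat the metr_task_bridge module name and the metr_task arguments as illustrative, and check the bridge's own documentation for the exact names: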

from inspect_ai import Task, task
from inspect_ai.scorer import match
from metr_task_bridge import metr_task

@task
def payment_matching_task():
    return Task(
        dataset=metr_task(
            registry="ghcr.io/metr",
            task_name="fuzzy_payment_matching",
            tag="latest"
        ),
        scorer=match(),
        sandbox="docker"
    )

Under the hood, each task container provides a standardized interface. The agent interacts with the task environment through bash commands, file operations, and network requests—exactly how a human engineer would. The task container includes scoring mechanisms that automatically validate whether the agent's solution meets the requirements. For instance, the fuzzy_payment_matching task might spin up a database of payment records with timezone inconsistencies and fuzzy matching requirements, then score based on whether the agent's code correctly identifies matching transactions.
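
To make the scoring idea concrete, here is a minimal sketch of how a fuzzy payment matcher and its scorer could work. It is not METR's actual scoring code: the record fields (amount_cents, ts), the five-minute tolerance window, and the Jaccard-style score are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries a UTC offset and normalize it to UTC."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc)


def is_match(a: dict, b: dict, window: timedelta = timedelta(minutes=5)) -> bool:
    """Hypothetical rule: equal amounts and UTC timestamps within `window` of each other."""
    same_amount = a["amount_cents"] == b["amount_cents"]
    close_in_time = abs(to_utc(a["ts"]) - to_utc(b["ts"])) <= window
    return same_amount and close_in_time


def score_matches(submitted: set, expected: set) -> float:
    """Jaccard overlap between submitted and expected pairs (1.0 means identical sets)."""
    if not submitted and not expected:
        return 1.0
    return len(submitted & expected) / len(submitted | expected)


# The same instant written in two different offsets should still match.
payment = {"amount_cents": 1250, "ts": "2024-03-01T23:58:00-05:00"}
bank_row = {"amount_cents": 1250, "ts": "2024-03-02T05:00:00+00:00"}
print(is_match(payment, bank_row))
```

Normalizing to UTC before comparing is the step that handles the timezone inconsistencies; comparing the raw strings would treat these two records as a day apart.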

The DVC integration is particularly clever for preventing data contamination. Rather than committing full task assets directly to the repository (which would make them scrapeable for training data), the repository contains only .dvc pointer files:

# example.dvc (placeholder values for illustration)
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 1048576
  path: task_assets/example_data.json

When you run dvc pull, assets are fetched from remote storage. This creates a barrier between public code and the actual evaluation data, making it harder for AI training pipelines to inadvertently memorize solutions.
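
As a rough illustration of what a pointer file buys you, the sketch below checks a local file against the size and md5 recorded in one pointer entry. The helper names are hypothetical, and the sketch assumes the recorded md5 is a plain hash of the file's bytes; DVC itself does this verification for you, and older DVC versions hash text files differently.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large assets do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path: Path, pointer: dict) -> bool:
    """`pointer` is one entry of the `outs` list from a .dvc file, loaded as a dict."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check first, before hashing
    return file_md5(path) == pointer["md5"]
```

The pointer carries enough information to detect a missing or tampered asset, but nothing that reveals the asset's contents.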

The OCI artifact approach requires specific tooling. You can't just docker pull these tasks like normal images. Instead, you need containerd's image store and ORAS (OCI Registry As Storage) compatibility:

# Enable containerd image store in Docker Desktop
# or use containerd directly
ctr images pull ghcr.io/metr/public-tasks/fuzzy_payment_matching:latest

# The METR Task Bridge handles the complexity
# (command name and flags shown are illustrative; check the bridge's README):
metr-task-bridge run \
  --registry ghcr.io/metr/public-tasks \
  --task fuzzy_payment_matching \
  --agent-output ./results

Each task family follows a consistent structure: a Python module named after the task family that defines the TaskFamily class, a resources/ directory with instructions and setup scripts, and a solutions/ directory (notably absent from the public repository to prevent solution leakage). The scoring happens inside the container, isolated from the agent's execution environment, which prevents agents from manipulating their own evaluation metrics.
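
For orientation, here is a toy TaskFamily in the general shape the Task Standard describes, with static methods for listing tasks, producing instructions, and scoring a submission. The task contents, the Task fields, and the standard_version value are assumptions made for this sketch; real task families typically define more hooks and far more involved scoring.

```python
from typing import Dict, Optional, TypedDict


class Task(TypedDict):
    # Hypothetical per-task data; real families define their own fields.
    prompt: str
    expected_answer: str


class TaskFamily:
    # Assumption: set this to the Task Standard version the family targets.
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> Dict[str, Task]:
        """Map task names to the data each task needs."""
        return {
            "toy": {
                "prompt": "Compute 40 + 2 and submit the result.",
                "expected_answer": "42",
            }
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        """Text shown to the agent at the start of the run."""
        return t["prompt"]

    @staticmethod
    def score(t: Task, submission: str) -> Optional[float]:
        """Return a score in [0, 1]; None would signal that manual scoring is needed."""
        return 1.0 if submission.strip() == t["expected_answer"] else 0.0
```

Because the agent only ever sees the output of get_instructions, the expected answer stays with the scoring code.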

What makes these tasks genuinely challenging is their multi-step nature. The expert_board_game_ai task, for example, requires an agent to: (1) understand novel game rules from documentation, (2) implement a game engine, (3) develop an AI player using appropriate algorithms like minimax or Monte Carlo tree search, and (4) validate that the AI plays at a competent level. No single LLM API call will solve this—it requires planning, execution, debugging, and iteration cycles that mirror real engineering work.
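
To give a sense of step (3), here is a minimal minimax sketch. It uses a simple Nim variant as a stand-in (players alternate taking 1 to 3 stones, and whoever takes the last stone wins), since the actual game in expert_board_game_ai is not described here; the game, function names, and scoring convention are all assumptions.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2, or 3 stones per turn


@lru_cache(maxsize=None)
def minimax(stones: int, maximizing: bool) -> int:
    """Return +1 if the maximizing player wins under perfect play, else -1."""
    if stones == 0:
        # The previous player took the last stone, so the side to move has lost.
        return -1 if maximizing else 1
    scores = [minimax(stones - m, not maximizing) for m in MOVES if m <= stones]
    return max(scores) if maximizing else min(scores)


def best_move(stones: int) -> int:
    """Choose the move with the best minimax value for the player to move."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: minimax(stones - m, False))


print(best_move(7))  # leaves the opponent with 4 stones, a losing position
```

An exhaustive search like this only works because the toy game is tiny. A novel board game would usually need depth limits, a heuristic evaluation function, or Monte Carlo tree search, which is part of what makes the task a multi-step engineering problem.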

Gotcha

The repository README is refreshingly honest: these are "work-in-progress products" that may contain bugs. In practice, this means you'll encounter rough edges. Task containers might fail to build on certain platforms, scoring mechanisms may have edge cases, and documentation can be sparse for specific task families. The infrastructure requirements alone—Docker with containerd image store, a compatible container registry, DVC setup—create significant friction before you write a single line of evaluation code.

The bigger limitation is access. Only 31 of 186 tasks are public, and accessing the full suite requires emailing METR directly. This makes sense from an AI safety perspective (widely distributed tasks and solutions can leak into training data and invalidate the evaluation), but it limits the framework's utility for general-purpose agent benchmarking. There's also the philosophical question of whether these tasks actually measure "dangerous capabilities" or simply complex engineering skills. Building a payment matching system is challenging, but is it dangerous? The framing suggests these capabilities become concerning at scale or when combined with autonomy, but that context isn't always clear from the task definitions alone.

Verdict

Use if: You're conducting formal AI safety research and need reproducible, containerized benchmarks for autonomous agent capabilities. The infrastructure investment pays off if you're evaluating multiple agent architectures against consistent, complex tasks, or if you need task isolation guarantees that prevent agents from contaminating their own evaluation environments. This is purpose-built for research labs measuring progress toward AGI-level autonomous capabilities.

Skip if: You want quick-start code generation benchmarks, lack dedicated infrastructure for container registries and DVC, or need production-ready evaluation tools with extensive documentation. If your goal is improving coding assistants for day-to-day development tasks rather than measuring autonomous agent risks, HumanEval or SWE-bench will give you better signal with far less operational overhead. The work-in-progress status and limited public task availability make this a poor choice for casual experimentation.