Inside METR’s Public Tasks: How AI Safety Researchers Test for Dangerous Agent Capabilities
Hook
What happens when you need to test whether an AI agent can autonomously develop a computer worm or manipulate cryptocurrency markets—without actually letting it loose on the internet?
Context
As large language models evolve from passive text generators into autonomous agents capable of using tools, executing code, and navigating complex environments, the AI safety community faces a critical question: how do we measure dangerous capabilities before deployment? Traditional benchmarks like coding challenges or multiple-choice exams don’t capture the risk profile of an agent that can independently pursue goals across multiple steps, recover from failures, and use external resources.
METR (formerly ARC Evals) built public-tasks to address this gap. Rather than testing whether models can answer questions correctly, these evaluations assess whether agents can complete realistic, multi-step objectives that could pose real-world risks if executed autonomously. The repository contains 31 example tasks spanning software engineering, cybersecurity, machine learning research, and even game AI development—each designed to probe a specific capability that safety researchers worry about. Critically, these aren’t hypothetical scenarios: they’re concrete, executable evaluations packaged as Docker containers that agents interact with through a standardized interface.
Technical Insight
The architecture of public-tasks centers on the METR Task Standard, a specification that defines how evaluation tasks should be structured, executed, and scored. Each task runs in an isolated Docker container with a clearly defined starting state, success criteria, and scoring mechanism. Agents interact with these environments through the Inspect evaluation framework via the METR Inspect Task Bridge, which translates between the agent’s actions and the containerized task environment.
Here’s what a basic task interaction looks like when using the Inspect bridge:
from inspect_ai import Task, task
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import match
from metr_task_standard import TaskFamily
# Load a METR task through the bridge
@task
def evaluate_agent():
return Task(
dataset=TaskFamily("crypto-hash-collision"),
plan=[
system_message(
"You are in a Linux environment. Complete the objective "
"described in instructions.txt. You can execute bash commands "
"and write code to solve the task."
),
generate()
],
scorer=TaskFamily.get_scorer(),
sandbox="docker"
)
The task definitions themselves are remarkably transparent about their dangerous nature. Tasks like “Develop Computer Worm” explicitly ask agents to create self-replicating malware, while “Improve An LLM Agent” tests whether an AI can enhance its own capabilities—a key concern in recursive self-improvement scenarios. The “Cowthello” task takes a different approach, implementing a novel board game and challenging agents to develop an AI player that beats a reference implementation created by a human expert in 10 hours, all while meeting performance constraints.
What makes this architecture particularly clever is the use of Data Version Control (DVC) for sensitive task assets. Many tasks include solutions, attack payloads, or other materials that could contaminate training datasets if widely distributed. By storing these in DVC and explicitly requesting that organizations not include them in training data, METR attempts to preserve evaluation integrity:
# Task assets are stored separately and pulled on demand
dvc pull tasks/reverse-engineering-malware/assets
# This keeps solutions out of Git history and web scrapes
The scoring mechanisms vary by task complexity. Simple tasks use exact match scoring—did the agent produce the correct flag or output? More complex tasks like “Run Inference With Quantized LLM” use multi-stage validation that checks both intermediate steps (did you quantize the model correctly?) and final outcomes (does the quantized model produce correct predictions?). The most sophisticated tasks include performance requirements: it’s not enough to solve the problem; you must solve it within time or resource constraints.
One particularly interesting design pattern is the “task family” concept. Rather than single static challenges, tasks are designed as templates that can generate multiple variations. This reduces the risk of agents memorizing specific solutions and makes it harder for training data contamination to invalidate the entire evaluation. A task family might vary the specific vulnerability to exploit, the data to analyze, or the optimization target, while keeping the underlying capability requirement constant.
The repository also demonstrates thoughtful compartmentalization of risk. Tasks are categorized by domain (cybersecurity, ML, software engineering) and implicitly by danger level. Some tasks like “Research Historical Events” are relatively benign capability tests, while others like “Manipulate Cryptocurrency Market” probe genuinely concerning autonomous capabilities. This allows researchers to progressively test agent capabilities, starting with safer evaluations before moving to higher-risk assessments.
Gotcha
The repository comes with a prominent warning: it’s work-in-progress with known bugs and issues. This isn’t just boilerplate caution—several tasks have scoring problems, environment setup issues, or unclear success criteria that make them unsuitable for production benchmarking. If you’re planning to use these tasks for published research or product decisions, expect to spend significant time debugging and potentially fixing issues yourself.
More fundamentally, the entire approach to preventing training data contamination relies on informal requests and DVC storage rather than cryptographic or technical enforcement. The README asks organizations not to include task solutions in training data, but there’s no mechanism to verify compliance or prevent determined actors from accessing and using these materials. Given the incentives for frontier labs to maximize training data, and the difficulty of tracking data provenance in large web scrapes, these tasks may already be partially contaminated. The 186-task proprietary suite that METR keeps private likely provides more reliable evaluation, but that’s only available through direct arrangement with the organization. For truly novel agent capabilities, you may need to develop your own tasks rather than relying on public benchmarks that sophisticated models might have already seen during training.
Verdict
Use if: You’re conducting AI safety research focused on dangerous capabilities, building autonomous agent systems that need realistic multi-step evaluations, or developing safety benchmarks for pre-deployment testing. The Docker-based infrastructure and Task Standard conformance make this immediately practical for academic research, and the explicit focus on risky capabilities fills a genuine gap in the evaluation landscape. Also use if you want to understand how safety-focused organizations think about capability assessment—the task selection itself is revealing. Skip if: You need production-ready benchmarks without debugging required, want comprehensive coverage from the 31 public tasks alone (you’ll likely need to request the full 186-task suite), or are building general-purpose coding evaluations rather than safety-specific assessments. Also skip if you’re uncomfortable with the informal anti-contamination approach or if you need technical guarantees that your evaluation hasn’t appeared in training data. For most product development, established benchmarks like SWE-bench provide more reliability; public-tasks is best suited for research contexts where exploring dangerous capabilities is the explicit goal.