Harbor: The Missing Standard for Evaluating AI Coding Agents at Scale

Hook

Evaluating a single AI coding agent on SWE-Bench can take 48 hours on a laptop. Researchers using Harbor routinely run 500 parallel evaluations in the same time, and generate training data for reinforcement learning while they're at it.

Context

The explosion of AI coding agents—from Anthropic's Claude Code to OpenHands to Devin—created a new problem: no standard way to compare them. Each agent ships with its own evaluation scripts, environment setup, and benchmark implementations. Want to see if OpenHands beats Claude Code on SWE-Bench? You'll need to wrangle two different Docker configurations, manually normalize outputs, and probably write glue code. Want to add a new benchmark? Copy-paste evaluation logic across every agent implementation.

This fragmentation isn't just annoying—it's scientifically problematic. Different evaluation harnesses introduce confounding variables. One agent might perform better simply because its evaluation environment has more generous timeouts or different dependencies installed. Harbor emerged from the Terminal-Bench project as a solution: a framework that standardizes the entire evaluation pipeline from dataset loading through environment provisioning to result collection. It treats agents as black boxes implementing a common interface, and benchmarks as declarative configurations that any agent can run against.

Technical Insight

Harbor's architecture revolves around three core abstractions: Datasets, Agents, and Environment Providers. Datasets define what to evaluate (problem instances from benchmarks), Agents define how to solve problems (the LLM-powered systems under test), and Environment Providers define where evaluation happens (local Docker, cloud VMs, etc.).
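To make the division of labor concrete, here is a minimal, self-contained sketch of how three such abstractions could fit together. It is not Harbor's actual API: the `Task`, `Environment`, `Agent`, and `EnvironmentProvider` shapes, the toy implementations, and the `evaluate` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Task:
    """One problem instance from a benchmark (hypothetical shape)."""
    task_id: str
    instruction: str


class Environment(Protocol):
    """Where a task runs: a container, a VM, or (here) an in-memory stub."""
    def run(self, command: str) -> str: ...


class Agent(Protocol):
    """How a task gets solved: receives a task and an environment."""
    def solve(self, task: Task, env: Environment) -> bool: ...


class EnvironmentProvider(Protocol):
    """Creates a fresh environment for each task."""
    def provision(self, task: Task) -> Environment: ...


# --- Toy implementations so the sketch runs without Docker or an LLM ---

class EchoEnvironment:
    def __init__(self) -> None:
        self.history: list[str] = []

    def run(self, command: str) -> str:
        self.history.append(command)
        return f"ran: {command}"


class LocalProvider:
    def provision(self, task: Task) -> EchoEnvironment:
        return EchoEnvironment()  # fresh environment per task


class ScriptedAgent:
    def solve(self, task: Task, env: Environment) -> bool:
        output = env.run(f"echo {task.instruction}")
        return output.startswith("ran:")


def evaluate(tasks: Iterable[Task], agent: Agent, provider: EnvironmentProvider) -> float:
    """Run every task in its own environment and return the success rate."""
    outcomes = [agent.solve(task, provider.provision(task)) for task in tasks]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


tasks = [Task("t1", "list files"), Task("t2", "count lines")]
rate = evaluate(tasks, ScriptedAgent(), LocalProvider())
print(rate)
```

Because `evaluate` depends only on the three interfaces, swapping the agent or the provider does not touch the evaluation loop, which is the property the article attributes to Harbor's design.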

Here's what a minimal evaluation looks like:

from harbor import Dataset, Agent
from harbor.providers import DockerProvider

# Load a benchmark dataset
dataset = Dataset.load("terminal-bench/basic")

# Configure the agent to test
agent = Agent.from_config(
    name="claude-code",
    model="claude-3-5-sonnet-20241022",
    max_tokens=100000
)

# Run evaluation locally with Docker
results = dataset.evaluate(
    agent=agent,
    provider=DockerProvider(),
    num_workers=4
)

print(f"Success rate: {results.success_rate}")
print(f"Average cost: ${results.avg_cost}")

Switch to cloud execution by changing one parameter: provider=ModalProvider(n_containers=200). Harbor handles environment provisioning, task distribution, and result aggregation identically across providers.
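The task distribution and result aggregation steps can be pictured with a small sketch. This is an assumption about the general shape of such a pipeline, not Harbor's implementation: `TaskResult`, `run_all`, `summarize`, and `fake_local_run` are hypothetical, and a thread pool stands in for whatever scheduler a real provider uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    success: bool
    cost: float


def run_all(
    task_ids: Sequence[str],
    run_one: Callable[[str], TaskResult],
    num_workers: int,
) -> list[TaskResult]:
    """Fan tasks out over a worker pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(run_one, task_ids))


def summarize(results: Sequence[TaskResult]) -> dict[str, float]:
    """Aggregate per-task results into the headline numbers."""
    n = len(results)
    if n == 0:
        return {"success_rate": 0.0, "avg_cost": 0.0}
    return {
        "success_rate": sum(r.success for r in results) / n,
        "avg_cost": sum(r.cost for r in results) / n,
    }


def fake_local_run(task_id: str) -> TaskResult:
    # Stand-in for "provision an environment, run the agent, check criteria".
    return TaskResult(task_id, success=task_id.endswith("0"), cost=0.25)


ids = [f"task-{i}" for i in range(10)]
summary = summarize(run_all(ids, fake_local_run, num_workers=4))
print(summary)
```

Only `run_one` knows where a task executes, so the same fan-out and aggregation code would serve a local or a cloud backend.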

The Agent abstraction is particularly clever. Instead of requiring agents to implement a Harbor-specific interface, it wraps existing agent implementations through adapters. For example, the OpenHands adapter spawns the OpenHands runtime and translates between Harbor's task format and OpenHands' expected inputs:

class OpenHandsAdapter(Agent):
    def solve(self, task: Task, environment: Environment) -> AgentRun:
        # Start OpenHands runtime in the environment
        runtime = environment.spawn_process(
            "python -m openhands.runtime.server"
        )
        
        # Send task in OpenHands format
        response = runtime.send({
            "instruction": task.instruction,
            "workspace": environment.workspace_path
        })
        
        # Translate back to Harbor format
        return AgentRun(
            actions=response["actions"],
            final_state=environment.capture_state(),
            cost=response["tokens"] * self.cost_per_token
        )

This adapter pattern means you can evaluate proprietary agents like Claude Code alongside open-source ones without any of them needing to know about Harbor's internals.

The Dataset abstraction handles both static benchmarks and dynamic problem generation. Terminal-Bench tasks are defined as YAML files specifying initial state, success criteria, and metadata:

task_id: "file-manipulation-042"
instruction: "Find all Python files containing 'TODO' and create a summary"
initial_state:
  files:
    - path: "src/app.py"
      content: "# TODO: Add logging"
    - path: "tests/test.py"
      content: "# TODO: More assertions"
success_criteria:
  - type: "file_exists"
    path: "summary.txt"
  - type: "file_contains"
    path: "summary.txt"
    pattern: 'src/app\.py.*TODO: Add logging'
timeout: 300

Harbor validates success criteria inside the containerized environment, ensuring agents can't accidentally cheat by modifying the host system. Each task runs in a fresh container with exactly the specified initial state.
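As an illustration of how criteria like these might be checked, the sketch below evaluates `file_exists` and `file_contains` against a directory. It assumes the pattern is a Python regular expression (with the dot in the file name escaped) and uses a temporary directory in place of a container; `check_criterion` and `task_succeeded` are hypothetical helpers, not Harbor functions.

```python
import re
import tempfile
from pathlib import Path


def check_criterion(root: Path, criterion: dict) -> bool:
    """Evaluate one success criterion against files under `root`."""
    target = root / criterion["path"]
    kind = criterion["type"]
    if kind == "file_exists":
        return target.is_file()
    if kind == "file_contains":
        if not target.is_file():
            return False
        return re.search(criterion["pattern"], target.read_text()) is not None
    raise ValueError(f"unknown criterion type: {kind}")


def task_succeeded(root: Path, criteria: list[dict]) -> bool:
    """A task passes only if every criterion holds."""
    return all(check_criterion(root, c) for c in criteria)


criteria = [
    {"type": "file_exists", "path": "summary.txt"},
    {"type": "file_contains", "path": "summary.txt",
     "pattern": r"src/app\.py.*TODO: Add logging"},
]

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    before = task_succeeded(workspace, criteria)  # no summary.txt yet
    (workspace / "summary.txt").write_text("src/app.py: # TODO: Add logging\n")
    after = task_succeeded(workspace, criteria)

print(before, after)
```

Requiring every criterion to pass keeps the verdict binary, which also makes it easy to reuse as a terminal reward signal.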

The RL integration is where Harbor gets interesting for researchers. Setting collect_rollouts=True captures the full trajectory—every command executed, every file modified, every LLM call—in a format compatible with RL training:

results = dataset.evaluate(
    agent=agent,
    provider=DockerProvider(),
    collect_rollouts=True
)

# Rollouts include full state transitions
for rollout in results.rollouts:
    for step in rollout.steps:
        print(f"State: {step.state_hash}")
        print(f"Action: {step.action}")
        print(f"Reward: {step.reward}")  # Based on success criteria
        print(f"Next state: {step.next_state_hash}")

These rollouts can feed RL algorithms like PPO or, once successful and failed trajectories for the same task are paired as preferences, preference-optimization methods like DPO, turning Harbor from an evaluation framework into a training data generator. The Terminal-Bench team uses this to iteratively improve agents: evaluate on the benchmark, collect failed rollouts, train on corrections, repeat.
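One plausible way to turn collected rollouts into training data is to pair a task's best trajectory with its worse ones, which is the chosen/rejected format preference-based methods expect. The sketch below assumes each step carries an `action` and a `reward`, as in the loop above; the `Step` and `Rollout` dataclasses and `preference_pairs` are hypothetical stand-ins, not Harbor types.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    reward: float


@dataclass(frozen=True)
class Rollout:
    task_id: str
    steps: tuple[Step, ...]

    @property
    def total_reward(self) -> float:
        return sum(step.reward for step in self.steps)


def preference_pairs(rollouts: list[Rollout]) -> list[tuple[Rollout, Rollout]]:
    """Pair each task's best rollout (chosen) with each strictly worse one (rejected)."""
    by_task: dict[str, list[Rollout]] = defaultdict(list)
    for rollout in rollouts:
        by_task[rollout.task_id].append(rollout)

    pairs = []
    for group in by_task.values():
        best = max(group, key=lambda r: r.total_reward)
        pairs.extend((best, other) for other in group
                     if other.total_reward < best.total_reward)
    return pairs


rollouts = [
    Rollout("t1", (Step("grep -rl TODO src", 0.0), Step("write summary.txt", 1.0))),
    Rollout("t1", (Step("ls", 0.0),)),
    Rollout("t2", (Step("cat README", 0.0),)),
]
pairs = preference_pairs(rollouts)
print(len(pairs))
```

Tasks with only one rollout, or with rollouts that all score the same, contribute no pairs, so running each task several times matters if preference data is the goal.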

Gotcha

Harbor's containerization requirement is both its strength and its weakness. Every evaluation runs in a fresh Docker container, which guarantees isolation and reproducibility but adds 2-3 seconds of startup overhead per task. For benchmarks with thousands of quick tasks, this overhead dominates runtime. The framework currently has no "batch mode" to reuse containers across tasks, though one is planned.
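A quick back-of-the-envelope calculation shows when the startup cost matters. The sketch assumes a 2.5-second startup (the midpoint of the range quoted above) and a few illustrative task durations; `startup_overhead_fraction` is a hypothetical helper.

```python
def startup_overhead_fraction(startup_s: float, task_s: float) -> float:
    """Fraction of per-task wall time spent starting the container."""
    return startup_s / (startup_s + task_s)


for task_s in (1.0, 5.0, 60.0, 300.0):
    share = startup_overhead_fraction(2.5, task_s)
    print(f"{task_s:>6.0f}s task: {share:.0%} startup overhead")
```

Under these assumptions a 1-second task spends roughly 71% of its wall time on startup, a 5-second task about a third, and a 300-second task under 1%.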

Cloud provider support is powerful but immature. Daytona and Modal integration works well for Terminal-Bench's Linux-based tasks, but both providers have quirks. Daytona's VM provisioning can take 30-60 seconds per instance, making small evaluations slower than local execution. Modal requires restructuring your code to fit its serverless model, so you can't drop Harbor in and expect it to work without refactoring. Neither provider supports GPU-accelerated agents yet, which limits use cases involving local model inference.
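The trade-off between provisioning delay and parallelism can be estimated with a simple model. The sketch assumes instances are provisioned in parallel (so the delay is paid once), every task takes the same 30 seconds, and tasks run in sequential waves across the available workers; all numbers and the `wall_time_s` helper are illustrative, not measurements.

```python
import math


def wall_time_s(n_tasks: int, task_s: float, workers: int, provision_s: float = 0.0) -> float:
    """Estimated wall time: one-off provisioning plus sequential waves of tasks."""
    waves = math.ceil(n_tasks / workers)
    return provision_s + waves * task_s


for n_tasks in (4, 20, 500):
    local = wall_time_s(n_tasks, task_s=30.0, workers=4)
    cloud = wall_time_s(n_tasks, task_s=30.0, workers=200, provision_s=45.0)
    print(f"{n_tasks:>4} tasks: local {local:.0f}s, cloud {cloud:.0f}s")
```

With these inputs, 4 tasks finish faster locally, while from around 20 tasks onward the cloud run wins despite the 45-second provisioning delay.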

The framework strongly assumes command-line coding tasks. While you could technically adapt Harbor for other domains, the entire architecture (containerized filesystems, bash command validation, file-based success criteria) is optimized for terminal-based agents. If you're evaluating agents that interact with GUIs, databases, or APIs directly, you'll fight Harbor's abstractions rather than benefit from them.

Verdict

Use if: You're evaluating multiple coding agents on the same benchmark and need reproducible comparisons; you're working with Terminal-Bench or want to contribute new tasks to it; you need to scale evaluations to hundreds of parallel runs; or you're doing RL research and need high-quality trajectory data from agent interactions. Harbor eliminates the grunt work of evaluation infrastructure so you can focus on agent improvements.

Skip if: You're evaluating a single agent once or twice (the setup overhead isn't worth it); your tasks complete in under 5 seconds each (container startup dominates); you need GUI or API-based evaluation (Harbor is terminal-focused); or you're working in a domain where standardized benchmarks don't exist yet (you'll spend more time adapting Harbor than building custom evaluation). For simple cases, agent-specific evaluation scripts or general LLM frameworks like OpenAI Evals will get you results faster.
