Harbor: The Container-Native Framework for Agent Evaluation at Scale
Hook
Most agent evaluation frameworks test your AI in sanitized playgrounds. Harbor throws it into real Docker containers with actual filesystems, terminals, and the full chaos of production environments—then scales that to 100+ concurrent evaluations.
Context
The explosion of AI coding agents—Claude Code, OpenHands, Codex CLI, and others—created an evaluation crisis. Traditional LLM benchmarks test isolated capabilities: reasoning, code generation, instruction following. But agents operate in environments. They navigate filesystems, execute commands, handle errors, and maintain state across multi-step interactions. Testing an agent by running it once on your laptop tells you almost nothing about its reliability across diverse scenarios.
Existing evaluation frameworks fell into two camps: lightweight harnesses that couldn’t provide realistic environments, or heavyweight platforms that bundled evaluation with specific agent architectures. Terminal-Bench, a rigorous benchmark for CLI-based tasks, needed something different: true environment isolation (so tests don’t interfere with each other), reproducibility across machines, and the ability to run hundreds of evaluations in parallel without melting your infrastructure. Harbor emerged from the Terminal-Bench creators as the solution—a framework that treats containerized environments as first-class citizens and makes scaling from local Docker to cloud providers feel trivial.
Technical Insight
Harbor’s architecture revolves around a clean separation of concerns: datasets define what to test, agents define how to interact, environments define where to run, and providers define the infrastructure layer. This abstraction lets you write an evaluation once and run it anywhere.
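That four-way split can be pictured as a handful of small interfaces composed by one evaluation loop. The names below are illustrative, not Harbor's actual Python API; this is a sketch of the decomposition, not the implementation:

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical sketch of the dataset/agent/environment/provider split.
# Names and signatures are illustrative, not Harbor's real API.

@dataclass
class Task:
    """One unit of work from a dataset: *what* to test."""
    task_id: str
    instruction: str

class Dataset(Protocol):
    def tasks(self) -> list[Task]: ...            # what to test

class Environment(Protocol):
    def exec(self, command: str) -> str: ...      # where to run

class Agent(Protocol):
    def solve(self, task: Task, env: Environment) -> str: ...  # how to interact

class Provider(Protocol):
    def create_environment(self, image: str) -> Environment: ...  # infrastructure

def evaluate(dataset: Dataset, agent: Agent, provider: Provider) -> dict[str, str]:
    """Write the evaluation once; swap the provider to run it anywhere."""
    results: dict[str, str] = {}
    for task in dataset.tasks():
        env = provider.create_environment(image="ubuntu:24.04")
        results[task.task_id] = agent.solve(task, env)
    return results
```

Because `evaluate` only depends on the protocols, none of the evaluation logic changes when the infrastructure underneath does.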
The basic workflow is surprisingly straightforward. Here’s how to run Terminal-Bench 2.0 with Claude Code against local Docker:
export ANTHROPIC_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 4
That --n-concurrent 4 flag spins up four Docker containers in parallel, each running an isolated instance of the benchmark. Harbor manages the orchestration: loading tasks from the dataset, instantiating the agent with the specified model backend, executing test cases in isolated containers, and collecting results. Switching to Daytona for cloud execution takes a Daytona API key and the --env daytona flag (here with concurrency raised to 100):
export DAYTONA_API_KEY=<YOUR-KEY>
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 100 \
  --env daytona
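Conceptually, --n-concurrent is a cap on how many isolated environments are in flight at once. A minimal sketch of that scheduling pattern (illustrative only, not Harbor's internals) is a semaphore-bounded gather:

```python
import asyncio

async def run_task(task_id: str, sem: asyncio.Semaphore, results: dict) -> None:
    """Run one evaluation, but only while holding a concurrency slot."""
    async with sem:
        # Placeholder for the real work: create a container, run the
        # agent, grade the output. (Hypothetical; not Harbor's code.)
        await asyncio.sleep(0)
        results[task_id] = "pass"

async def run_benchmark(task_ids: list[str], n_concurrent: int) -> dict:
    """Mirror of `--n-concurrent N`: at most N tasks execute at a time."""
    sem = asyncio.Semaphore(n_concurrent)
    results: dict[str, str] = {}
    await asyncio.gather(*(run_task(t, sem, results) for t in task_ids))
    return results

results = asyncio.run(run_benchmark([f"task-{i}" for i in range(8)], n_concurrent=4))
```

With a cap of 4, the eight tasks still all complete; they simply queue for slots instead of overwhelming the machine.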
The provider abstraction is what makes this work. Locally, Harbor talks to Docker’s API to create containers, mount volumes, and stream logs. With --env daytona, the exact same evaluation logic runs on Daytona’s infrastructure, but now you can scale to 100+ concurrent environments without burning out your laptop’s CPU. The README mentions Modal as another provider option. The agent code doesn’t change—Harbor handles the plumbing.
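The swap can be pictured as dispatch on a provider name, the way --env selects one implementation behind a shared interface. This is a hedged sketch of the pattern, not Harbor's actual classes:

```python
from typing import Protocol

class Provider(Protocol):
    def create_environment(self, image: str) -> str: ...

class DockerProvider:
    """Local execution: would talk to the Docker daemon on this machine."""
    def create_environment(self, image: str) -> str:
        return f"docker-container({image})"

class DaytonaProvider:
    """Cloud execution: same interface, backed by remote infrastructure."""
    def create_environment(self, image: str) -> str:
        return f"daytona-sandbox({image})"

# Hypothetical registry mirroring `--env docker` / `--env daytona`.
PROVIDERS = {"docker": DockerProvider, "daytona": DaytonaProvider}

def get_provider(env: str) -> Provider:
    """Everything above this call stays identical across providers."""
    return PROVIDERS[env]()
```

The agent and dataset code never see which class came out of the registry, which is exactly why moving from laptop to cloud is a flag change rather than a rewrite.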
The framework’s dual-purpose design is equally clever. While most developers will use Harbor for traditional evaluation (run benchmark, collect metrics, compare models), it also generates rollouts suitable for reinforcement learning training. Each evaluation produces structured data about the agent’s trajectory: actions taken, environment states, rewards received. This positions Harbor at the intersection of evaluation and optimization—you can identify where your agent fails, then generate training data to fix those failures.
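A rollout of that shape might be recorded as a list of steps, each pairing an action with the resulting observation and a reward. The schema below is hypothetical, sketched only to show what "structured trajectory data" means; Harbor's real rollout format may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str        # command the agent executed
    observation: str   # environment output/state after the action
    reward: float      # e.g. 1.0 when a test case passes

@dataclass
class Rollout:
    task_id: str
    steps: list[Step] = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        """Aggregate signal an RL trainer could optimize against."""
        return sum(s.reward for s in self.steps)

# Hypothetical trajectory: explore, then run the tests.
rollout = Rollout("terminal-bench/hello-world")
rollout.steps.append(Step("ls", "main.py  tests/", 0.0))
rollout.steps.append(Step("pytest", "1 passed", 1.0))
```

Records like these make the evaluation-to-training loop concrete: filter for low-reward rollouts to find failure modes, or for high-reward ones to build training data.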
Harbor ships with integrations for Terminal-Bench, SWE-Bench, and Aider Polyglot, but the framework allows you to build and share your own benchmarks and environments. The harbor datasets list command shows available third-party datasets, and running custom evaluations follows the same pattern:
harbor run -d "<dataset@version>" -m "<model>" -a "<agent>"
That version pinning (the @2.0 in terminal-bench@2.0) is critical for reproducibility. Benchmarks evolve, and Harbor’s versioning keeps your results comparable over time even if the dataset maintainer adds new test cases.
The containerization layer provides isolation that’s hard to achieve otherwise. When an agent executes rm -rf /tmp/* (intentionally or not), it only affects that container. When a test case requires Python 3.8 but another needs Python 3.11, they run in separate containers with different base images. This isolation is why Harbor can safely parallelize evaluations—there’s no shared state to corrupt, no filesystem conflicts, no port collisions.
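One way to picture that per-task isolation: each task carries its own environment spec, so conflicting requirements never share a runtime. A hypothetical schema, not Harbor's task format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvSpec:
    """Per-task environment description (hypothetical).

    Each spec maps to its own container, so a Python 3.8 task and a
    Python 3.11 task can run side by side without conflicts.
    """
    base_image: str

TASK_ENVS = {
    "legacy-pipeline": EnvSpec(base_image="python:3.8"),
    "modern-api": EnvSpec(base_image="python:3.11"),
}

def image_for(task_id: str) -> str:
    """Pick the container image for a task; no shared state between tasks."""
    return TASK_ENVS[task_id].base_image
```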
Gotcha
Harbor’s reliance on containerization is both its strength and its Achilles’ heel. Spinning up Docker containers introduces overhead—seconds of startup time per evaluation, plus the resource footprint of container runtimes. If you’re testing simple prompt-response pairs or running benchmarks that don’t need environment isolation, Harbor’s machinery is overkill. A lightweight framework that runs evaluations in-process would be significantly faster for those use cases.
The framework also appears to be relatively new. The README provides clear CLI examples and links to a Cookbook repository for “end-to-end examples and guides,” but evaluating the depth of documentation beyond the basics requires consulting those external resources. The Discord badge suggests an active community for support. Early adopters should be prepared to explore the codebase and potentially contribute back as they encounter edge cases or advanced use cases not yet covered in documentation.
Verdict
Use Harbor if you’re evaluating coding agents in realistic environments, running benchmarks like Terminal-Bench or SWE-Bench, or need to scale evaluations across cloud infrastructure. It’s particularly valuable when you need true isolation between test cases or when you’re generating training data for RL optimization. The provider abstraction makes it the rare framework where local development and cloud production genuinely feel identical. Skip it if you’re testing simple language model capabilities without environment interaction, need minimal evaluation overhead, or require extensively documented frameworks for every edge case. For non-coding agent tasks or quick prompt experiments, lighter evaluation approaches will get you results faster without the container orchestration complexity.