TheAgentCompany: Benchmarking AI Agents in a Fully-Simulated Software Company

Hook

What if you could give an AI agent a job at a software company and measure exactly how well it performs real work—merging pull requests, analyzing data, managing projects, and coordinating with teammates? That’s no longer hypothetical.

Context

The AI agent landscape has exploded with frameworks promising to automate complex workflows, but evaluation has lagged behind. Most benchmarks either test isolated capabilities (can it parse JSON?) or use simplified simulations that bear little resemblance to actual work. TheAgentCompany takes a radically different approach: it builds a complete, working software company inside Docker containers—GitLab for version control, Plane for project management, ownCloud for file storage, and RocketChat for team communication—all pre-populated with realistic data. Then it asks agents to do actual work: fix bugs, analyze sales data, onboard new employees, review financial reports, and coordinate across tools.

This matters because organizations considering AI adoption need to know how agents perform on consequential, multi-step tasks that mirror real work, not sanitized academic problems. Similarly, researchers building agent architectures need rigorous evaluation environments that can’t be gamed with prompt engineering against mocked APIs. TheAgentCompany provides both: 175 tasks spanning software engineering, product management, data science, HR, and finance roles, each isolated in its own Docker container with encrypted evaluators.

Technical Insight

The architecture is deceptively straightforward but deeply considered. At the foundation, TheAgentCompany runs four primary services using Docker Compose with host networking: GitLab (a complete Git forge with CI/CD), Plane (an open-source project management tool similar to Jira), ownCloud (file storage and sharing), and RocketChat (team messaging). Each service is restored from pre-baked backup data that creates a fictional company context—existing repositories, ongoing projects, team conversations, and file hierarchies that agents must navigate.
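One way to picture the host-network layout is as a simple service-to-port map. The port numbers below are illustrative placeholders, not the project's documented defaults:

```python
# Sketch: the four services sharing the host network, keyed by assumed ports.
# Port values are placeholders for illustration only.
SERVICES = {
    "gitlab": 8929,      # Git forge with CI/CD
    "plane": 8091,       # project management
    "owncloud": 8092,    # file storage and sharing
    "rocketchat": 3000,  # team messaging
}

def service_url(hostname: str, service: str) -> str:
    """Build the base URL an agent would use to reach a service."""
    return f"http://{hostname}:{SERVICES[service]}"
```

With host networking, an agent inside a task container reaches every service at the same hostname, e.g. `service_url("localhost", "gitlab")`.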

Each of the 175 tasks is packaged as a separate Docker image with a consistent structure: /utils/init.sh initializes task-specific state, /instruction/task.md provides the task description the agent receives, and /utils/evaluator.py.enc contains the encrypted evaluation logic. A typical session starts the container, initializes the task, lets the agent work, and then runs the evaluator:

docker run --name task_container --network host -it \
  theagentcompany/task_hr_resume_screening:1.0.0 /bin/bash

# Inside the container
SERVER_HOSTNAME=localhost \
LITELLM_API_KEY=sk-your-key \
LITELLM_BASE_URL=https://api.openai.com/v1 \
LITELLM_MODEL=gpt-4 \
bash /utils/init.sh

# Agent works on the task at /instruction/task.md
# Then evaluate
python /utils/eval.py

The evaluation system itself is a hybrid design combining deterministic checks with LLM-based grading. For tasks with objectively verifiable outcomes—like “ensure the GitLab pipeline passes” or “the correct file exists in ownCloud”—deterministic validators provide fast, reliable assessment. For tasks requiring judgment—like “does this project plan adequately address the requirements?” or “is this code review comment constructive?”—an LLM evaluator (configurable via LiteLLM) assesses quality.

Critically, many tasks include intermediate checkpoints that validate progress at multiple stages. Instead of a binary pass/fail, you can see that an agent successfully identified the correct repository (checkpoint 1) and created a branch (checkpoint 2) but failed to submit the merge request (checkpoint 3). This granular feedback is invaluable for debugging agent behavior and understanding failure modes.
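The checkpoint idea can be sketched as a small scoring routine. The checkpoint names and point values here are hypothetical (the real evaluators are encrypted), but the partial-credit shape is the point:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str
    points: int   # weight of this checkpoint
    passed: bool  # result of a deterministic or LLM-graded check

def partial_score(checkpoints: list[Checkpoint]) -> float:
    """Return the fraction of available points earned, not just pass/fail."""
    total = sum(c.points for c in checkpoints)
    earned = sum(c.points for c in checkpoints if c.passed)
    return earned / total if total else 0.0

# Example run: the agent found the repo and branched, but never opened the MR.
run = [
    Checkpoint("identified correct repository", 1, True),
    Checkpoint("created feature branch", 1, True),
    Checkpoint("submitted merge request", 2, False),
]
# partial_score(run) == 0.5
```

A failure log then tells you *where* the agent stalled, not merely that it did.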

The integration with OpenHands demonstrates the benchmark’s extensibility. The evaluation harness automatically spins up task containers, injects agent trajectories, and collects results across all 175 tasks. The setup script handles the complexity:

sudo su
cd evaluation
bash run_eval.sh \
  --agent-llm-config gpt-4 \
  --env-llm-config gpt-3.5-turbo \
  --outputs-path ./results \
  --server-hostname localhost \
  --version 1.0.0

This separation of concerns is elegant: the benchmark provides the environment and tasks, while agent frameworks like OpenHands provide the agentic behavior. You can plug in any agent that can browse the web, execute code, and make API calls. The 30+ GB of disk space requirement isn’t arbitrary—it’s what you need to run actual enterprise software with realistic datasets rather than toy examples.
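That plug-in boundary can be sketched as a minimal interface. The method names below are invented for illustration—they are not OpenHands' actual API:

```python
from typing import Protocol

class Agent(Protocol):
    """Anything that can observe the environment and emit the next action."""
    def step(self, observation: str) -> str: ...

class EchoAgent:
    """Trivial stand-in agent: restates what it saw. Useful for harness smoke tests."""
    def step(self, observation: str) -> str:
        return f"noop: {observation}"

def run_episode(agent: Agent, observations: list[str]) -> list[str]:
    """Drive any conforming agent through a fixed sequence of observations."""
    return [agent.step(o) for o in observations]
```

Because the benchmark only cares about effects in the environment (commits, files, messages), any framework that satisfies this kind of contract can be evaluated unchanged.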

One sophisticated design choice is the task diversity. Tasks aren’t just variations on “write code”—they span roles and require different tool combinations. A data science task might require pulling data from ownCloud, analyzing it with Python, and posting results to RocketChat. An HR task might involve screening resumes in ownCloud, updating candidate status in Plane, and scheduling interviews via RocketChat coordination. A software engineering task might require understanding requirements in Plane, implementing changes in GitLab, and documenting the fix. This cross-tool choreography is where agents either shine or collapse, and it’s what makes TheAgentCompany more than just another coding benchmark.
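A data-science task of that shape reduces to a short pipeline. The tool-call helpers in the comments are hypothetical, but the pure analysis step in the middle is concrete:

```python
import csv
import io

def analyze_sales(csv_text: str) -> str:
    """Pure analysis step: total a 'revenue' column and format a chat message."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(r["revenue"]) for r in rows)
    return f"Processed {len(rows)} orders, total revenue ${total:,.2f}"

# In the full task this sits between two (hypothetical) tool calls:
#   data = owncloud_download("sales/q3.csv")        # fetch from ownCloud
#   rocketchat_post("#sales", analyze_sales(data))  # report to the channel
sample = "order_id,revenue\n1,1200.50\n2,799.50\n"
# analyze_sales(sample) -> "Processed 2 orders, total revenue $2,000.00"
```

The hard part for agents is rarely the analysis itself; it is sequencing the fetch, the computation, and the report across three different tools without dropping state.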

Gotcha

The infrastructure requirements are real and non-negotiable. You need Docker with host networking enabled and 30+ GB of free disk space. As a reference, the baseline experiments used Amazon EC2 t3.2xlarge instances (8 vCPUs, 32 GB RAM), which gives you a sense of the computational resources you’ll likely want for stable performance. This isn’t something you’ll run on your laptop during a coffee break. Mac and Windows users face additional friction—host networking requires specific Docker Desktop configuration, and the setup documentation explicitly calls out platform-specific issues. If your organization runs restrictive network policies, the setup script’s dependency on pulling resources from GitHub during initialization can cause silent failures and mysterious stalls.

The encrypted evaluators, while likely preventing test contamination, also reduce transparency. You can’t easily inspect evaluation criteria while developing agents, which means you’re debugging in the dark until you run the full evaluation. This appears intentional to prevent overfitting, but it makes iterative development slower. Additionally, LiteLLM is used for environment LLM configuration, which means you need to understand this abstraction layer and manage API keys for evaluation, not just for your agent.
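A defensive sketch of reading that configuration—the variable names match the run command shown earlier, but the validation logic is our own, not part of the benchmark:

```python
import os

REQUIRED = ("LITELLM_API_KEY", "LITELLM_BASE_URL", "LITELLM_MODEL")

def litellm_config(environ=os.environ) -> dict:
    """Collect the evaluator's LLM settings, failing loudly on missing keys."""
    missing = [k for k in REQUIRED if not environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing LiteLLM settings: {', '.join(missing)}")
    return {k: environ[k] for k in REQUIRED}
```

Checking these up front beats discovering a missing key halfway through a multi-hour evaluation run.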

Finally, the benchmark is software-company-specific. If you’re building agents for healthcare, legal work, or other professional domains, you’ll need to look elsewhere or extend the benchmark significantly. The 175 tasks are comprehensive within their scope but narrow in the universe of possible professional work. And while the README describes extensibility features, actually adding new tasks requires understanding the Docker image structure, encryption mechanisms, and evaluation harness—not trivial for casual contributors.

Verdict

Use TheAgentCompany if you’re serious about evaluating AI agents on realistic, multi-step professional workflows and have the infrastructure to support it. This is the benchmark for researchers comparing agent architectures, for organizations conducting rigorous evaluations before workplace AI adoption, or for teams building agents that need to coordinate across multiple enterprise tools. The investment in setup pays off in evaluation quality that no lightweight benchmark can match. Skip it if you need quick sanity checks, are evaluating single-tool capabilities, lack the hardware resources (or patience) for multi-container deployments, or work outside software company contexts. Also skip if you’re just getting started with agent development—the opacity of encrypted evaluators and infrastructure complexity will slow you down. Start with something lighter like SWE-bench, then graduate to TheAgentCompany when you need the full stress test of real-world professional simulation.
