AgentBoard: Why Your LLM Agent’s 70% Success Rate Might Hide a 10% Grounding Problem
Hook
Your agent solved 70% of tasks in your benchmark. Impressive, right? But what if 80% of its failures stem from misinterpreting observations in the first three turns, and your current metrics can’t tell you that?
Context
LLM agents have graduated from answering questions to navigating websites, manipulating databases, and controlling household robots. Yet most benchmarks report a single number: task success rate. This aggregate metric masks critical failure modes. Does your agent fail because it can’t ground observations to actions? Does it lose track after 10 turns? Does it excel at web navigation but collapse on knowledge graphs?
AgentBoard emerged from HKUST’s NLP lab to address this diagnostic gap. Accepted as an oral presentation at NeurIPS 2024, it’s not another leaderboard chasing higher numbers—it’s an analytical evaluation framework that decomposes agent performance across dimensions like grounding accuracy, progress rates, and sub-skill proficiency. Instead of asking “did the agent succeed?”, AgentBoard asks “where exactly did it fail, and why?”
Technical Insight
AgentBoard’s architecture rests on three pillars: diverse task environments, a standardized agent interface, and an analytical panel built on Weights & Biases. The framework incorporates 9 tasks spanning WebShop and WebArena for web navigation, ALFWorld for household robotics, Database operations, Knowledge Graph reasoning, Operating System control, and interactive games. Critically, all environments are partially observable and require multi-turn interaction—agents must build world models online rather than accessing complete state.
The evaluation flow starts with constructing goal-oriented reflex agents that follow a perceive-act loop. Here’s how you’d evaluate GPT-3.5-turbo on the Database task:
# Set your OpenAI API key
export OPENAI_API_KEY="your-key-here"
# Run evaluation on Database task
python run.py \
--llm_name gpt-3.5-turbo-0613 \
--task_name db \
--method direct \
--split test \
--data_root ./data
This script instantiates the agent, loads task instances from the data directory, and executes the perceive-act cycle. The agent receives partial observations from the environment, generates actions through the LLM, and repeats until goal completion or timeout. Results get logged to results/{task_name}/{llm_name}/ with trajectory files capturing every observation, thought, and action.
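The perceive-act cycle described above can be sketched in a few lines. This is an illustrative simplification, not AgentBoard's actual implementation; the `env`, `llm`, and trajectory-record names are assumptions:

```python
def run_episode(env, llm, max_turns=30):
    """Drive one agent episode and record its trajectory.

    Minimal sketch of a perceive-act loop: the agent never sees the
    full environment state, only per-turn partial observations.
    """
    trajectory = []
    observation = env.reset()
    for turn in range(max_turns):
        # The LLM sees only the partial observation and proposes an action.
        action = llm.generate(f"Observation: {observation}\nAction:")
        # Log what the agent saw and what it did, as in the trajectory files.
        trajectory.append({"turn": turn, "observation": observation,
                           "action": action})
        observation, done = env.step(action)
        if done:
            break
    return trajectory
```

The timeout in the real framework plays the role of `max_turns` here: an episode ends on goal completion or when the turn budget runs out.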
The real power emerges in the analytical panel. After running evaluations, launch the W&B visualization:
python wandb_upload.py \
--project_name agentboard-analysis \
--task_list db kg os \
--llm_name gpt-3.5-turbo-0613
This uploads your results to a structured dashboard exposing metrics invisible to traditional benchmarks. The panel breaks down performance by:
- Progress Rate: How far agents advance toward goals across interaction turns, revealing whether failures occur early (grounding issues) or late (planning issues)
- Grounding Accuracy: Whether agents correctly map observations to valid actions, computed by comparing predicted actions against valid action spaces at each turn
- Hard/Easy Performance: Success rates stratified by instance difficulty, identifying whether models struggle uniformly or collapse on complex cases
- Sub-skill Analysis: Task-specific breakdowns. For Database, the framework's analytical categories appear to cover operations such as SELECT, JOIN, and aggregation; for Knowledge Graph, single-hop versus multi-hop reasoning
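Grounding accuracy, for instance, reduces to a per-turn membership check. The sketch below is a simplified approximation under the assumption that valid actions can be enumerated at each turn; AgentBoard's exact computation may differ:

```python
def grounding_accuracy(predicted_actions, valid_action_sets):
    """Fraction of turns whose predicted action lies in the valid action space.

    predicted_actions: one action string per turn.
    valid_action_sets: the set of legal actions at each corresponding turn.
    """
    if not predicted_actions:
        return 0.0
    # Count turns where the agent grounded its observation to a legal action.
    hits = sum(action in valid
               for action, valid in zip(predicted_actions, valid_action_sets))
    return hits / len(predicted_actions)
```

A low score early in episodes points at the grounding failures the progress-rate curve also surfaces; a high score with low task success points at planning, not grounding.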
The framework’s design allows straightforward LLM customization. For open-source models like Llama or Mistral, you modify the LLM wrapper in src/llms/. The standardized interface requires implementing two methods: generate() for single-turn completion and chat() for multi-turn dialogue. For agents, you can extend the base reflex agent in src/agents/ to implement ReAct-style reasoning, chain-of-thought prompting, or retrieval-augmented generation—as long as you maintain the perceive-act loop contract.
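A custom wrapper might look like the following sketch. Only the `generate()`/`chat()` contract comes from the docs; the class name, constructor, and message format are assumptions, so check `src/llms/` for the actual base class:

```python
# Hypothetical wrapper for a local model. `self.model` stands in for any
# callable that maps a prompt string to a completion string (e.g. a
# Hugging Face pipeline); it is not an AgentBoard API.
class LocalLLM:
    def __init__(self, model):
        self.model = model

    def generate(self, prompt: str) -> str:
        """Single-turn completion."""
        return self.model(prompt)

    def chat(self, messages: list[dict]) -> str:
        """Multi-turn dialogue: flatten role-tagged messages into one prompt."""
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return self.model(prompt + "\nassistant:")
```

Because the agent loop only calls these two methods, swapping GPT-3.5-turbo for Llama or Mistral should not require touching the environments or the analytical panel.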
One architectural choice stands out: AgentBoard deliberately avoids environment modifications or task simplifications. WebArena runs in Docker containers with full DOM access. Database tasks use actual SQLite instances. This authenticity means evaluation runtime can be extensive depending on timeout configurations, but results reflect real-world agent behavior rather than benchmark-optimized performance.
Gotcha
Setup complexity is AgentBoard’s Achilles heel. The full installation requires about 15 minutes (the README promises a “quick start within 30 minutes” total) and significant disk space—WebArena alone needs ~12GB for Docker images. More problematically, WebArena depends on system services like D-Bus and Xvfb (X virtual framebuffer) that may require sudo permissions. The README’s installation check highlights this:
# Must verify D-Bus is running before proceeding
systemctl status dbus # Should show "active (running)"
dpkg -l | grep xvfb # Should return Xvfb info
If these services aren’t available or you lack admin rights, you’re stuck skipping WebArena or wrestling with Docker configurations. The documentation provides fallback Docker instructions, but this adds another layer of complexity.
Runtime variability presents another challenge. The README’s runtime estimation section (truncated in the source) suggests careful resource planning, but actual duration depends on model latency, timeout settings, and task difficulty. A single Database task instance might complete in 2 minutes with GPT-3.5-turbo but could take significantly longer with a slower open-source model.
Finally, the 9-task coverage, while diverse, has gaps. There’s no code generation task (unlike SWE-bench), no multi-agent collaboration scenarios, and limited coverage of tool-use patterns beyond the predefined environments. If your agent specializes in API orchestration or creative tasks like story writing, AgentBoard won’t illuminate those capabilities. The benchmark is explicitly designed for goal-oriented agents in interactive environments—a valuable but bounded scope.
Verdict
Use AgentBoard if you're developing generalist LLM agents and need diagnostic depth beyond success rates. It's essential when comparing multiple models systematically, hunting for failure patterns across interaction turns, or preparing academic research on agent capabilities. The analytical breakdowns justify the setup overhead when you need to understand why your agent fails at turn 7 on knowledge graph tasks but excels on web navigation. Skip it if you're doing rapid prototyping with tight deadlines, lack the computational resources for a multi-environment setup (8GB+ RAM, ~20GB disk based on WebArena's Docker requirements), or are evaluating narrow agents outside the 9 included domains. For quick sanity checks or CI/CD pipelines, lighter benchmarks like individual WebArena runs make more sense. AgentBoard is a research microscope, not a production health check: powerful for deep investigation, overkill for routine monitoring.