AGI SDK: Building Browser Agents Against Production-Quality Web Replicas
Hook
Most AI agent benchmarks test against toy websites with three buttons and a form. AGI SDK ships with pixel-perfect React clones of Amazon, DoorDash, and Airbnb—complete with search, filtering, and checkout flows.
Context
The explosion of LLM-powered agents has created a benchmarking crisis. Researchers evaluate on different websites, with different observation modalities, using incompatible action spaces. You can't compare GPT-4's performance on a shopping task against Claude's because one was tested on a stripped-down HTML mockup and the other navigated a live production site with dynamic content and authentication flows.
Existing benchmarks like MiniWoB++ use simplified, single-page tasks. WebArena is a step forward with more realistic environments, but it still relies on basic web apps that don't capture the UI complexity of modern React applications: infinite scroll, dynamic modals, client-side routing, skeleton loaders. AGI SDK's REAL Bench addresses this by providing deterministic, locally running replicas of production websites. These aren't simplified versions; they're high-fidelity Next.js clones that preserve the interaction patterns, visual hierarchy, and state management of real applications. An agent that can book an Airbnb in REAL Bench has demonstrated capabilities that are far more likely to transfer to actual production environments.
Technical Insight
The SDK's architecture centers on a harness framework that standardizes the observation-action loop. Every agent, regardless of its underlying LLM or decision-making strategy, receives observations in a consistent format and emits actions as structured function calls. Observations are multi-modal: DOM trees for structural understanding, accessibility trees for semantic relationships, screenshots for visual context, and chat history for conversational tasks. Actions map to browser primitives (click, fill, select, navigate) and are expressed as Python function calls written as strings.
Here's what a minimal agent implementation looks like:
    from agisdk import Agent, Observation


    class MyAgent(Agent):
        def __init__(self, model="gpt-4"):
            super().__init__()
            self.model = model

        def act(self, observation: Observation) -> str:
            # observation.dom        - DOM tree
            # observation.a11y_tree  - Accessibility tree
            # observation.screenshot - Base64 image
            # observation.goal       - Task objective
            # build_prompt and llm_call are helper methods not shown here
            prompt = self.build_prompt(observation)
            response = self.llm_call(prompt)
            # Return action as string: "click('checkout-button')"
            return response.action
The string-based action interface is simple by design. Rather than forcing agents to emit JSON schemas or a domain-specific language, actions are Python function calls written as strings. The harness parses these with a lightweight executor that maps them to Playwright commands. This makes it straightforward to plug in new agent architectures: there is no action decoder to rewrite and no compatibility layer to maintain.
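To make the parsing step concrete, here is a minimal sketch of how an action string could be turned into a dispatched call. It is not the SDK's actual executor: `parse_action`, `execute`, and the `handlers` mapping are hypothetical names, and the handlers stand in for whatever Playwright calls the real harness makes. The sketch uses the standard `ast` module so that only simple calls with literal arguments are accepted.

```python
import ast
from typing import Any, Callable, Dict, List, Tuple


def parse_action(action: str) -> Tuple[str, List[Any], Dict[str, Any]]:
    """Split "click('checkout-button')" into a name, positional args, and kwargs."""
    try:
        tree = ast.parse(action.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not valid call syntax: {action!r}") from exc
    call = tree.body
    # Accept only plain calls such as click(...), not attribute chains or expressions.
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    if any(kw.arg is None for kw in call.keywords):
        raise ValueError("**kwargs unpacking is not supported")
    # literal_eval rejects anything that is not a literal (no names, no nested calls).
    args = [ast.literal_eval(node) for node in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return call.func.id, args, kwargs


def execute(action: str, handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Look up the named action in a handler table and invoke it."""
    name, args, kwargs = parse_action(action)
    if name not in handlers:
        raise ValueError(f"unknown action: {name}")
    return handlers[name](*args, **kwargs)
```

In a real harness each handler would wrap a browser call; in a test, a handler can simply record what it was asked to do. Parsing with `ast` instead of `eval` means a malformed or malicious model output fails with a `ValueError` rather than running arbitrary code.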
REAL Bench environments run as actual Next.js applications, not static HTML fixtures. When you initialize a task, the SDK spins up a local server, seeds the database with deterministic test data, and navigates Playwright to the starting URL. Setting up an environment and running the observation-action loop looks like this:
    from agisdk import HarnessEnvironment

    env = HarnessEnvironment(
        task="amazon_cart_v2",
        observation_config={
            "dom": True,
            "a11y_tree": True,
            "screenshot": True,
            "chat_history": True,
        },
        max_steps=20,
        headless=False,  # Set True for CI/CD
    )

    agent = MyAgent()  # the agent class defined above
    observation = env.reset()
    while not env.done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
The observation configuration lets you control overhead. Screenshots add latency but give vision-language models crucial visual context. The accessibility tree is lighter-weight than the full DOM but preserves semantic structure. You can benchmark different observation modalities to understand what information agents actually need.
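One way to reason about that overhead is to measure how much each modality contributes to the payload an agent has to process. The sketch below is illustrative only: `select_modalities` and `payload_bytes` are hypothetical helpers, not SDK functions, and the observation is a small placeholder rather than real harness output.

```python
import json
from typing import Any, Dict


def select_modalities(raw: Dict[str, Any], config: Dict[str, bool]) -> Dict[str, Any]:
    """Keep only the modalities whose flag is True in the config."""
    return {name: value for name, value in raw.items() if config.get(name, False)}


def payload_bytes(observation: Dict[str, Any]) -> int:
    """Approximate the serialized size of an observation in bytes."""
    return len(json.dumps(observation).encode("utf-8"))


# Placeholder observation; real DOM trees and screenshots are far larger.
raw_observation = {
    "dom": "<html>" + "<div></div>" * 200 + "</html>",
    "a11y_tree": "button 'Add to cart'\n" * 20,
    "screenshot": "A" * 50_000,  # stands in for a base64-encoded image
    "chat_history": [],
}

full = select_modalities(
    raw_observation,
    {"dom": True, "a11y_tree": True, "screenshot": True, "chat_history": True},
)
text_only = select_modalities(raw_observation, {"dom": True, "a11y_tree": True})

print(payload_bytes(full), payload_bytes(text_only))
```

Running the same task under several configurations and comparing success rate against payload size gives a rough picture of which modalities are worth their cost for a given agent.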
The task versioning system (v1, v2) keeps results comparable over time. Tasks evolve as the SDK adds UI complexity or tightens success criteria, and version pinning ensures reproducibility: a result from amazon_cart_v1 six months ago remains comparable to a run of amazon_cart_v1 today. The leaderboard integration automatically tracks metrics across versions, making it straightforward to measure agent improvements over time.
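When aggregating your own results, it helps to keep versions separate so that scores from different versions of a task are never averaged together. The sketch below assumes task IDs follow the `<name>_v<N>` pattern seen in the examples in this article; `split_task_id` and `mean_reward_by_version` are hypothetical helpers, not part of the SDK.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Assumed naming convention, inferred from IDs such as "amazon_cart_v2".
TASK_ID = re.compile(r"^(?P<name>[a-z0-9_]+)_v(?P<version>\d+)$")


def split_task_id(task_id: str) -> Tuple[str, int]:
    """Split "amazon_cart_v2" into ("amazon_cart", 2)."""
    match = TASK_ID.match(task_id)
    if match is None:
        raise ValueError(f"task id has no version suffix: {task_id!r}")
    return match.group("name"), int(match.group("version"))


def mean_reward_by_version(
    runs: Iterable[Tuple[str, float]]
) -> Dict[Tuple[str, int], float]:
    """Average rewards per (task, version) pair, never across versions."""
    grouped: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for task_id, reward in runs:
        grouped[split_task_id(task_id)].append(reward)
    return {key: mean(values) for key, values in grouped.items()}
```

Keying on the `(name, version)` pair means a tightened success criterion in v2 shows up as a separate row instead of silently lowering the v1 average.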
Parallel evaluation is first-class. The harness supports running multiple environments concurrently, each with isolated browser contexts:
    from agisdk import parallel_evaluate

    results = parallel_evaluate(
        agent=MyAgent(),
        tasks=["amazon_cart_v2", "doordash_order_v2", "airbnb_search_v1"],
        num_workers=4,
        timeout_per_task=300,
    )
This parallelization is critical for research workflows. Evaluating an agent across 11 websites with multiple trials would take hours serially. With parallel execution and headless browsers, you can iterate on agent designs much faster.
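For intuition about what a worker pool with per-task timeouts involves, here is a minimal sketch built on the standard library. It does not reproduce the SDK's implementation: `evaluate_in_parallel` and `run_task` are hypothetical, and `run_task` stands in for whatever launches one environment and returns its result.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, List


def evaluate_in_parallel(
    run_task: Callable[[str], object],
    tasks: List[str],
    num_workers: int = 4,
    timeout_per_task: float = 300.0,
) -> Dict[str, object]:
    """Run run_task(task) for each task on a thread pool and collect outcomes."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {task: pool.submit(run_task, task) for task in tasks}
        for task, future in futures.items():
            try:
                # The timeout counts from this call, not from when the task started.
                results[task] = future.result(timeout=timeout_per_task)
            except FutureTimeout:
                # A running thread cannot be killed; a real harness would also
                # have to close the browser to reclaim its resources.
                results[task] = "timeout"
            except Exception as exc:
                # One failing task should not abort the whole batch.
                results[task] = f"error: {exc}"
    return results
```

Threads are sufficient here on the assumption that most of the time is spent waiting on the browser and the LLM. If each task needs stronger isolation, a process pool is the usual alternative, at the cost of more memory per worker.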
Gotcha
The browser automation stack is heavyweight. Each environment requires a Playwright browser instance with full rendering, which consumes 200-400MB of memory per worker. If you're running parallel evaluations with screenshots enabled, expect to provision significant compute: an 8-core machine with 16GB RAM handles about 6-8 concurrent environments comfortably before memory pressure causes timeouts. This isn't a limitation of AGI SDK specifically; browser-based evaluation is fundamentally more resource-intensive than API-driven benchmarks.
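A back-of-the-envelope calculation can help pick a worker count before a long run. The sketch below is only arithmetic: `max_concurrent_envs` is a hypothetical helper, and the per-environment footprint you pass in should be a figure you measure on your own machine, since it covers more than the browser process (for example, the local app server each task starts).

```python
def max_concurrent_envs(total_mb: int, reserved_mb: int, per_env_mb: int, cores: int) -> int:
    """Estimate how many environments fit, bounded by both memory and CPU cores."""
    if per_env_mb <= 0:
        raise ValueError("per_env_mb must be positive")
    # Memory left after reserving headroom for the OS and the agent process.
    by_memory = max(0, (total_mb - reserved_mb) // per_env_mb)
    # Running more environments than cores tends to slow every one of them down.
    return min(by_memory, cores)


# Illustrative inputs only: 16GB machine, 4GB reserved, 1.5GB measured per environment.
print(max_concurrent_envs(total_mb=16384, reserved_mb=4096, per_env_mb=1500, cores=8))
```

Treat the result as an upper bound and lower the worker count if you see timeouts that disappear when running serially.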
The task set, while high-quality, is finite. You get 11 website clones covering e-commerce, food delivery, travel booking, and social media. For many agent research questions, this is sufficient. But if you're building domain-specific agents—healthcare portals, financial dashboards, enterprise SaaS—you'll need to create custom environments. The SDK doesn't provide tooling for cloning arbitrary websites or generating new tasks, so extending beyond the included benchmarks requires substantial React/Next.js development work. The deterministic seeding and task infrastructure are tightly coupled to the pre-built environments.
Verdict
Use if: You're building or benchmarking AI agents that navigate modern web applications and need reproducible, high-fidelity evaluation environments. The production-quality replicas and standardized harness make this a strong option for research comparisons and leaderboard submissions. It's also well suited to rapid prototyping: the 60-second quickstart isn't marketing, and you genuinely can have an agent running against the Airbnb replica in under a minute. Also use it if you're comparing observation modalities (do agents need screenshots, or is the DOM sufficient?) or action strategies across real-world UI complexity.
Skip if: You need to evaluate beyond browser interactions. Mobile apps, desktop software, CLI tools, and API-driven workflows aren't supported. Also skip it if your agents target specialized domains not covered by consumer web apps, since building custom REAL Bench environments requires significant engineering effort. For simple web scraping or traditional test automation, raw Playwright is more appropriate and much lighter-weight.