AGI SDK: Building a Benchmark Where AI Agents Actually Shop on Amazon (Sort Of)
Hook
Most AI agent benchmarks test whether bots can click buttons on toy websites. AGI SDK went nuclear: they cloned Amazon, DoorDash, Airbnb, and eight other major platforms in full-stack React, then built a harness to watch AI agents try to book vacations and order groceries like actual humans.
Context
The AI agent evaluation problem has become embarrassing. Researchers publish papers claiming their agents can “browse the web” based on tests against localhost forms or academic datasets scraped years ago. Practitioners building real automation face a credibility gap: does my shopping bot work because it’s smart, or because Amazon’s checkout flow happened to match its training data last Tuesday?
AGI SDK emerged from this frustration with non-reproducible benchmarks. Instead of recording real website interactions (which break when sites update) or building toy environments (which don’t capture real complexity), the team took the hard path: pixel-perfect clones of 11 major web applications as deterministic test beds. Each clone is a full React or Next.js application mimicking the real platform’s UI, complete with product catalogs, search functionality, and multi-step workflows. The result is REAL Bench—a leaderboard where agents attempt human-written tasks like “Find the cheapest wireless headphones under $50 and add to cart” against environments that won’t silently change overnight. It’s heavyweight infrastructure for a heavyweight problem: proving your agent can actually complete real-world web tasks without cherry-picked demos.
Technical Insight
AGI SDK’s architecture centers on a harness system that orchestrates the dance between agents, browser environments, and evaluation logic. At its core, it uses Playwright for browser automation—not Selenium’s flaky element selection, but Playwright’s more robust accessibility tree traversal. The observation space is deliberately rich: agents receive the DOM structure, accessibility tree (same data screen readers use), screenshots for vision models, and chat history for conversational context.
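To make the observation space concrete, here's a hedged sketch of the bundle as a small dataclass. The field names follow the comment in the agent example (`dom`, `accessibility_tree`, `screenshot_b64`, `chat`), but the SDK's actual `Observation` class may differ:

```python
from dataclasses import dataclass, field

# Hypothetical sketch; field names mirror the observation comment in the
# agent example, not necessarily the SDK's real class definition.
@dataclass
class Observation:
    dom: str                 # serialized DOM snapshot
    accessibility_tree: str  # semantic tree, as screen readers see it
    screenshot_b64: str      # base64-encoded screenshot for vision models
    chat: list = field(default_factory=list)  # conversational context

obs = Observation(
    dom="<html>...</html>",
    accessibility_tree="button 'Add to Cart'",
    screenshot_b64="",
)
```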
The agent interface is elegantly simple. Your agent inherits from Agent and implements two methods:
```python
from agisdk import Agent, Observation

class MyShoppingAgent(Agent):
    def __init__(self, model="gpt-4"):
        super().__init__()
        self.model = model
        self.history = []

    def act(self, obs: Observation) -> str:
        # obs contains: dom, accessibility_tree, screenshot_b64, chat
        prompt = self._build_prompt(obs)
        self.history.append(prompt)
        # Return function call as string: click("#buy-button")
        response = self.llm_call(prompt)
        return self._parse_action(response)

    def reset(self):
        self.history = []
```
Actions are function calls returned as strings: click("button[aria-label='Add to Cart']"), fill("#search-box", "wireless headphones"), press("Enter"), or scroll("down"). This string-based protocol feels primitive until you realize it’s genius for LLM integration—models naturally output function calls, and the harness validates them against Playwright’s API. No brittle action space enums, no serialization headaches.
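The validation step can be sketched in a few lines. This is an illustrative parser, not the harness's actual implementation; the allowed-action set and helper names are assumptions:

```python
import re

# Assumed action vocabulary, taken from the examples in the text above.
ALLOWED = {"click", "fill", "press", "scroll"}
ACTION_RE = re.compile(r"^(\w+)\((.*)\)$")

def parse_action(raw: str):
    """Parse a string like fill("#q", "headphones") into (name, args)."""
    m = ACTION_RE.match(raw.strip())
    if not m or m.group(1) not in ALLOWED:
        raise ValueError(f"invalid action: {raw!r}")
    name, arg_str = m.group(1), m.group(2)
    # Naive comma split; breaks on commas inside quoted strings,
    # which is fine for a sketch.
    args = [a.strip().strip("'\"") for a in arg_str.split(",")] if arg_str else []
    return name, args

name, args = parse_action('fill("#search-box", "wireless headphones")')
# → ("fill", ["#search-box", "wireless headphones"])
```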
The evaluation loop runs in parallel across tasks. Each task specifies a website (e.g., amazon_clone), a natural language goal (“Add the cheapest laptop under $800 to your cart”), and programmatic success criteria:
```python
task = {
    "site": "amazon_clone",
    "goal": "Find and add cheapest laptop under $800 to cart",
    "evaluator": lambda state: (
        state.cart_items and
        any(item.category == "laptop" and item.price < 800
            for item in state.cart_items)
    ),
}
```
The harness spins up a browser, navigates to the cloned site, feeds observations to your agent for up to N steps (default 30), then runs the evaluator against the final page state. Success rate across task sets determines leaderboard position.
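The loop itself can be sketched as below. The real harness drives Playwright and runs tasks in parallel, so every name here is an illustrative assumption, not the SDK's API; the stubs exist only so the sketch runs end-to-end:

```python
# Sketch of the per-task episode loop described above (assumed names).
def run_episode(agent, env, evaluator, max_steps=30):
    agent.reset()
    obs = env.reset()                  # browser navigates to the cloned site
    for _ in range(max_steps):
        action = agent.act(obs)        # e.g. 'click("#buy-button")'
        obs, done = env.step(action)   # harness executes the action
        if done:
            break
    return bool(evaluator(env.state))  # programmatic success criterion

# Tiny stubs so the loop is demonstrable without a browser:
class StubEnv:
    def __init__(self):
        self.state = {"cart": []}
    def reset(self):
        return "obs:start"
    def step(self, action):
        if action.startswith("click"):
            self.state["cart"].append("laptop")
        return "obs:done", True

class StubAgent:
    def reset(self):
        pass
    def act(self, obs):
        return 'click("#buy")'

ok = run_episode(StubAgent(), StubEnv(), lambda s: len(s["cart"]) > 0)
# → True
```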
What makes this more than a glorified Selenium wrapper is the observation engineering. The accessibility tree is the secret weapon—it’s how blind users navigate the web, and it’s shockingly effective for AI agents. Instead of parsing raw HTML with thousands of irrelevant divs, agents get a semantic tree: “button ‘Add to Cart’, textbox ‘Search’, heading ‘Product Title’”. Combined with screenshots for vision-language models, agents can reason both structurally (“where is the checkout button in the DOM?”) and visually (“does this product image look damaged?”).
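To see why the semantic tree is so much cheaper to reason over than raw HTML, here's a sketch of flattening a nested role/name tree into the kind of lines quoted above. The node schema (`role`, `name`, `children`) is an assumption, not the SDK's actual format:

```python
# Flatten a hypothetical accessibility tree into prompt-friendly lines.
def flatten_ax_tree(node, depth=0, lines=None):
    if lines is None:
        lines = []
    name = node.get("name", "")
    label = f"{node['role']} '{name}'" if name else node["role"]
    lines.append("  " * depth + label)
    for child in node.get("children", []):
        flatten_ax_tree(child, depth + 1, lines)
    return lines

tree = {
    "role": "main",
    "children": [
        {"role": "heading", "name": "Product Title"},
        {"role": "button", "name": "Add to Cart"},
        {"role": "textbox", "name": "Search"},
    ],
}
print("\n".join(flatten_ax_tree(tree)))
```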
The LLM integration layer abstracts OpenAI, Anthropic, and OpenRouter behind a unified interface. You can swap gpt-4 for claude-3-opus without touching agent logic:
```python
from agisdk.llms import get_llm_client

client = get_llm_client("anthropic/claude-3-opus")
response = client.complete(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=500,
)
action = parse_function_call(response.text)
```
The cloned websites themselves live in separate repos—full-stack applications you run locally or deploy to staging. The Amazon clone has a SQLite product catalog seeded with deterministic inventory. DoorDash clone includes restaurant menus and cart logic. This is where the “deterministic” promise gets tested: every agent run sees identical product listings, prices, and UI state. No more “it worked yesterday” debugging because Amazon shuffled their homepage layout.
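The determinism boils down to seeding identical rows on every run. A minimal sketch with a hypothetical schema (the clones' real catalog schema and seed data are not documented here):

```python
import sqlite3

# Fixed seed data: every run sees the same products at the same prices.
SEED_PRODUCTS = [
    ("Wireless Headphones", "electronics", 39.99),
    ("Budget Laptop", "laptop", 749.00),
    ("Gaming Laptop", "laptop", 1299.00),
]

def seed_catalog(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", SEED_PRODUCTS)
    conn.commit()
    return conn

conn = seed_catalog()
cheapest = conn.execute(
    "SELECT name FROM products WHERE category='laptop' AND price < 800 "
    "ORDER BY price LIMIT 1"
).fetchone()
# → ('Budget Laptop',)
```

Because the rows are fixed, an evaluator like "cheapest laptop under $800" has exactly one correct answer on every run.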
Gotcha
The elephant in the room: these are clones, not the real sites. AGI SDK trades realism for reproducibility, and that trade-off bites in subtle ways. The clones capture major UI patterns but miss edge cases—loading states that take 3 seconds on real Amazon but are instant locally, anti-bot CAPTCHAs that would block real agents, A/B tested layouts that confuse agents trained on the clone’s single UI version. An agent that aces REAL Bench might faceplant on production Amazon because it never learned to handle rate limits or dynamic content insertion. The sim-to-real gap is narrower than toy benchmarks, but it’s still a chasm.
Documentation is the other pain point. The README shows a quickstart example, then… ends. How do you integrate custom observations? What’s the full action space? How do you add new websites to the benchmark? You end up reading example code in examples/ to reverse-engineer the patterns. The leaderboard submission process is similarly vague: there’s a mention of uploading results, but no API docs. For a project positioning itself as infrastructure for the agent research community, the “read the source” approach will frustrate teams that want to move fast. The Playwright dependency also means heavyweight setup: browser binaries, headless rendering, and non-trivial CI/CD integration if you want to run evals in your testing pipeline.
Verdict
Use if: You’re building web automation agents (shopping bots, travel assistants, data scrapers) and need a credible benchmark to track progress without fooling yourself with toy tasks. Use if you’re publishing agent research and want reviewers to take your evaluation seriously—REAL Bench’s deterministic environments mean your results are actually reproducible. Use if you’re comparing LLMs for agentic tasks and want standardized web scenarios beyond “answer questions” or “write code”.
Skip if: You need mobile app or desktop automation (this is browser-only), you’re working with real websites where determinism isn’t required (just use Playwright directly), or you want lightweight testing without running full React apps locally. Skip if you need extensive documentation and polish—this is a research tool that expects you to dig into code.
The verdict: AGI SDK is the most serious open-source web agent benchmark available, but you’re paying for that seriousness with complexity and setup overhead.