
TTI: Teaching Web Agents to Learn from Their Own Mistakes at Test Time


Hook

What if your AI agent got smarter not by being larger, but by having more chances to verify its answers during inference? That’s the core insight behind Test-Time Interaction.

Context

Most research into improving AI agents focuses on scaling model parameters or adding more pre-training data. But there’s a third dimension: scaling test-time computation. TTI (Test-Time Interaction) explores this frontier by giving agents multiple attempts to verify their answers during inference, combined with an online reinforcement learning loop that trains only on successful trajectories.

The framework targets web navigation tasks—specifically WebArena and WebVoyager benchmarks—where agents must interact with websites through mouse clicks, form fills, and navigation decisions. These tasks require multimodal reasoning across screenshots and HTML, making them ideal testbeds for the “thinking vs. doing” paradigm. Unlike traditional imitation learning approaches that train on fixed datasets, TTI continuously collects new trajectories from live web environments, filters for successful ones, and updates the policy in a tight loop. This creates a self-improving system where the agent learns from its own successes rather than from human demonstrations alone.

Technical Insight

TTI’s architecture splits cleanly into two phases: rollout collection and policy updates. During rollout collection, the framework uses vLLM for fast parallel inference across multiple environment instances. This is critical because web environments are slow—each action requires rendering, waiting for page loads, and processing visual feedback. By running many environments in parallel, TTI amortizes this latency.
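A minimal sketch of this pattern, with hypothetical `env_step` and `policy` callables standing in for TTI's actual vLLM-backed interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_rollout(env_step, policy, max_steps=5):
    """Roll out one episode; env_step and policy are hypothetical stand-ins."""
    trajectory, obs = [], "initial_page"
    for _ in range(max_steps):
        action = policy(obs)
        obs, done = env_step(action)  # slow: page render + load in a real env
        trajectory.append((obs, action))
        if done:
            break
    return trajectory

def collect_parallel(envs, policy):
    """Run many slow environments concurrently so page-load latency overlaps."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        futures = [pool.submit(collect_rollout, env, policy) for env in envs]
        return [f.result() for f in futures]

# Toy environment: each step returns (observation, done), finishing after 3 steps
def make_env():
    state = {"t": 0}
    def step(action):
        state["t"] += 1
        return f"page_{state['t']}", state["t"] >= 3
    return step

trajectories = collect_parallel([make_env() for _ in range(4)], policy=lambda obs: "click")
```

The point is that wall-clock time is dominated by environment latency, not inference, so overlapping episodes is where the speedup comes from.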

The policy update phase uses DeepSpeed for distributed training across multiple H100 GPUs. But here’s where it gets interesting: TTI doesn’t train on all trajectories. It implements filtered behavioral cloning, training only on trajectories that successfully complete the task. This filtering is controlled through the configuration parameter min_try, which determines how many verification attempts an agent gets before submitting a final answer:

```yaml
# From configuration
# When min_try=2, the agent can re-check its answer
# This enables test-time interaction scaling
min_try: 2  # Number of attempts before final submission

# The agent gets multiple chances to verify
# Failed attempts are filtered out during training
# Only successful multi-step reasoning chains are used for updates
```
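The filtering step itself is simple to sketch. The `success` flag and trajectory shape below are illustrative assumptions, not TTI's actual data format:

```python
def filter_successful(trajectories):
    """Filtered behavioral cloning: keep only trajectories that completed
    the task. Each trajectory is assumed to carry a 'success' flag."""
    return [t for t in trajectories if t["success"]]

rollouts = [
    {"success": True,  "steps": ["click", "type", "submit"]},
    {"success": False, "steps": ["click", "submit"]},
    {"success": True,  "steps": ["scroll", "click", "verify", "submit"]},
]
training_data = filter_successful(rollouts)  # only the 2 successful runs survive
```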

The curriculum learning schedule is defined directly in the training scripts, which alternate between data collection and model updates. You control this in webvoyager_train.sh or webarena_train.sh, where you specify how many rollout-update cycles to run and how many trajectories to collect per cycle. The rollout_size parameter sets the batch size for experience collection, while batch_size controls the training batch size for gradient updates.
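The alternation those scripts implement can be summarized as follows. The `collect` and `update` callables are placeholders for vLLM rollout collection and DeepSpeed training, not the repository's actual entry points:

```python
def train(num_cycles, rollout_size, batch_size, collect, update):
    """Alternate experience collection and gradient updates, in the spirit
    of webvoyager_train.sh / webarena_train.sh."""
    history = []
    for cycle in range(num_cycles):
        rollouts = collect(rollout_size)                   # gather experience
        successes = [r for r in rollouts if r["success"]]  # filtered BC
        for i in range(0, len(successes), batch_size):     # minibatch updates
            update(successes[i:i + batch_size])
        history.append(len(successes))                     # successes per cycle
    return history

# Toy run: every other rollout "succeeds"
log = train(
    num_cycles=2, rollout_size=4, batch_size=2,
    collect=lambda n: [{"success": i % 2 == 0} for i in range(n)],
    update=lambda batch: None,
)
```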

For WebArena specifically, TTI requires a more involved infrastructure setup. Because web environments are stateful and slow to reset, the framework benefits from running multiple WebArena containers in parallel. The repository includes scripts/create_webarena_containers.sh to spawn multiple Docker containers on a single machine, each serving as an independent environment instance. You then configure webarena_host addresses in the config files to point to these containers:

```yaml
# In webarena_rl.yaml
webarena_host:
  - http://localhost:8001
  - http://localhost:8002
  - http://localhost:8003
  - http://localhost:8004
# Each host runs a full WebArena stack
# vLLM distributes rollout collection across all hosts
```
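Distributing episodes across those hosts can be as simple as round-robin assignment. This is a sketch of the idea; TTI's actual scheduling lives in its rollout code:

```python
from itertools import cycle

hosts = [
    "http://localhost:8001",
    "http://localhost:8002",
    "http://localhost:8003",
    "http://localhost:8004",
]

def assign_tasks(task_ids, hosts):
    """Pair each task with an environment host, cycling through the pool."""
    return list(zip(task_ids, cycle(hosts)))

assignments = assign_tasks(range(6), hosts)
# tasks 0-3 land on hosts 8001-8004; tasks 4-5 wrap back to 8001 and 8002
```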

The vision-language integration deserves attention too. TTI works with multimodal models that process both screenshots and HTML structure. The policy_lm configuration parameter specifies the base model, and the framework handles the multimodal prompt construction automatically. During evaluation, you can use the released checkpoints (sjunhongs/tti_webvoyager and sjunhongs/tti_webarena on HuggingFace) which are already fine-tuned through this online RL process.

One architectural decision that stands out: the separation of inference and training infrastructure. vLLM excels at fast batched inference but isn’t designed for gradient updates. DeepSpeed excels at distributed training but has higher latency for inference. By using each tool for its strength, TTI achieves both fast data collection and efficient training. The tradeoff is operational complexity—you need to manage two different frameworks with different GPU memory requirements and parallelization strategies.

The training loop also supports real-time progress tracking through task subsets. You can specify a webvoyager_subset.jsonl or similar file containing a small validation set, and TTI will periodically evaluate on this subset during training. This gives you early signals about whether the policy is improving without waiting for full benchmark evaluation.
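A periodic-evaluation hook of this kind looks roughly like the sketch below; the interval and the success-rate metric are illustrative assumptions:

```python
def train_with_eval(num_cycles, eval_every, train_step, evaluate):
    """Run training cycles, evaluating on a small held-out subset every
    `eval_every` cycles for an early signal on policy quality."""
    scores = {}
    for cycle in range(1, num_cycles + 1):
        train_step()
        if cycle % eval_every == 0:
            # e.g. success rate on tasks from webvoyager_subset.jsonl
            scores[cycle] = evaluate()
    return scores

scores = train_with_eval(
    num_cycles=6, eval_every=2,
    train_step=lambda: None,
    evaluate=lambda: 0.5,
)
```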

Gotcha

The hardware requirements are the first brick wall you’ll hit. The README states “We use at least 4x NVIDIA H100 GPU for both training and evaluation”—this reflects the authors’ setup rather than a strict minimum, though the combination of vLLM for parallel inference and DeepSpeed for distributed training means substantial GPU resources are genuinely required. For researchers without access to high-end hardware, this makes TTI challenging for reproduction or experimentation.

WebArena setup complexity is the second major limitation. The official WebArena benchmark requires spinning up multiple Docker containers (shopping site, forum, knowledge base, etc.), each with their own databases and state. TTI amplifies this complexity by encouraging parallel container deployments for faster data collection. If a single WebArena setup takes an afternoon, configuring four parallel instances with proper networking and resource isolation can consume days. The repository provides create_webarena_containers.sh as a reference, but you’ll need solid Docker and networking knowledge to debug issues.

Documentation has some gaps around hyperparameter tuning and prompt engineering. While the config files are well-structured, there’s minimal guidance on how to adapt TTI to new environments or tasks beyond WebArena/WebVoyager. The prompts directory stores agent prompts, and the README mentions “You can generate new prompts by” but doesn’t complete the sentence, leaving this process undocumented.

Finally, the filtered behavioral cloning approach has an inherent limitation: you can only learn from successful trajectories. Early in training, when the policy is weak, you might collect very few successes, leading to sparse gradient signals. The curriculum learning schedule needs careful tuning to ensure enough successful rollouts for meaningful updates.
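One common mitigation is to gate updates on a minimum number of successful rollouts, so a weak early policy does not trigger noisy updates on tiny batches. This is a sketch of the idea, not a mechanism TTI necessarily exposes:

```python
def should_update(successes, min_successes=8):
    """Skip the gradient step when too few successful rollouts were
    collected, rather than updating on a noisy, tiny batch."""
    return len(successes) >= min_successes

# Early in training: 3 successes out of 64 rollouts -> keep collecting
assert not should_update(successes=[None] * 3)
```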

Verdict

Use TTI if you’re conducting research on web agents with access to multiple high-end GPUs and the engineering resources to manage complex Docker infrastructure. The filtered behavioral cloning approach and test-time interaction scaling represent genuinely novel contributions to agent training, and the released checkpoints provide a strong baseline for WebArena/WebVoyager evaluation. The integration with vLLM and DeepSpeed handles real distributed training challenges effectively.

Skip this if you’re working with consumer hardware, need quick experimentation cycles, want to apply these techniques to non-web domains, or lack the DevOps capacity to debug multi-container WebArena setups.

For most practitioners, using the pre-trained checkpoints for evaluation makes more sense than reproducing the full training pipeline. If your goal is just to understand test-time scaling for agents, read the paper and experiment with simpler min_try variations in your existing agent framework rather than adopting TTI’s full infrastructure.
