
TTI: Training Web Agents That Get Smarter By Learning From Their Own Mistakes


Hook

What if your AI agent could run the same task ten times and only train on the attempts that actually worked? That’s not data augmentation—that’s filtered behavioral cloning, and it’s changing how we think about test-time compute in agent learning.

Context

Web navigation agents have a dirty secret: they’re expensive to train and even more expensive to make reliable. Traditional approaches either rely on massive pre-collected datasets of human demonstrations (which quickly become stale) or use reinforcement learning with reward models that struggle with the sparse, delayed feedback of multi-step web tasks. The result? Agents that work great in demos but fall apart when you need them to book a flight or debug a GitHub issue.

TTI (Test-Time Interaction) takes a different approach rooted in online learning. Instead of pre-collecting a fixed dataset, it puts agents into real web environments—WebArena’s simulated e-commerce sites and social networks, or WebVoyager’s live web tasks—and lets them explore. But here’s the key insight: it only trains on trajectories where the agent actually succeeded. This filtered behavioral cloning approach creates a virtuous cycle where agents generate their own training data, filter out the noise, and iteratively improve their policy. Combined with test-time scaling (letting agents verify answers multiple times), TTI demonstrates that you can trade compute for reasoning quality at inference time, achieving state-of-the-art results on challenging web navigation benchmarks.

Technical Insight

[Figure: system architecture (auto-generated). Three phases form a loop: data collection (a VLM agent, Qwen or similar, served by vLLM interacts with WebArena Docker environments, producing trajectories of screenshot + HTML observations, click/typing/navigation actions, and task outcomes), filtering (successful trajectories kept, failures discarded), and training (DeepSpeed updates the VLM weights for the next iteration). At inference, test-time scaling runs multiple attempts (min_try=3-10) with answer verification, retrying on incorrect answers.]

The architecture is a three-stage loop that runs over multiple iterations. First, agents interact with web environments using vLLM for fast parallel inference, collecting trajectories of observations (screenshots and HTML), actions (clicks, typing, navigation), and task outcomes. Second, the system filters these trajectories based on success—if an agent completed a task correctly, that trajectory becomes training data; if it failed, it’s discarded. Third, the vision-language model (typically a fine-tuned Qwen or similar VLM) gets updated using DeepSpeed’s distributed training on the filtered data, then the cycle repeats.
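The three-stage loop can be sketched end to end. This is an illustrative skeleton, not TTI's actual API: `collect` and `train` are hypothetical stand-ins for the rollout and fine-tuning machinery described above.

```python
# Illustrative sketch of the TTI outer loop; `collect` and `train` are
# hypothetical callables standing in for the real rollout/training code.
def run_tti(collect, train, policy, num_iterations=3):
    """collect(policy) -> list of trajectory dicts with a 'success' flag;
    train(policy, data) -> updated policy."""
    for _ in range(num_iterations):
        rollouts = collect(policy)                         # Stage 1: explore
        successes = [t for t in rollouts if t["success"]]  # Stage 2: filter
        policy = train(policy, successes)                  # Stage 3: fine-tune
    return policy
```

The point of the structure is that each iteration's training data is produced by the previous iteration's policy, so data quality and policy quality improve together.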

The test-time scaling mechanism is elegantly simple but powerful. Rather than treating each task as a single-shot problem, TTI allows agents to make multiple attempts and verify their answers. Here’s how you’d configure it in the evaluation pipeline:

# From the TTI evaluation configuration
eval_config = {
    'min_try': 3,  # Agent gets 3 attempts per task
    'verify_answer': True,  # Enable answer verification
    'rollout_budget': 15,  # Maximum actions per attempt
    'temperature': 0.7  # Sampling temperature for exploration
}

# The agent will try multiple reasoning paths
for attempt in range(eval_config['min_try']):
    trajectory = agent.rollout(task, max_steps=eval_config['rollout_budget'])
    if verify_success(trajectory, task.ground_truth):
        break  # Stop early if we succeed
    # Otherwise, continue with next attempt

This “min_try” parameter creates a thinking vs. doing tradeoff—you’re spending more inference compute (multiple rollouts) to achieve higher success rates without retraining. In TTI’s experiments, going from 1 to 10 attempts improved WebArena success rates by 15-20 percentage points, demonstrating that test-time compute scales effectively even after training.
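A back-of-the-envelope model (my simplification, not TTI's analysis) makes the tradeoff concrete: if each attempt succeeded independently with probability p, at least one of k attempts would succeed with probability 1 - (1 - p)^k.

```python
# Toy model: independent attempts with per-attempt success probability p.
def success_at_k(p: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

# e.g. a 30%-reliable agent: success_at_k(0.3, 1) = 0.30,
# while success_at_k(0.3, 10) = 1 - 0.7**10, roughly 0.97
```

In practice attempts from the same model on the same task are correlated, so real gains flatten sooner than this model predicts, which is consistent with the diminishing returns TTI observes at higher attempt counts.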

The data collection infrastructure is designed for scale. TTI includes a reference script for spinning up multiple WebArena Docker containers in parallel, allowing you to collect hundreds of trajectories simultaneously. During training iterations, vLLM handles batched inference across these environments:

# Simplified version of parallel rollout collection
from vllm import LLM, SamplingParams

llm = LLM(model="checkpoints/iteration_2",
          tensor_parallel_size=4,
          max_num_seqs=128)  # High batch size for throughput
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

def collect_trajectories(env_pool, num_trajectories=1000):
    trajectories = []

    # Batch inference across all environments
    while len(trajectories) < num_trajectories:
        observations = [env.get_observation() for env in env_pool]
        outputs = llm.generate(observations, sampling_params)
        actions = [out.outputs[0].text for out in outputs]

        for env, action in zip(env_pool, actions):
            obs, reward, done, info = env.step(action)
            if done:
                if info['success']:  # Filter: only keep successful trajectories
                    trajectories.append(env.get_trajectory())
                env.reset()  # Start new episode

    return trajectories

The filtering criterion is critical. Unlike behavioral cloning that blindly imitates all demonstrations, TTI only trains on trajectories where info['success'] == True. This means the agent never learns from its failures directly—instead, it explores until it finds working solutions, then reinforces those behaviors. Over multiple iterations, the agent’s policy improves, making successful trajectories more likely, which in turn provides better training data for the next iteration.
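Concretely, turning kept trajectories into supervised data is just behavioral cloning restricted to successes. This sketch assumes a hypothetical trajectory schema (the `success` and `steps` field names are mine, not TTI's) and flattens each surviving trajectory into (observation, action) training pairs:

```python
# Hypothetical schema: {'success': bool, 'steps': [{'obs': ..., 'action': ...}]}
def to_training_pairs(trajectories):
    """Filtered behavioral cloning: only successes become training targets."""
    pairs = []
    for traj in trajectories:
        if not traj["success"]:
            continue  # failures contribute nothing -- the core TTI filter
        for step in traj["steps"]:
            pairs.append((step["obs"], step["action"]))
    return pairs
```

Every step of a successful trajectory is treated as a positive example, even steps that were detours; the bet is that outcome-level filtering is a good enough proxy for step-level quality.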

For training, TTI uses DeepSpeed with specific configurations for vision-language models. The setup assumes you have at least 4x H100 GPUs (80GB each), using ZeRO stage 2 for optimizer state partitioning. Training happens with LoRA adapters rather than full fine-tuning to keep memory manageable:

# DeepSpeed config for VLM training (simplified)
ds_config = {
    "train_batch_size": 128,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
}

# LoRA config for parameter-efficient training
lora_config = {
    "r": 64,  # LoRA rank
    "lora_alpha": 128,
    "target_modules": ["q_proj", "v_proj", "visual_proj"],
    "lora_dropout": 0.05,
}
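To see why LoRA keeps memory manageable, a rough parameter count helps (the 4096-dimensional projection is an illustrative size, not the actual model's): rank-64 adapters add two low-rank factors per target matrix instead of training the full weight.

```python
# Rough arithmetic: LoRA adds factors A (r x d_in) and B (d_out x r)
# per target matrix, versus training the full d_out x d_in weight.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * d_in + d_out * r

def full_params(d_in: int, d_out: int) -> int:
    return d_out * d_in

# e.g. a 4096x4096 projection with r=64 (illustrative dims):
# lora_params(4096, 4096, 64) = 524_288 vs full_params = 16_777_216,
# i.e. about 3% of the weights are trainable per targeted matrix
```

Optimizer state shrinks proportionally, which is what makes the 4x H100 setup workable alongside ZeRO stage 2 and CPU optimizer offload.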

The repository includes pre-trained checkpoints at various iteration stages, so you can either start from scratch or jump in mid-training to continue improving from their published results. The iterative nature means you don’t need to commit to a full training run—you can stop after a few iterations and evaluate whether the performance gains justify the compute cost for your use case.

Gotcha

The compute requirements are brutal. The documentation recommends “minimum 4x H100 GPUs,” but that’s not a suggestion—it’s a hard requirement for any reasonable training speed. Even with that hardware, a single training iteration (collect 1000 trajectories, filter, fine-tune) takes 8-12 hours. If you’re at a startup or academic lab without access to this tier of infrastructure, TTI is effectively a non-starter for training from scratch. You can use the pre-trained checkpoints for inference, but that limits you to their model choices and training decisions.

The framework is also tightly coupled to web navigation. The observation space assumes screenshots plus HTML accessibility trees, the action space is web-specific (click coordinates, form inputs, navigation), and the evaluation harnesses are built around WebArena and WebVoyager. If you want to apply filtered behavioral cloning to robotics, code generation, or other agent domains, you’ll need to significantly refactor the environment interfaces and data collection pipeline. This isn’t a general-purpose RL framework—it’s a specialized system for web agents that happens to use RL techniques.

Verdict

Use TTI if: You’re actively researching web navigation agents with access to serious GPU infrastructure (4+ H100s or equivalent), and you want to explore the intersection of test-time compute scaling and online learning. The pre-trained checkpoints alone make it valuable for reproducing state-of-the-art results on WebArena/WebVoyager, and the full training pipeline is production-ready if you have the hardware. The filtered behavioral cloning approach is also conceptually interesting if you’re designing your own agent training systems—it’s a clean example of how to do online RL without explicit reward modeling.

Skip TTI if: You’re working with limited compute (anything less than 4x A100s will be painful), need agents for non-web domains, or want to quickly prototype agent ideas. The heavy infrastructure requirements and narrow domain focus make it a poor choice for exploration or resource-constrained research. For lightweight prototyping, stick with prompting-based methods like ReAct or use LangChain with existing LLM APIs. For other agent domains, look at more general frameworks like AgentTuning that aren’t welded to web navigation primitives.
