
WebRL: Teaching Language Models to Navigate the Web Through Self-Evolving Curriculum Learning

Hook

Training a web agent typically requires thousands of hand-labeled demonstrations. WebRL flips this model entirely: it generates its own increasingly difficult tasks and learns from its mistakes through reinforcement learning. The same techniques underpin deployed systems like AutoGLM.

Context

Web automation has been the holy grail of AI assistance for decades—imagine an agent that could book flights, fill forms, or research products by actually navigating websites like a human. Traditional approaches hit a wall: supervised learning from demonstrations works initially but plateaus quickly, requiring endless human-labeled examples for every new scenario. Reinforcement learning seemed promising but suffered from sparse rewards (you only know if you succeeded at the very end) and the absence of curriculum—agents floundered on hard tasks before mastering basics.

WebRL, developed by THUDM with techniques adopted in their AutoGLM agents, addresses both problems with a self-evolving online curriculum. Instead of pre-defining thousands of training tasks, WebRL starts with a base set and automatically generates harder variations as the agent improves. It combines this with an Outcome-supervised Reward Model (ORM) that provides dense feedback signals throughout task execution. The result: agents trained on WebArena-Lite that transfer to real-world phone and browser automation, released in three scales (8B, 9B, and 70B parameters) for different computational budgets.

Technical Insight

Training Loop

System architecture (auto-generated diagram). The training loop wires together:

- Actor: the base LLM being trained (GLM-4 or LLaMA-3.1); emits action trajectories into the environments and receives interaction results
- WebArena-Lite environments: execute the agent's actions
- ORM (Outcome-supervised Reward Model): attaches reward labels to collected trajectories
- Critic model: provides value estimates for policy optimization
- Replay buffer: perplexity-filtered store that turns past experiences into training data
- Self-evolving curriculum generator: reads performance metrics and emits new, harder tasks

WebRL’s architecture centers on three coordinated components working in a training loop: an actor (the web agent being trained), a critic (for value estimation), and an ORM (for reward labeling). The actor starts from a supervised fine-tuning baseline, which you train with LLaMA-Factory on the provided data before entering the RL loop.
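
As a rough illustration, the SFT warm-up with LLaMA-Factory might be configured along these lines (the dataset name, paths, and hyperparameters are placeholders, not the repository's actual settings):

```yaml
# Hypothetical LLaMA-Factory SFT config for the WebRL warm-up stage.
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
stage: sft
finetuning_type: full
dataset: webrl_sft_demos        # placeholder name for the provided SFT data
template: llama3
output_dir: ./models/sft_baseline
per_device_train_batch_size: 2
learning_rate: 1.0e-5
num_train_epochs: 3
```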

The self-evolving curriculum mechanism is WebRL’s standout innovation. After each training phase, the system analyzes which tasks the agent handles well and generates variations with increased difficulty. This happens through prompt-based task generation that creates new instructions targeting the current capability frontier. For instance, if an agent masters “Find products under $50”, the curriculum generates “Find products under $50 with free shipping and 4+ star reviews”. The task configuration follows WebArena’s format:

{
  "sites": ["shopping"],
  "task_id": 1042,
  "require_login": true,
  "storage_state": "./.auth/shopping_admin_state.json",
  "start_url": "__SHOPPING__",
  "intent": "Find wireless headphones under $50 with Prime shipping",
  "eval": {
    "eval_types": ["string_match"],
    "reference_answers": {
      "exact_match": "N/A"
    }
  }
}
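
The prompt-based generation step can be sketched as follows; this is an illustrative mock of the idea, not the repository's gen_task.py (the prompt wording and the stubbed LLM are assumptions):

```python
# Illustrative sketch of prompt-based curriculum generation (not WebRL's
# actual gen_task.py): build a prompt from tasks the agent already solves
# and ask an LLM for harder variants. The LLM call is stubbed here.

def build_curriculum_prompt(solved_tasks: list[str], n_variants: int = 3) -> str:
    """Assemble an instruction asking for harder variants of mastered tasks."""
    examples = "\n".join(f"- {t}" for t in solved_tasks)
    return (
        "The agent reliably completes these web tasks:\n"
        f"{examples}\n"
        f"Write {n_variants} new tasks that add one extra constraint each "
        "(filters, multi-step navigation, cross-page comparison)."
    )

def generate_harder_tasks(solved_tasks, llm=None):
    prompt = build_curriculum_prompt(solved_tasks)
    if llm is None:
        # Stub: append a constraint instead of calling a real model.
        return [t + " with free shipping and 4+ star reviews" for t in solved_tasks]
    return llm(prompt)

harder = generate_harder_tasks(["Find products under $50"])
print(harder[0])  # Find products under $50 with free shipping and 4+ star reviews
```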

The ORM addresses RL’s sparse reward problem by evaluating trajectories at each step rather than just final outcomes. During interaction, the agent produces action traces in WebArena-Lite environments (shopping sites, GitLab, Reddit, maps). Each trajectory gets processed and scored by the ORM checkpoint (released as webrl-orm-llama-3.1-8b), which learned to predict task success from thousands of labeled examples. This dense feedback signal is critical—the agent learns which specific actions led toward success or failure, not just the final result.
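
One way to picture the dense signal is to score successive trajectory prefixes with an outcome model and credit each action with the change in estimated success; this is a simplified sketch, not the released ORM's actual interface:

```python
# Sketch of deriving per-step feedback from an outcome model (illustrative,
# not the released ORM checkpoint's API): score each trajectory prefix with
# a success-probability estimator, then credit each action with the change
# in estimated success it caused.

def per_step_rewards(trajectory, orm_score):
    """orm_score(prefix) -> estimated P(success) for a partial trajectory."""
    rewards = []
    prev = orm_score([])
    for i in range(1, len(trajectory) + 1):
        cur = orm_score(trajectory[:i])
        rewards.append(cur - prev)  # positive if this action helped
        prev = cur
    return rewards

# Stub ORM: each "good" action raises the success estimate by 0.3.
stub_orm = lambda prefix: min(1.0, 0.1 + 0.3 * sum(a == "good" for a in prefix))

print(per_step_rewards(["good", "bad", "good"], stub_orm))
```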

The experience replay mechanism prevents catastrophic forgetting through perplexity-based filtering. WebRL maintains a buffer of historical trajectories from previous training phases, but naively mixing old and new data causes distribution shift problems. The solution: filter historical experiences by computing perplexity under the current actor model. Low perplexity indicates the trajectory is still representative of the agent’s learned policy; high perplexity suggests the data is stale. The processing script handles this:

python scripts/process_data.py \
  --stage 2 \
  --add_reward \
  --rollout_path ./traces/phase3 \
  --experience_paths phase1_processed.pt phase2_processed.pt \
  --orm_path ./models/webrl-orm-llama-3.1-8b \
  --actor_path ./models/current_actor \
  --output_path ./training_data/phase3_filtered

This command produces two outputs: the newly processed trajectories and a filtered dataset that folds in still-relevant historical experiences. The --stage 2 flag activates perplexity filtering, while --add_reward applies ORM labeling. The result is training data that balances new exploration with retention of past skills.
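
The filtering idea itself is simple. Here is a minimal sketch, assuming perplexity is computed as the exponentiated mean negative log-likelihood under the current actor (the threshold and stub scores below are made up):

```python
import math

# Sketch of perplexity-based experience filtering (assumed behavior of the
# stage-2 pass, simplified): keep a historical trajectory only if the
# current actor still assigns it low perplexity, i.e. it remains close to
# the learned policy.

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over a trajectory's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_experiences(buffer, logprobs_under_actor, max_ppl=2.0):
    """buffer: trajectories; logprobs_under_actor(traj) -> per-token logprobs."""
    return [t for t in buffer if perplexity(logprobs_under_actor(t)) <= max_ppl]

# Stub: pretend the actor scores one still-representative and one stale trace.
scores = {"fresh": [-0.2, -0.1, -0.3], "stale": [-2.5, -3.0, -2.8]}
kept = filter_experiences(["fresh", "stale"], lambda t: scores[t])
print(kept)  # only the low-perplexity trajectory survives
```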

The training loop alternates between collection and optimization. Using run_multinode.sh, you train the actor-critic pair on processed trajectories using standard RL objectives (policy gradient for the actor, TD-learning for the critic). Between phases, you generate new curriculum tasks with gen_task.py, interact with WebArena to collect trajectories, process them with reward labels and filtered experiences, then continue training. This cycle repeats until performance saturates on the target benchmark.
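
The cycle above can be sketched as a driver loop; the stage functions here are stubs standing in for gen_task.py, the WebArena-Lite runner, process_data.py, and run_multinode.sh:

```python
# Illustrative driver for the phase cycle (the real pipeline shells out to
# the repository's scripts; everything here is a stand-in stub).

def run_phase(phase, tasks, generate, collect, process, train, evaluate):
    trajectories = collect(tasks)          # interact with WebArena-Lite
    data = process(trajectories, phase)    # ORM labels + perplexity filter
    train(data)                            # actor-critic update
    score = evaluate()
    new_tasks = generate(tasks, score)     # self-evolving curriculum
    return new_tasks, score

def training_loop(seed_tasks, stages, max_phases=10, patience=1e-3):
    tasks, prev = seed_tasks, 0.0
    for phase in range(max_phases):
        tasks, score = run_phase(phase, tasks, **stages)
        if abs(score - prev) < patience:   # performance saturated
            break
        prev = score
    return prev

# Demo with trivial stubs: the benchmark score improves, then saturates.
scores = iter([0.3, 0.5, 0.6, 0.6, 0.6])
stages = dict(
    generate=lambda tasks, s: tasks + [f"harder-{len(tasks)}"],
    collect=lambda tasks: [("traj", t) for t in tasks],
    process=lambda trajs, p: trajs,
    train=lambda data: None,
    evaluate=lambda: next(scores),
)
final = training_loop(["seed"], stages)
print(final)
```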

The architecture’s elegance lies in its modularity—each component (curriculum generation, interaction, reward modeling, experience filtering) operates independently but composes into a self-improving system. The released checkpoints (webrl-glm-4-9b, webrl-llama-3.1-8b, webrl-llama-3.1-70b) represent different points in the model scale versus inference cost tradeoff, all trained through this same pipeline.

Gotcha

WebRL’s power comes with significant operational complexity. The multi-stage pipeline requires orchestrating environment instances, multiple models, and data processing between each training phase. You’ll run interactions in WebArena-Lite (which itself needs Docker containers for shopping sites, GitLab instances, etc.), process traces through the ORM, compute perplexities with the current actor, filter experiences, then finally train updated models. Each phase transition involves manual checkpointing decisions and data path management. The repository provides scripts but not automation—you’re responsible for the orchestration.

The tight coupling to WebArena environments is another limitation. While the paper mentions AutoGLM deployment to real websites and phone interfaces, the released code targets WebArena-Lite exclusively. Adapting to new environments means rebuilding the interaction layer, evaluation harness, and likely retraining the ORM on environment-specific success signals. The task configuration format assumes WebArena’s specific site structure (shopping_admin, gitlab, reddit, etc.), and extending beyond these requires non-trivial engineering.

Computational requirements are substantial. You’re simultaneously running actor inference for trajectory collection, critic training, ORM inference for reward labeling, and perplexity computation for experience filtering. The paper’s results used their released 8B-70B models, but training from scratch at these scales demands multi-GPU infrastructure. The run_multinode.sh script suggests distributed training is expected, not optional. For researchers or teams without access to significant compute, experimentation will be constrained to smaller-scale replications or using the pre-trained checkpoints for inference only.

Verdict

Use WebRL if you’re researching curriculum learning for embodied agents, have infrastructure to run WebArena environments and multi-model RL training, and need agents that improve beyond supervised learning’s ceiling. The self-evolving curriculum and experience replay mechanisms represent genuine innovations in agent training methodology, and the released checkpoints provide strong baselines for web agent benchmarks. It’s particularly valuable if you’re building on the AutoGLM architecture or can adapt the WebArena interaction layer to your target environments.

Skip it if you need quick deployment to arbitrary websites without environment setup, lack multi-GPU training infrastructure, or prefer simpler approaches like prompting GPT-4 with web screenshots (SeeAct-style). Also skip it if you’re building narrow automation for specific sites: traditional supervised learning or even rule-based systems are simpler and more reliable when the task distribution is fixed.

WebRL shines when you need agents that generalize across diverse web tasks and improve continuously, not when you’re automating a single workflow.
