
OpenManus-RL: Bringing Reinforcement Learning to LLM Agent Training Through Live-Stream Development


Hook

While DeepSeek-R1 and QwQ-32B proved that RL can transform reasoning models, a crucial question remained unanswered: could the same techniques work for interactive agents that need to use tools and navigate environments? OpenManus-RL is testing that hypothesis in real time.

Context

The success of reasoning models like DeepSeek-R1 and QwQ-32B demonstrated that reinforcement learning can unlock latent capabilities in large language models through extended thinking and chain-of-thought optimization. But these models operate in isolation—they reason, they output, they’re done. Real-world LLM agents face a fundamentally different challenge: they must interact with environments, use tools, recover from errors, and adapt their strategies based on feedback from external systems like web browsers, databases, or APIs.

Traditional agent training relies heavily on supervised fine-tuning (SFT) with curated trajectories—essentially teaching agents by example. While this works for common patterns, it struggles with exploration, error recovery, and optimization for task success rather than behavioral mimicry. OpenManus-RL, a collaborative project between UIUC’s Ulab and MetaGPT, applies the RL paradigm to agent tuning. The project adopts a “live-stream development” philosophy: progress, datasets, roadmaps, and results are shared openly as they emerge, inviting community participation rather than waiting for a polished final product. This approach acknowledges that RL-based agent tuning is still an open research problem where collective experimentation may accelerate discovery.

Technical Insight

[Figure: auto-generated system architecture diagram. Interactive environments (WebShop, GAIA, OSWorld) supply environment interactions; rollout strategies (MCTS, ToT, GoT, DFSDT) drive structured exploration; trajectory collection records agent actions and states; a reward model scores trajectories into reward signals; and the VERL training framework consumes this experience data in an RL optimization loop (PPO/GRPO/DPO/PRM), yielding a fine-tuned LLM agent with improved reasoning and an improved policy.]

OpenManus-RL’s architecture centers on collecting agent trajectories from interactive environments, then using those trajectories with reward signals to optimize agent behavior through post-training algorithms. The framework integrates VERL (Volcano Engine Reinforcement Learning, ByteDance’s open-source RL training library for LLMs) as a submodule to handle the actual RL training loop, while the OpenManus-RL codebase focuses on trajectory collection, environment integration, and rollout strategy implementation.

The rollout strategy layer is where things get interesting. Rather than using simple sequential action selection, OpenManus-RL explores structured reasoning approaches including Tree-of-Thoughts (ToT), Graph-of-Thoughts (GoT), and DFSDT (Depth-First Search Decision Trees). These strategies allow agents to explore multiple reasoning paths before committing to actions—critical for environments where a single wrong tool call can derail an entire task. The team is also experimenting with different action formats: ReAct-style reasoning (alternating thought and action steps) versus outcome-based approaches where the agent plans more holistically before execution.
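As a concrete illustration, a DFSDT-style rollout can be sketched as a bounded depth-first search over candidate actions. The interfaces below (`propose_actions`, `step`, `score`, `is_terminal`) are hypothetical stand-ins for an LLM policy and an environment, not OpenManus-RL's actual API:

```python
# Minimal DFSDT-style rollout sketch. All callables are assumed interfaces:
#   propose_actions(state) -> ranked list of candidate actions (e.g. from an LLM)
#   step(state, action)    -> next state (environment transition)
#   score(state)           -> scalar estimate of how good a leaf state is
#   is_terminal(state)     -> whether the episode has ended
def dfs_rollout(state, propose_actions, step, score, is_terminal,
                depth=0, max_depth=3, branch=2):
    """Depth-first search over candidate actions; returns (best_score, best_path)."""
    if is_terminal(state) or depth == max_depth:
        return score(state), []
    best = (float("-inf"), [])
    # Expand only the top-`branch` candidates, backtracking out of dead ends
    # instead of committing to the first plausible action.
    for action in propose_actions(state)[:branch]:
        child_score, child_path = dfs_rollout(
            step(state, action), propose_actions, step, score, is_terminal,
            depth + 1, max_depth, branch)
        if child_score > best[0]:
            best = (child_score, [action] + child_path)
    return best
```

The backtracking is the point: a single bad tool call never commits the agent, because the search can abandon that branch and return to a sibling candidate.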

For training data, OpenManus-RL has released an agent SFT dataset on HuggingFace that captures trajectories from environments like WebShop (e-commerce navigation), GAIA (general assistant tasks), OSWorld (operating system interaction), and AgentBench (multi-domain agent challenges). The dataset structure includes not just successful trajectories but also exploration paths, failed attempts, and intermediate states—exactly what’s needed for RL algorithms to learn from trial and error rather than just imitation.
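To make that description concrete, here is a hypothetical trajectory record in the spirit of what the dataset captures: per-step thought/action/observation tuples plus a success flag, so failed attempts remain usable for RL. The field names are illustrative assumptions, not the released dataset's actual schema:

```python
# Hypothetical trajectory record; the field names are illustrative, not the
# dataset's real schema.
trajectory = {
    "environment": "WebShop",
    "task": "Buy a red t-shirt under $20",
    "steps": [
        {"thought": "Search for red t-shirts first.",
         "action": "search[red t-shirt]",
         "observation": "10 results shown"},
        {"thought": "The first result exceeds the budget; try the second.",
         "action": "click[item 2]",
         "observation": "Product page loaded"},
    ],
    "success": False,       # failed attempts are kept for RL, not discarded
    "final_reward": 0.0,
}

def to_sft_text(traj):
    """Flatten a trajectory into ReAct-style text for supervised fine-tuning."""
    lines = [f"Task: {traj['task']}"]
    for s in traj["steps"]:
        lines += [f"Thought: {s['thought']}",
                  f"Action: {s['action']}",
                  f"Observation: {s['observation']}"]
    return "\n".join(lines)
```

The same record serves both regimes: SFT consumes the flattened text, while RL algorithms can additionally exploit the `success` flag and `final_reward` that imitation learning ignores.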

The post-training algorithm selection is deliberately diverse. The team is exploring PPO (Proximal Policy Optimization) for stable on-policy learning, GRPO (expanded in the README as Generalized Reward-based Policy Optimization, though the acronym more commonly denotes Group Relative Policy Optimization) for format-based and outcome-based rewards, DPO (Direct Preference Optimization) for learning from comparative feedback, and PRM (Process Reward Models, listed as Preference-based Reward Modeling in the README) for intermediate step validation. This multi-algorithm approach reflects the project’s experimental nature: there’s no consensus yet on which optimization strategy works best for interactive agents versus reasoning models.
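Of these, GRPO is the simplest to sketch: instead of a learned value baseline, it samples a group of rollouts per task and normalizes each rollout's reward against the group statistics. The snippet below shows only that group-relative advantage step, omitting the clipped policy-ratio objective and KL regularization of the full algorithm:

```python
# Group-relative advantage computation, the core of GRPO (simplified sketch).
# Rollouts with above-group-average reward get positive advantage and are
# reinforced; below-average rollouts are pushed down.
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline comes from sibling rollouts rather than a critic network, this style of update avoids training a separate value model, which is part of its appeal for compute-constrained agent tuning.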

A particularly nuanced aspect is the reward model design. Unlike reasoning tasks where correctness can often be verified deterministically, agent tasks require balancing multiple objectives: task completion, efficiency, tool usage appropriateness, and recovery from errors. The README mentions exploring both format-based rewards (did the agent follow proper reasoning structure?) and outcome-based rewards (did it succeed?), using GRPO to combine these signals. This dual-reward approach acknowledges that blindly optimizing for task success might produce brittle agents that work in narrow conditions but fail to generalize.
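A minimal sketch of this dual-reward idea, assuming a ReAct-style Thought/Action format check and a fixed mixing weight; both the regex and the 0.2 weight are illustrative assumptions, not the project's actual reward function:

```python
import re

# Crude structural check: does the response contain a Thought line followed by
# an Action line? (Illustrative; a real format reward would be stricter.)
REACT_PATTERN = re.compile(r"Thought:.*\nAction:.*", re.DOTALL)

def combined_reward(response: str, task_succeeded: bool, w_format=0.2) -> float:
    """Blend a format-based reward with an outcome-based reward."""
    format_r = 1.0 if REACT_PATTERN.search(response) else 0.0
    outcome_r = 1.0 if task_succeeded else 0.0
    return w_format * format_r + (1 - w_format) * outcome_r
```

Keeping a nonzero format term means an agent that reasons properly but fails the task still receives some signal, which can stabilize early training when outright successes are rare.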

The integration with specialized reasoning models (GPT-O1, DeepSeek-R1, QwQ-32B) serves a specific purpose: using these models to generate high-quality initial trajectories that demonstrate advanced planning and reasoning. These trajectories seed the dataset with examples of extended deliberation before action—a pattern that SFT alone struggles to instill but which RL can potentially reinforce through reward shaping. The hypothesis is that by showing the agent what careful reasoning looks like, then rewarding successful outcomes, the model will learn when to think longer versus when to act quickly.

Gotcha

OpenManus-RL is currently more roadmap than reality. The project was announced on March 6, 2025, with a dataset release on March 9, but concrete model releases, benchmark results, and reproducible training recipes remain absent. The README extensively discusses methods, strategies, and evaluation plans—but provides no empirical evidence that any of these approaches actually work. If you’re evaluating this for immediate research use, you’re essentially signing up to be an early co-developer rather than a user of proven technology.

The heavy dependency on VERL as a submodule introduces potential technical constraints, though the README only notes it has been “integrated for enhanced RL training capabilities” without detailing specific limitations. The exact structure of how VERL dictates agent loops, environment integration, and reward signal provision remains undocumented, making it difficult to assess compatibility with specific use cases before diving in. Additionally, the reliance on proprietary models like GPT-O1 for trajectory generation means full reproducibility requires either API access and budget or substituting with open alternatives (DeepSeek-R1, QwQ-32B) that may yield different quality trajectories. The README also lacks basic operational details: What are the computational requirements for RL training? How long does trajectory collection take? What’s the sample efficiency of different algorithms? These practical questions remain unanswered, making resource planning difficult.

Verdict

Use OpenManus-RL if you’re a researcher actively working on RL-based agent tuning who values early access to curated trajectory datasets and wants to contribute to an evolving open framework. The dataset alone provides value for studying agent behavior patterns across multiple benchmarks, even if you don’t use their training pipeline. It’s also worth exploring if you’re specifically interested in comparing rollout strategies (ToT, GoT, DFSDT) for agent planning and need a structured starting point. Skip it if you need production-ready agent frameworks with proven benchmarks, stable APIs, and clear performance characteristics—this project is explicitly in early development with no published results yet. Also skip if you’re resource-constrained: RL training is compute-intensive, and without published efficiency metrics, you risk burning GPU hours on exploratory experiments. This project is for pioneers comfortable navigating incomplete documentation and contributing back to the community, not practitioners needing battle-tested solutions for deployment.
