RF-Agent: Using Monte Carlo Tree Search to Generate Reward Functions for Reinforcement Learning
Hook
What if designing reward functions for reinforcement learning could be framed as a tree search problem, where an LLM navigates through possibilities like a chess AI plans moves?
Context
Anyone who’s worked with reinforcement learning knows the reward function is everything—it’s the objective that shapes agent behavior, and getting it wrong means your robot learns to game the system rather than accomplish the task. For complex continuous control tasks like robotic manipulation or locomotion, crafting dense reward functions requires deep domain expertise and iterative refinement. Recent work like Eureka demonstrated that Large Language Models could generate reward code directly from task descriptions, but these approaches often struggle to efficiently utilize historical feedback during the search process.
RF-Agent, developed by researchers at Beihang University and accepted as a NeurIPS 2025 Spotlight paper, takes a fundamentally different approach. Instead of treating reward generation as a one-off coding task, it frames reward function design as sequential decision-making and applies Monte Carlo Tree Search—the same algorithm that powered AlphaGo—to navigate the space of possible reward formulations. The framework treats GPT-4o-mini (the recommended model) as a language agent that proposes modifications, learns from training outcomes across diverse IsaacGym and Bi-DexHands tasks, and builds a search tree of reward function variants. This architectural shift aims to enable more effective utilization of historical information and improved search efficiency compared to greedy or evolutionary baselines.
Technical Insight
The core innovation in RF-Agent is reframing reward engineering from optimization to planning. Traditional approaches like Eureka prompt an LLM to generate complete reward functions, evaluate them, then generate new candidates based on feedback. RF-Agent instead decomposes this into a sequence of modification decisions, where each node in an MCTS tree represents a reward function state, and edges represent refinements proposed by the LLM.
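Concretely, each node needs to carry its candidate reward code plus the visit statistics MCTS relies on. A minimal sketch of such a node (class and field names are illustrative, not taken from the RF-Agent codebase):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RewardNode:
    """One node in the search tree: a candidate reward function plus MCTS stats."""
    reward_code: str                      # Python source of this reward formulation
    parent: Optional["RewardNode"] = None
    children: list = field(default_factory=list)
    visits: int = 0                       # how many times selection passed through here
    value_sum: float = 0.0                # accumulated training-performance scores

    def mean_value(self) -> float:
        """Average score observed at or below this node (the exploitation term)."""
        return self.value_sum / self.visits if self.visits else 0.0
```

An edge in this picture is simply a parent-to-child link created when the LLM proposes a refinement of the parent's reward code.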
The MCTS cycle follows the classic selection-expansion-simulation-backpropagation pattern. During selection, the algorithm traverses the tree using Upper Confidence Bounds applied to Trees (UCT) to balance exploitation of promising reward functions with exploration of untested modifications. When it reaches a leaf node, the expansion phase prompts the LLM to propose new reward function modifications based on the current code and accumulated training feedback. The simulation phase is where RF-Agent diverges from game-playing MCTS: instead of rollouts, it actually trains RL agents using the proposed reward function. Finally, backpropagation updates node statistics throughout the path based on training performance metrics.
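The selection and backpropagation steps above can be sketched in a few lines. This is a generic UCT implementation over plain dict-based nodes, assuming each node tracks `visits` and `value_sum`; it illustrates the textbook algorithm, not RF-Agent's actual code:

```python
import math

def uct_score(node, parent_visits, c=1.414):
    """Standard UCT: mean value plus an exploration bonus for rarely-visited nodes."""
    if node["visits"] == 0:
        return float("inf")               # always try unvisited children first
    exploit = node["value_sum"] / node["visits"]
    explore = c * math.sqrt(math.log(parent_visits) / node["visits"])
    return exploit + explore

def select_leaf(root):
    """Descend the tree, picking the highest-UCT child until reaching a leaf."""
    node = root
    while node["children"]:
        node = max(node["children"], key=lambda ch: uct_score(ch, node["visits"]))
    return node

def backpropagate(node, score):
    """Push a training-performance score up the path from leaf to root."""
    while node is not None:
        node["visits"] += 1
        node["value_sum"] += score
        node = node.get("parent")
```

In RF-Agent's setting, the `score` fed into `backpropagate` comes from actually training an RL agent with the leaf's reward function, which is why each simulation step is so much more expensive than in game-playing MCTS.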
Here’s how you’d run RF-Agent on the Ant locomotion task with 80 MCTS simulations:
python rfagent.py env=ant model=gpt-4o-mini simulations=80 max_iterations=500 test_max_iterations=500
The max_iterations parameter controls training duration during search (set to 500 here), while test_max_iterations runs full-length training for final evaluation. According to the README, max_iterations is often set to half of test_max_iterations to increase search speed, though the specific benchmark defaults vary by task (detailed in different_environment_itration_nums.txt). The framework automatically handles parallel environment initialization, reward function hot-swapping, and metric collection across seeds.
The system’s dependency on modified IsaacGym and Bi-DexHands packages reveals an important architectural choice. To enable automatic reward function replacement during search, the authors made minimal modifications to these simulators (following Eureka’s approach). Rather than using standard pip-installable packages, you download a modified bundle from Google Drive and install locally. This design trades some deployment convenience for the capability to inject generated Python reward code directly into running environments.
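The idea of injecting generated reward code into a live environment can be illustrated with a toy sketch. This hypothetical `hot_swap_reward` helper compiles an LLM-produced `compute_reward` function and rebinds it on an environment object; RF-Agent's real mechanism lives inside its modified IsaacGym and Bi-DexHands packages:

```python
import types

def hot_swap_reward(env, reward_source: str):
    """Compile generated reward code and bind it as the env's reward method.
    Hypothetical helper: assumes the source defines compute_reward(obs, action)."""
    namespace = {}
    exec(reward_source, namespace)
    fn = namespace["compute_reward"]
    # Rebind so env.compute_reward(obs, action) dispatches to the new function.
    env.compute_reward = types.MethodType(
        lambda self, obs, action: fn(obs, action), env
    )

class DummyEnv:
    """Stand-in environment with a default reward, for demonstration only."""
    def compute_reward(self, obs, action):
        return 0.0
```

The real simulators add complications this sketch ignores (vectorized tensors, device placement, reward components logged per term), which is part of why the authors ship pre-modified packages instead of patching at runtime.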
For evaluation, RF-Agent includes pre-generated reward functions from multiple methods. You can benchmark the RF-Agent-generated reward for Ant against alternatives:
python test.py env=ant test_reward_function="/reward_functions/isaac/RFAgent/Ant-4o-mini.py" test_max_iterations=500
The repository provides comparable rewards from Eureka and Revolve (an evolutionary baseline), enabling direct performance comparisons on the same tasks. Each evaluation runs across multiple random seeds by default, though you can limit to seed 0 with num_eval=1 to reduce memory consumption—important because each seed spawns thousands of parallel environments.
The multi-stage contextual reasoning capability is what separates RF-Agent from simpler LLM-based approaches. Rather than treating each reward generation as independent, the MCTS tree structure accumulates knowledge about reward formulations and their outcomes, providing richer context for the LLM when proposing modifications.
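To make "richer context" concrete, here is one hedged sketch of how a node's ancestry could be folded into the LLM prompt. The dict keys and helper name are invented for illustration; RF-Agent's actual prompt construction may differ:

```python
def build_llm_context(node, max_ancestors=3):
    """Collect reward code and scores from a node's ancestors, oldest first,
    so the LLM proposes modifications with the search history in view.
    Illustrative helper; nodes are dicts with 'code', 'score', 'parent' keys."""
    history = []
    cur = node
    while cur is not None and len(history) < max_ancestors:
        history.append(f"# score={cur['score']}\n{cur['code']}")
        cur = cur["parent"]
    return "\n\n".join(reversed(history))
```

Because the tree preserves every lineage of refinements, a prompt assembled this way shows the LLM not just the current reward function but which changes along the path helped or hurt.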
Gotcha
The primary challenge is computational cost. RF-Agent requires hardware capable of running IsaacGym simulation (typically GPUs, as IsaacGym is GPU-accelerated), OpenAI API credits for LLM calls during MCTS search, and substantial memory for parallel environments. The README explicitly warns that each seed corresponds to parallel sampling of thousands of environments consuming significant memory, which implies hardware most individual developers don't have readily available. The recommendation to set num_eval=1 for testing underscores this: even evaluation is resource-intensive enough that you might want to skip most seeds.
The dependency on modified simulator packages distributed via Google Drive creates deployment friction. This approach ensures consistency for research reproducibility—everyone uses identical environment code for automatic reward function replacement—but creates maintenance considerations. You can’t simply pip install isaacgym and start working; you need to download specific modified versions that may not receive updates if IsaacGym itself evolves. The installation instructions explicitly recommend using their modified packages “rather than redeploying environment methods such as Bidex and rlgames,” suggesting incompatibility with standard installations.
Documentation gaps may limit extensibility. While the README covers running the provided tasks, it offers limited guidance on hyperparameter tuning, detailed MCTS configuration, or extending to custom environments beyond those included. The reference to different_environment_itration_nums.txt for task-specific iteration counts indicates environment-dependent configuration that isn't fully documented in the main README. If you want to apply RF-Agent to your own novel task or modify the MCTS search strategy, you'll likely need to read source code rather than consult comprehensive documentation.
Verdict
Use RF-Agent if you’re conducting research in automated reward engineering for continuous control, have access to GPU compute and API budgets, and work within IsaacGym or Bi-DexHands environments. It’s particularly valuable when you need reward functions for the supported tasks and want to compare against Eureka or Revolve baselines using the provided pre-generated rewards. The NeurIPS Spotlight acceptance signals solid contributions that make it worth engaging with despite setup friction. Skip it if you need production-ready tooling with minimal dependencies, work outside the supported simulators, have limited computational resources, or require plug-and-play integration with standard RL libraries. The modified environment packages, resource requirements, and research-oriented codebase make this best suited for academic exploration rather than immediate practical deployment. For simpler use cases where you can afford manual reward tuning, traditional domain expertise may still be more cost-effective than the computational overhead of tree search over LLM-generated code.