RF-Agent: Teaching Language Models to Design Reward Functions Through Tree Search

Hook

Reinforcement learning's hardest problem is often not training the agent but designing the reward function that tells it what to optimize. RF-Agent combines a large language model with tree search to automate a task that even expert researchers struggle with.

Context

Reward function design is the Achilles' heel of reinforcement learning. Get it wrong, and your robot learns to game the system: flipping indefinitely to maximize "forward progress" or destroying objects to accumulate "manipulation" rewards. Get it right, and complex behaviors emerge naturally. Traditionally, this has required domain expertise, weeks of iteration, and intimate knowledge of both the task and the RL algorithm's quirks.

Recent work like Eureka demonstrated that large language models can generate reward functions through evolutionary search: prompting GPT-4 repeatedly and keeping the best candidates. But that search is greedy. Each round builds only on the best candidate found so far, so most of what the exploration process uncovers is thrown away. RF-Agent reframes the problem: instead of treating reward design as repeated text generation, it models it as sequential decision-making. Each reward function becomes a node in a search tree, with Monte Carlo Tree Search (MCTS) guiding exploration toward promising regions of the design space. The result, accepted as a spotlight paper at NeurIPS 2025, outperforms prior methods across 17 robotic control tasks in IsaacGym and Bi-DexHands environments.

Technical Insight

RF-Agent's architecture is a clever marriage of three components: an LLM that generates candidate reward functions, MCTS that decides which candidates to explore, and a modified RL training pipeline that evaluates each candidate's performance. The system treats reward function code as states in a search tree, where actions are LLM-generated modifications or entirely new functions.

The MCTS loop follows the classic selection-expansion-simulation-backpropagation pattern, adapted for code generation. During selection, the algorithm traverses the tree using an Upper Confidence Bound (UCB) score to balance exploitation of high-performing branches against exploration of untested ones. When it reaches a leaf node, the expansion phase prompts the LLM with the current reward function, the task description, and, crucially, historical feedback from previous evaluations. This contextual reasoning is what separates RF-Agent from simpler approaches: the LLM sees why previous reward functions failed and can make informed modifications.
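The selection step can be sketched in a few lines. This is a minimal illustration using the textbook UCB1 formula, not RF-Agent's actual code: the `Node` fields, the exploration constant `c`, and the function names are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate reward function in the search tree (hypothetical structure)."""
    reward_code: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0


def ucb1(child: Node, parent_visits: int, c: float = 1.41) -> float:
    """Textbook UCB1: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_value / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean + bonus


def select_leaf(root: Node) -> Node:
    """Descend from the root, picking the highest-UCB child at each level."""
    node = root
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: ucb1(ch, parent_visits))
    return node
```

A rarely visited child with a mediocre mean can outrank a well-explored child with a high mean, because the bonus term shrinks as visits accumulate. That is the exploration/exploitation trade-off described above.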

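Here is an illustrative example of the kind of reward function the system generates:

# RF-Agent generates reward functions as Python code.
# Illustrative example for a quadruped locomotion task; the observation
# layout (columns 0:2 and 3:7) and the target_vel argument are assumed here.
from isaacgym.torch_utils import quat_rotate  # isaacgym must be imported before torch
import torch

def compute_reward(obs_buf, action_buf, next_obs_buf, target_vel):
    # Velocity tracking reward: 1.0 at zero error, decaying with squared error
    lin_vel_error = torch.sum(torch.square(
        next_obs_buf[:, 0:2] - target_vel), dim=1)
    vel_reward = torch.exp(-lin_vel_error / 0.25)

    # Energy penalty for efficiency
    energy_penalty = torch.sum(torch.square(action_buf), dim=1)

    # Torso orientation penalty: rotate the world up axis by the torso quaternion
    torso_quat = next_obs_buf[:, 3:7]
    up_axis = torch.tensor([0.0, 0.0, 1.0], device=torso_quat.device,
                           dtype=torso_quat.dtype).repeat(torso_quat.shape[0], 1)
    up_vec = quat_rotate(torso_quat, up_axis)
    orientation_penalty = torch.square(1.0 - up_vec[:, 2])

    # Combined reward; the weights are constants written by the generator, not learned
    reward = vel_reward - 0.01 * energy_penalty - 0.5 * orientation_penalty
    return reward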

The LLM generates functions following this template, adjusting reward components, scaling factors, and combinations based on training feedback. The simulation phase runs full RL training for each candidate—typically 50-100 million environment steps across thousands of parallel environments in IsaacGym. This is where the computational cost becomes significant: each MCTS search might evaluate 80-512 different reward functions, each requiring complete training runs.

The backpropagation phase updates node values based on training performance metrics: final task success rate, average episode reward, convergence speed, and behavior quality assessments. These values inform future UCB calculations, creating a feedback loop that steers the search toward productive regions of the reward design space.
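As a rough sketch of that update, the snippet below propagates a single scalar score from an evaluated leaf up to the root. How RF-Agent combines success rate, episode reward, and the other metrics into a node value is not specified here, so the scalar `score`, the `Node` fields, and the running-mean update are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Search-tree node holding running statistics (hypothetical structure)."""
    parent: Optional["Node"] = None
    visits: int = 0
    total_value: float = 0.0

    @property
    def mean_value(self) -> float:
        # Average score of all evaluations at or below this node
        return self.total_value / self.visits if self.visits else 0.0


def backpropagate(leaf: Node, score: float) -> None:
    """Add one evaluation result to the leaf and every ancestor."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.total_value += score
        node = node.parent
```

Because every ancestor accumulates the score, a branch whose descendants train well ends up with a higher mean value, which in turn raises its UCB score on the next selection pass.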

What makes this effective is the multi-stage contextual reasoning. When the LLM generates a new reward function, it receives a prompt structured like:

prompt = f"""
Task: {task_description}
Current reward function performance: {metrics}
Previous attempts and failures:
{format_history(search_tree_path)}

Generate an improved reward function that addresses:
1. {identified_failure_mode_1}
2. {identified_failure_mode_2}

Reward function code:
"""

This historical context allows the LLM to avoid repeating mistakes. If previous functions caused the robot to fall frequently, the new candidate might add stability penalties. If energy consumption was too high, it adjusts penalty weights. The tree structure naturally organizes this exploration—similar reward design strategies cluster together, and the MCTS algorithm learns which branches are worth pursuing.
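To make the history section of the prompt concrete, here is one way a `format_history` helper could flatten the path from the root to the current node into text. The helper's behavior and the record keys (`success_rate`, `feedback`) are hypothetical; the real implementation may track different fields.

```python
from typing import Any, Dict, List


def format_history(search_tree_path: List[Dict[str, Any]]) -> str:
    """Render earlier attempts on the current tree path, oldest first.

    Each record is assumed to carry a 'success_rate' float and a short
    'feedback' string summarizing what went wrong in training.
    """
    lines = []
    for depth, attempt in enumerate(search_tree_path):
        lines.append(
            f"Attempt {depth} (success rate {attempt['success_rate']:.2f}): "
            f"{attempt['feedback']}"
        )
    return "\n".join(lines)


# Example records with made-up values, purely for illustration
path = [
    {"success_rate": 0.10, "feedback": "robot falls within the first second"},
    {"success_rate": 0.35, "feedback": "stable, but energy use is too high"},
]
history_text = format_history(path)
```

Keeping the history limited to the current path, rather than the whole tree, keeps the prompt short and keeps the feedback relevant to the design strategy being refined.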

The implementation modifies IsaacGym's core training loop to enable hot-swapping of reward functions. Rather than hardcoding rewards into environment definitions, RF-Agent dynamically injects generated code into the training process. This required patching several layers of the IsaacGym and IsaacGymEnvs packages, which is why the repository includes modified versions rather than using standard pip installations. The reward function is evaluated at each timestep across all parallel environments, making vectorization and GPU compatibility critical—poorly designed reward functions can become computational bottlenecks.
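The hot-swapping idea can be illustrated without IsaacGym. The sketch below compiles a generated source string into a callable and swaps it into a toy environment wrapper. `load_reward_fn` and `SwappableRewardEnv` are hypothetical names, and the example uses plain floats where a real reward function would operate on batched GPU tensors.

```python
from typing import Any, Callable, Dict, Optional


def load_reward_fn(source: str, fn_name: str = "compute_reward",
                   extra_globals: Optional[Dict[str, Any]] = None) -> Callable:
    """Execute generated source in a fresh namespace and return the named function.

    Note: exec() runs arbitrary code, so LLM output should be sandboxed
    before being loaded this way.
    """
    namespace: Dict[str, Any] = dict(extra_globals or {})
    exec(compile(source, "<generated_reward>", "exec"), namespace)
    fn = namespace.get(fn_name)
    if not callable(fn):
        raise ValueError(f"generated code does not define {fn_name}()")
    return fn


class SwappableRewardEnv:
    """Toy stand-in for an environment whose reward is injected at runtime."""

    def __init__(self, reward_fn: Callable) -> None:
        self.reward_fn = reward_fn

    def set_reward_fn(self, reward_fn: Callable) -> None:
        # Replace the reward without rebuilding the environment
        self.reward_fn = reward_fn

    def step_reward(self, *args: Any) -> Any:
        return self.reward_fn(*args)


generated = "def compute_reward(progress, effort):\n    return progress - 0.1 * effort\n"
env = SwappableRewardEnv(load_reward_fn(generated))
first = env.step_reward(1.0, 2.0)

revised = "def compute_reward(progress, effort):\n    return progress - 0.5 * effort\n"
env.set_reward_fn(load_reward_fn(revised))
second = env.step_reward(1.0, 2.0)
```

Loading each candidate into its own namespace keeps one generated function from leaking names into the next, which matters when dozens of candidates are evaluated in the same process.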

Gotcha

The computational requirements are formidable and the dependencies are messy. Each MCTS search with 80 simulations means 80 complete RL training runs, each potentially taking hours on GPU clusters with thousands of parallel environments. The memory footprint is substantial: IsaacGym's parallel simulation already demands significant GPU memory, and that demand recurs for every candidate evaluated (and stacks up if candidates are trained concurrently). Even using GPT-4o-mini for cost efficiency, hundreds of LLM API calls per task add up quickly.
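A back-of-envelope calculation helps when budgeting a search. The helper below is a hypothetical sketch that assumes one GPU per training run and a fixed duration per run. Only the 80-run figure comes from the text above; the two-hour run time and four concurrent runs in the example are placeholders, not measurements.

```python
import math
from typing import Tuple


def estimate_search_cost(num_candidates: int, hours_per_run: float,
                         parallel_runs: int) -> Tuple[float, float]:
    """Return (total GPU-hours, wall-clock hours) for one search.

    Assumes each candidate needs one full training run on one GPU and
    that runs are scheduled in batches of `parallel_runs`.
    """
    if num_candidates < 1 or parallel_runs < 1 or hours_per_run <= 0:
        raise ValueError("all arguments must be positive")
    gpu_hours = num_candidates * hours_per_run
    batches = math.ceil(num_candidates / parallel_runs)
    wall_clock_hours = batches * hours_per_run
    return gpu_hours, wall_clock_hours


# 80 candidates (from the text); 2.0 hours and 4 GPUs are placeholder values
gpu_hours, wall_clock_hours = estimate_search_cost(80, 2.0, 4)
```

Plugging in your own run time and GPU count before launching a search gives a quick sanity check on whether the budget is realistic.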

The repository requires modified versions of IsaacGym, IsaacGymEnvs, and Bi-DexHands rather than standard installations, creating maintenance burden and compatibility risks. Installation involves cloning multiple repositories, applying patches, and managing interdependencies. Generalization beyond low-level control tasks in simulation remains uncertain—real-world robotics introduces perception challenges and safety constraints not addressed here, and applying this to other domains would require significant adaptation.

Verdict

Use RF-Agent if you're researching automated reward design for robotic control and need state-of-the-art performance on complex manipulation or locomotion tasks where hand-engineered rewards consistently fail. The MCTS-guided approach excels when the reward design space is large and greedy search gets trapped in local optima. It's particularly valuable for academic research comparing against baselines or exploring the intersection of LLMs and RL. Skip it if you're working in production environments that need stable dependencies, have limited computational budget (both GPU compute and LLM API costs are significant), or need solutions outside the IsaacGym/Bi-DexHands ecosystems. For simpler tasks or rapid prototyping, direct prompting methods like Eureka offer easier setup with reasonable performance. For real-world robotics, classical inverse RL or learning from demonstration may be more practical until sim-to-real transfer is better understood.
