RF-Agent: Teaching Language Models to Design Reward Functions Through Tree Search

Hook

The hardest part of reinforcement learning isn’t the learning algorithm—it’s writing the reward function. What if an LLM could explore thousands of reward formulations and converge on the best one through structured search rather than random guessing?

Context

Anyone who’s worked with reinforcement learning knows the reward engineering problem: you spend weeks tuning a single reward function, only to discover your robot learned to exploit a loophole or optimized for the wrong behavior entirely. A manipulation task might need balancing gripper force, object position, orientation stability, and movement efficiency—each requiring careful weight tuning. Get it wrong and your agent either does nothing or discovers creative ways to maximize reward without solving the task.

Recent work like Eureka demonstrated that large language models could generate reward functions from task descriptions, but the approach was essentially evolutionary search: sample many candidates, keep the best, repeat. RF-Agent takes a fundamentally different approach by treating reward function design as a sequential decision problem solved through tree search. Instead of randomly exploring the reward space, it builds a search tree where each node represents a reward formulation and edges represent refinements. This structured exploration lets the system find good solutions with far fewer expensive simulation evaluations—critical when each candidate requires training an RL agent to convergence.

Technical Insight

[Figure: System architecture (auto-generated). In the runtime loop, an LLM controller (GPT-4o-mini) selects which reward function node to expand and proposes modifications as Python reward function code; a tree search engine stores the evaluated nodes (code plus performance metrics); a parallel evaluation engine built on a modified IsaacGym runs episode rollouts and reports success rate, average reward, and failure modes back to the controller.]

RF-Agent’s architecture consists of three main components: a modified IsaacGym environment that accepts reward functions as runtime-swappable Python code, an LLM-powered tree search controller, and a parallel evaluation engine. The system treats GPT-4o-mini (or similar models) not as a black-box code generator but as a policy that selects which reward function node to expand next based on previous simulation results.

The tree search operates over reward function implementations. Each node stores a Python function that calculates rewards from environment observations. The LLM receives the current task description, the reward function code at a given node, and its performance metrics, then proposes modifications. Here’s what a typical reward function node looks like in the shadow hand manipulation task:

import torch
from isaacgym.torch_utils import quat_mul, quat_conjugate

def compute_hand_rewards(
    obs_buf, rew_buf, reset_buf, progress_buf, successes,
    consecutive_successes, max_episode_length,
    object_pos, object_rot, target_pos, target_rot,
    fingertip_pos, hand_base_pos, actions
):
    # Distance reward: encourage fingertips toward object
    dist_reward = torch.exp(-5 * torch.norm(
        fingertip_pos - object_pos, dim=-1
    ))

    # Orientation alignment reward
    quat_diff = quat_mul(object_rot, quat_conjugate(target_rot))
    rot_reward = 1.0 - 2.0 * torch.asin(
        torch.clamp(torch.norm(quat_diff[:, 1:], dim=-1), max=1.0)
    )

    # Goal achievement bonus
    goal_dist = torch.norm(object_pos - target_pos, dim=-1)
    goal_reward = (goal_dist < 0.1).float() * 10.0

    # Action regularization penalty
    action_penalty = -0.01 * torch.sum(torch.square(actions), dim=-1)

    # Weighted combination
    reward = 2.0 * dist_reward + 3.0 * rot_reward + goal_reward + action_penalty

    return reward

The LLM doesn’t just randomly mutate this function. Instead, it receives feedback like “Task success rate: 15%, average episode reward: 42.3, common failure mode: object slips from grasp before reaching target.” With this context, it might propose adding a grasp stability term or increasing the weight on the distance reward component. This proposal becomes a child node in the search tree.
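The feedback context can be pictured as a small structured payload rendered into the prompt. This is an illustrative sketch; the field names and prompt wording are assumptions, not RF-Agent's actual schema:

```python
# Hypothetical feedback payload passed back to the LLM controller.
# Field names are illustrative, not RF-Agent's actual schema.
feedback = {
    "task": "shadow hand reorientation",
    "success_rate": 0.15,
    "avg_episode_reward": 42.3,
    "failure_mode": "object slips from grasp before reaching target",
}

# Render the payload into the refinement prompt for the LLM.
prompt = (
    f"Task success rate: {feedback['success_rate']:.0%}, "
    f"average episode reward: {feedback['avg_episode_reward']}. "
    f"Common failure mode: {feedback['failure_mode']}. "
    "Propose a modified reward function addressing this failure."
)
```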

The tree search strategy uses Upper Confidence bounds applied to Trees (UCT) to balance exploitation of promising reward formulations with exploration of untested modifications. The selection criterion looks like:

score = mean_reward + C * sqrt(log(parent_visits) / node_visits)

This ensures that reward functions showing early promise get refined further, while preventing premature convergence to local optima. The system runs parallel IsaacGym instances (512 for Bi-DexHands experiments) to evaluate multiple reward candidates simultaneously, dramatically reducing wall-clock time.
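The selection rule above can be sketched in a few lines. The class and function names here are assumptions for illustration, not RF-Agent's actual API:

```python
import math

# Minimal UCT selection sketch; names are illustrative, not the repo's API.
class RewardNode:
    def __init__(self, code, parent=None):
        self.code = code          # reward function source string
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def mean_reward(self):
        return self.total_reward / self.visits if self.visits else 0.0

def uct_select(node, c=1.4):
    """Pick the child maximizing mean reward plus an exploration bonus."""
    def score(child):
        if child.visits == 0:
            return float("inf")   # always try unvisited children first
        return child.mean_reward() + c * math.sqrt(
            math.log(node.visits) / child.visits
        )
    return max(node.children, key=score)
```

Unvisited children score infinity, so every proposed modification gets at least one evaluation before the search starts favoring the best-performing branch.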

The implementation modifies IsaacGym’s core task definition to accept reward functions as first-class parameters. Instead of hardcoding compute_reward() methods in task classes, RF-Agent dynamically compiles and injects reward functions at initialization. This required patching IsaacGym’s environment creation pipeline to bypass the standard inheritance pattern and allow runtime reward specification. The modified environment accepts a reward_function_code string parameter, compiles it using Python’s exec(), and binds it to the vectorized environment instances.
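The injection mechanism can be approximated as follows. The `reward_function_code` parameter name comes from the article; the helper function and its error handling are assumptions:

```python
# Sketch of runtime reward injection via exec(); only the parameter name
# `reward_function_code` comes from the article, the rest is assumed.
def compile_reward_fn(reward_function_code, fn_name="compute_hand_rewards"):
    namespace = {}
    exec(reward_function_code, namespace)  # execute the LLM-written source
    if fn_name not in namespace:
        raise ValueError(f"reward code must define {fn_name}()")
    return namespace[fn_name]

# Toy reward source standing in for an LLM-generated candidate.
code = """
def compute_hand_rewards(goal_dist):
    return -goal_dist  # toy reward: closer to the goal is better
"""
reward_fn = compile_reward_fn(code)
# reward_fn(0.5) == -0.5
```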

One clever architectural choice: separating search-time training iterations from final evaluation iterations. During tree search, each reward candidate trains for max_iterations (like 300 steps) to get a rough performance estimate. Only the final selected reward function trains for the full test_max_iterations (like 2000 steps) to report final metrics. This 6-7x speedup in evaluation time makes exploring dozens of candidates feasible within reasonable compute budgets.
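The two-phase budget amounts to a simple pattern. The iteration counts mirror the article's example values; the function names are assumptions, not the repo's API:

```python
# Two-phase evaluation sketch: cheap short runs rank candidates during
# search, and only the winner gets the full training budget.
SEARCH_ITERS = 300    # rough performance estimate per candidate
FINAL_ITERS = 2000    # full training for the selected reward only

def evaluate_candidates(candidates, train_fn):
    """Rank candidates with short runs, then fully train the best one."""
    scores = {c: train_fn(c, iters=SEARCH_ITERS) for c in candidates}
    best = max(scores, key=scores.get)
    return best, train_fn(best, iters=FINAL_ITERS)
```

The implicit bet is that 300 iterations are enough to rank candidates reliably; a reward that only pays off late in training could be ranked below a fast-starting but ultimately weaker one.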

The codebase provides pre-generated reward functions from RF-Agent, Eureka, and Revolve baselines, stored as executable Python files. This enables reproducibility without re-running expensive LLM queries. You can directly compare approaches by loading different reward functions into the same environment configuration—invaluable for ablation studies and method comparisons.

Gotcha

The hardware requirements are substantial and non-negotiable. IsaacGym Preview 4 requires NVIDIA GPUs and won’t run on CPU-only machines or AMD hardware. The recommended setup uses 512 parallel environments for Bi-DexHands tasks, which demands high-end GPU memory (the paper mentions experiments on unspecified GPU configurations, but similar work typically requires RTX 3090 or better). If you’re working on a laptop or cloud instance without dedicated GPUs, you’re locked out.

The dependency on OpenAI’s API introduces both cost and reliability concerns. Running a full tree search with dozens of expansions means dozens of API calls, and each call includes potentially lengthy context (task descriptions, previous reward functions, performance data). At current GPT-4o-mini pricing, a single experiment might cost $5-20 depending on search depth, which adds up quickly during development. The paper claims RF-Agent uses fewer queries than Eureka (a major selling point), but you’re still making enough API calls that rate limits or outages become real concerns. There’s no clear path to using open-source LLMs like Llama or Mixtral—the prompting logic seems tuned for OpenAI’s chat format and you’d need to reimplement the LLM interface layer.

Generalization beyond IsaacGym remains an open question. The system is tightly coupled to IsaacGym’s vectorized environment API and reward function signatures. Adapting it to MuJoCo, PyBullet, or real robot platforms would require substantial refactoring. The reward function format expects specific observation tensors (fingertip positions, object poses, etc.) that won’t exist in other domains. If your task doesn’t fit the “manipulation with clear success criteria” pattern, you’ll be pioneering new territory with limited guidance from the existing codebase.

Verdict

Use if: You’re working on robotics manipulation tasks in IsaacGym or Bi-DexHands where reward engineering has become a bottleneck, you have access to NVIDIA GPUs with sufficient memory for parallel simulation, and you have budget for OpenAI API calls. The tree search approach genuinely improves on naive evolutionary methods, and if reward tuning is costing you weeks of researcher time, the infrastructure investment pays off. The pre-generated reward functions alone provide value for reproducibility research.

Skip if: You’re working outside the IsaacGym ecosystem, need a solution that runs on modest hardware, want to use open-source LLMs, or are tackling tasks without clear quantitative success metrics. The tool is purpose-built for a specific workflow and won’t generalize gracefully. With only 3 GitHub stars and recent publication, expect minimal community support and potential breaking changes. For most RL practitioners, manual reward engineering or inverse RL from demonstrations remains more practical until this approach matures and broadens its scope.
