Open Trajectory Gym: Post-Training LLMs to Actually Use Tools Across Multiple Turns

Hook

A base Qwen model can't solve a single expert-level cryptographic CTF challenge. After trajectory-aware post-training on the same hardware, it solves 35% of them—not through memorization, but by learning to orchestrate multi-step oracle interactions.

Context

Language models excel at single-shot tasks but struggle with agentic workflows that demand sequential tool use, environment feedback, and iterative refinement. Standard supervised fine-tuning treats each turn independently, ignoring the temporal dependencies that define real agent behavior. You end up with models that hallucinate tool parameters, abandon tasks mid-execution, or fail to incorporate feedback from previous steps.

The gap between "chat with tools" and "agent that accomplishes tasks" requires trajectory-aware training—optimizing over entire execution traces rather than isolated turns. Existing RLHF toolkits like TRL focus on single-turn preference learning, while agent frameworks like LangChain prioritize runtime orchestration without training integration. Open Trajectory Gym bridges this gap with a pipeline specifically designed for multi-turn tool-use post-training, shipping with benchmark integration, distributed rollout infrastructure, and novel prompt evolution techniques that bypass expensive continued RL training.

Technical Insight

Open Trajectory Gym structures post-training as a four-stage pipeline where each component is swappable via YAML configs and adapter protocols. Stage 1 collects agent execution traces from benchmarks—think complete CTF challenge attempts with every tool call, observation, and reasoning step. Stage 2 converts these traces into training datasets with configurable filters (you can train only on successful trajectories or include failures as negative examples).
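
The Stage 2 filter described above can be pictured with a minimal sketch. The `Turn` and `Trajectory` classes, the `to_sft_examples` function, and the record layout are all hypothetical names chosen for illustration; the repository's actual dataset format may differ.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "assistant" or "tool" in this toy example
    content: str


@dataclass
class Trajectory:
    task_id: str
    turns: list[Turn]
    solved: bool   # did the episode reach the benchmark's success condition?


def to_sft_examples(trajectories, include_failures=False):
    """Convert whole trajectories into chat-style training records."""
    examples = []
    for traj in trajectories:
        if not traj.solved and not include_failures:
            continue  # success-only filter
        examples.append({
            "task_id": traj.task_id,
            "messages": [{"role": t.role, "content": t.content} for t in traj.turns],
            "label": "positive" if traj.solved else "negative",
        })
    return examples


traces = [
    Trajectory("crypto-01", [Turn("assistant", "connect to oracle"), Turn("tool", "flag found")], solved=True),
    Trajectory("crypto-02", [Turn("assistant", "cat /flag"), Turn("tool", "permission denied")], solved=False),
]
success_only = to_sft_examples(traces)
with_negatives = to_sft_examples(traces, include_failures=True)
```

Keeping every turn of a trajectory in one record, rather than emitting one record per turn, is what lets later stages train on the full execution trace.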

Stage 3 is where the architecture decisions get interesting. The framework runs a three-phase training sequence: supervised fine-tuning via TRL; online RL using a patched SkyRL fork with vLLM inference and RLOO (REINFORCE Leave-One-Out, a critic-free policy-gradient estimator that baselines each sampled rollout against the mean reward of the other rollouts for the same prompt); then GEPA (Genetic-Pareto), a reflective prompt-evolution optimizer. Here's how the modular agent protocol looks:

# Agent adapters follow a standardized interface
class AgentProtocol:
    def execute_turn(self, state, observation):
        # Generate the next action from the conversation history, the tools
        # available in this state, and the latest environment observation
        action = self.model.generate(
            context=state.history,
            tools=state.available_tools,
            observation=observation,
        )
        return action

    def parse_tool_call(self, action):
        # Extract a structured tool invocation from the raw model output
        return ToolCall.from_text(action)

# Benchmark adapters provide the environment interface
class BenchmarkAdapter:
    def step(self, tool_call):
        # Run the tool call, score the result, and report whether the episode ended
        result = self.env.execute(tool_call)
        reward = self.reward_function(result)
        done = self.check_terminal(result)
        return result, reward, done

The key architectural insight is maintaining constant harnesses across all three training phases. Your agent doesn't change its tool-use protocol between SFT and RL—the same parsing logic, the same tool schemas, the same evaluation metrics. This continuity prevents distribution shift where the model learns incompatible behaviors across stages.
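
To make the "constant harness" idea concrete, here is a small self-contained sketch in which one parser and one episode loop serve every phase. `ScriptedAgent`, `GuessEnv`, `run_episode`, and the JSON tool-call format are assumptions made for this example, not the project's actual classes or wire format.

```python
import json


def parse_tool_call(text):
    """Single parser shared by every phase; returns None for malformed output."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "tool" in call and "args" in call:
        return call
    return None


class ScriptedAgent:
    """Stand-in for a model: replays a fixed list of outputs."""

    def __init__(self, outputs):
        self.outputs = list(outputs)

    def execute_turn(self, history, observation):
        return self.outputs.pop(0)


class GuessEnv:
    """Toy environment: solved when the agent submits the secret value."""

    def __init__(self, secret):
        self.secret = secret

    def step(self, call):
        if call["tool"] == "submit" and call["args"].get("value") == self.secret:
            return "correct", 1.0, True
        return "wrong", 0.0, False


def run_episode(agent, env, max_turns=4):
    """The harness: the same loop for trace collection, RL rollouts, and evaluation."""
    history, observation, total = [], "start", 0.0
    for _ in range(max_turns):
        action = agent.execute_turn(history, observation)
        call = parse_tool_call(action)
        if call is None:
            observation, reward, done = "unparseable tool call", 0.0, False
        else:
            observation, reward, done = env.step(call)
        history.append((action, observation))
        total += reward
        if done:
            break
    return history, total


agent = ScriptedAgent(['not json', '{"tool": "submit", "args": {"value": 7}}'])
history, total = run_episode(agent, GuessEnv(secret=7))
```

Because the malformed first output is handled inside the harness instead of by phase-specific code, SFT traces and RL rollouts see the same error observation for the same mistake.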

The SkyRL integration demonstrates production-grade engineering. Open Trajectory Gym ships with 20 targeted patches addressing vLLM 0.16 compatibility, Ray 2.54 distributed execution bugs, and FSDP2 memory issues. The fork isn't academic—it's the difference between "trains on one GPU" and "scales to multi-node H200 clusters with FP8 serving." Here's the distributed rollout configuration:

rollout:
  backend: ray
  num_workers: 8
  inference:
    engine: vllm
    tensor_parallel: 4
    dtype: fp8
    gpu_memory_utilization: 0.9
  batch_size: 32
  episodes_per_iteration: 256

reward:
  type: trajectory_based
  sparse_terminal: true
  intermediate_checkpoints: ["tool_call_valid", "output_parseable"]
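
One way to read the `reward` block together with RLOO is sketched below: a sparse terminal reward with small bonuses for the intermediate checks, followed by a leave-one-out baseline over several rollouts of the same prompt. The bonus weight of 0.05 and both function names are assumptions for illustration; the source does not say how the framework weights intermediate checkpoints.

```python
def trajectory_reward(solved, checks, bonus_per_check=0.05):
    """Sparse terminal reward plus a small bonus per passed intermediate check."""
    terminal = 1.0 if solved else 0.0
    return terminal + bonus_per_check * sum(1 for ok in checks.values() if ok)


def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each rollout with the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four rollouts of the same challenge: two solved, two failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)

shaped = trajectory_reward(
    solved=False,
    checks={"tool_call_valid": True, "output_parseable": False},
)
```

With these rewards the solved rollouts get an advantage of about +0.67 and the failed ones about -0.67, so no learned value network is needed to tell good trajectories from bad ones.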

GEPA, the final phase of Stage 3, departs sharply from typical RL workflows. Instead of continuing to update weights, it treats prompts as evolvable parameters. Using DSPy, the system runs agent rollouts, collects failure modes, generates reflections on why the agent failed, then evolves system prompts to address those failure patterns. The CyBench CTF results show GEPA outperforming continued Online RL by ~6% while using 4-35x fewer rollouts. You're optimizing in prompt space rather than weight space, which is far cheaper when you've already invested compute in SFT+RL.
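
The rollout, reflect, and evolve cycle can be illustrated with a deliberately simplified toy. This is a greedy loop, not GEPA itself (which keeps a Pareto front of candidate prompts), and it does not use DSPy's API. The task table, `rollout`, and `reflect` are stand-ins for real agent runs and LLM reflection.

```python
# Hypothetical toy tasks: each one succeeds only if the prompt contains its hint.
TASKS = {
    "padding-oracle": "query the oracle one byte at a time",
    "rsa-small-e": "check for small public exponents",
    "xor-reuse": "look for key reuse across ciphertexts",
}


def rollout(prompt, task):
    """Stand-in for an agent rollout: returns (solved, failure_note)."""
    hint = TASKS[task]
    if hint in prompt:
        return True, ""
    return False, f"failed {task}: never tried to {hint}"


def reflect(failure_note):
    """Stand-in for an LLM reflection step: turns a failure note into a rule."""
    return failure_note.split("never tried to ", 1)[1]


def score(prompt):
    """Fraction of tasks solved under this prompt."""
    return sum(rollout(prompt, t)[0] for t in TASKS) / len(TASKS)


def evolve(prompt, max_rounds=5):
    best, best_score = prompt, score(prompt)
    for _ in range(max_rounds):
        failures = [note for solved, note in (rollout(best, t) for t in TASKS) if not solved]
        if not failures:
            break
        candidate = best + "\n- Always " + reflect(failures[0])
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep a mutation only if it helps
            best, best_score = candidate, candidate_score
    return best, best_score


seed = "You are a CTF agent. Use the available tools."
final_prompt, final_score = evolve(seed)
```

The model weights never change in this loop. Only the prompt text is edited, which is why each round costs a handful of rollouts rather than a gradient update.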

The framework's domain-agnostic design shines in its benchmark adapter system. The repository includes complete CyBench integration with the BoxPwnr agent (a ReAct-style agent for security challenges), but you can plug in SWE-bench, data analysis tasks, or system administration benchmarks without touching core training code. You define tool schemas in YAML, implement the environment step function, write a reward function, and the pipeline handles trajectory collection, dataset generation, and distributed training.
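
As a rough picture of what plugging in a new domain involves, the sketch below pairs a tool schema (written as the Python dict a YAML file might load into) with a toy adapter that validates calls, steps the environment, and scores the result. The schema fields, `validate_call`, and `RowCountAdapter` are invented for this example and are not taken from the repository.

```python
# Illustrative schema layout; the real YAML field names may differ.
TOOL_SCHEMAS = {
    "run_sql": {"required": {"query": str}},
    "submit_answer": {"required": {"value": int}},
}


def validate_call(tool, args):
    """Return an error string, or None when the call matches its schema."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return f"unknown tool: {tool}"
    for name, expected_type in schema["required"].items():
        if name not in args:
            return f"missing argument: {name}"
        if not isinstance(args[name], expected_type):
            return f"argument {name} must be {expected_type.__name__}"
    return None


class RowCountAdapter:
    """Toy data-analysis benchmark: report how many rows a table has."""

    def __init__(self, rows):
        self.rows = rows

    def reward_function(self, tool, args):
        return 1.0 if tool == "submit_answer" and args["value"] == len(self.rows) else 0.0

    def step(self, tool, args):
        error = validate_call(tool, args)
        if error:
            return error, 0.0, False  # invalid calls are reported, not fatal
        if tool == "run_sql":
            return f"{len(self.rows)} rows", 0.0, False
        return "submitted", self.reward_function(tool, args), True


adapter = RowCountAdapter(rows=[("a",), ("b",), ("c",)])
```

Returning schema errors as observations instead of raising exceptions gives the agent a chance to correct a malformed call within the same episode.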

One underappreciated detail: Open Trajectory Gym decouples model serving from training. During Online RL, vLLM serves inference requests while training happens in separate Ray actors. This architecture enables continuous sampling even during gradient updates—no idle GPUs waiting for the policy network to finish backprop. For research teams with limited hardware, this efficiency translates to 2-3x faster iteration cycles.
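
The decoupling can be approximated in miniature with a producer and a consumer sharing a bounded queue. This sketch uses standard-library threads in place of Ray actors and a counter in place of real gradient steps, so treat it as an analogy for the data flow only; `sampler`, `trainer`, and the version dictionary are hypothetical.

```python
import queue
import threading


def sampler(rollouts, policy_version, total):
    """Inference side: keeps producing rollouts tagged with the policy version it saw."""
    for i in range(total):
        rollouts.put({"episode": i, "policy_version": policy_version["v"]})
    rollouts.put(None)  # sentinel: no more rollouts


def trainer(rollouts, policy_version, batch_size, log):
    """Training side: consumes batches and bumps the policy version per update."""
    batch = []
    while True:
        item = rollouts.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            policy_version["v"] += 1  # stand-in for a gradient step and weight sync
            log.append(len(batch))
            batch = []


rollouts = queue.Queue(maxsize=16)  # bounded, so sampling cannot run far ahead
policy_version = {"v": 0}
updates = []
threads = [
    threading.Thread(target=sampler, args=(rollouts, policy_version, 32)),
    threading.Thread(target=trainer, args=(rollouts, policy_version, 8, updates)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The version tag on each rollout shows the trade-off of this design: sampling never blocks on training, but some rollouts come from a slightly stale policy, and the queue bound limits how stale they can get.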

Gotcha

The experimental status warning is not boilerplate—APIs genuinely change between releases. The project README explicitly states configs may break, and GitHub issues show users encountering breaking changes in training protocols. If you're building production systems with multi-month timelines, budget time for active maintenance or pin to a specific commit and fork.

Resource requirements are steep. The framework lists 24GB VRAM as the minimum, which is realistic only for inference-only experiments; full Qwen3.5-27B BF16 training demands 140GB+ across multiple GPUs. The Ray distributed setup spreads that load, but you're still looking at multi-GPU rigs or cloud instances with 8x A100s minimum. FP8 quantization reduces memory pressure (and the vLLM integration supports it), but you sacrifice some task performance. The compute economics favor research labs and well-funded startups, not individual developers on gaming rigs.

Additionally, the reliance on a forked SkyRL creates a long-term maintenance burden: upstream changes won't flow automatically, and the 20 patches suggest non-trivial divergence. You're betting on either the fork being maintained or upstream eventually incorporating these fixes.
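
A quick back-of-the-envelope calculation shows why the VRAM figures in this section climb so fast. The sketch assumes a nominal 27 billion parameters and counts weights only; gradients, optimizer state, activations, and the KV cache all come on top, which is why training needs far more than the weight footprint.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS_27B = 27e9  # nominal parameter count, assumed for illustration

bf16_gb = weight_memory_gb(PARAMS_27B, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_memory_gb(PARAMS_27B, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 weights: {bf16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```

Under these assumptions the BF16 weights alone come to 54 GB, more than twice the listed 24GB minimum, and FP8 halves that to 27 GB.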

Verdict

Use Open Trajectory Gym if you're training agentic models on multi-turn tool-use tasks with access to multi-GPU infrastructure (8+ GPUs recommended) and tolerance for experimental tooling. It's ideal for research teams exploring trajectory-based RL, organizations building domain-specific agents (security, SWE, data analysis) where you need full control over the training stack, or advanced ML engineers comfortable debugging Ray/vLLM integration issues. The GEPA approach alone justifies adoption if you want to iterate on agent behavior without expensive continued RL training. Skip it if you need production-stable APIs, lack substantial GPU resources (24GB VRAM minimum is misleading—realistic use requires 100GB+), want quick prototyping without infrastructure setup, or prefer vendor-supported tools over maintaining forks. For standard RLHF without multi-turn trajectory handling, stick with Hugging Face TRL; for agent deployment without training, use LangChain or AutoGPT.