Open Trajectory Gym: Training LLM Agents to Solve CTF Challenges Through Real Tool Execution

Hook

A prompt evolution technique called GEPA can outperform gradient-based reinforcement learning while using up to 35x fewer environment rollouts—but only for agentic tasks where the model already understands the domain conceptually.

Context

Large language models can explain cybersecurity concepts fluently, but ask them to actually solve a CTF challenge and they fall apart. The gap isn’t knowledge—it’s execution. Base models fail to maintain context across dozens of tool invocations, lose track of incremental progress, and can’t learn from the feedback loops that expert agents navigate instinctively.

Traditional post-training approaches like RLHF focus on single-turn quality, but agentic tasks demand something different: the ability to orchestrate multi-step workflows where each action depends on observing the previous result. Open Trajectory Gym tackles this by treating agent traces as first-class training data. Instead of training on isolated question-answer pairs, it captures entire episode trajectories—the full sequence of observations, tool calls, and outcomes—then uses three complementary training phases to teach models the workflows that separate functional agents from conversational chatbots.

Technical Insight

[System architecture diagram, auto-generated: an Agent + LLM exchanges observations/actions with a Benchmark Adapter; Trace Collection captures raw trajectories; a Dataset Converter turns successful traces into training data; a Reward Adapter scores episodes with full context + rewards; a Model Adapter routes fine-tuned weights and policy updates back for inference; the Training Strategy branches into SFT via TRL and Online RL via SkyRL.]

The framework’s architecture decouples five concerns through adapter protocols: agent implementations, language models, benchmarks, reward functions, and training algorithms. This modularity means you can swap a CTF benchmark for software engineering tasks or replace the model without rewriting orchestration logic.
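Concretely, this kind of decoupling maps naturally onto structural typing. Here is a minimal sketch of what two such adapter protocols could look like — the method names and signatures are illustrative assumptions, not the framework's actual definitions:

```python
# Illustrative sketch of adapter decoupling via structural typing.
# Method names are assumptions, not the framework's real protocols.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class BenchmarkAdapter(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: str) -> tuple: ...

@runtime_checkable
class RewardAdapter(Protocol):
    def score(self, trajectory: list) -> float: ...
```

Any object with matching methods satisfies the protocol, so a CTF benchmark could be swapped for a software engineering one without touching orchestration code.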

The pipeline operates in four stages. First, trace collection runs episodes where agents interact with benchmarks, capturing every observation and action. Here’s what a concrete trace adapter looks like:

from open_trajectory_gym.adapters import TraceAdapter

class CyBenchTraceAdapter(TraceAdapter):
    def collect_episode(self, agent, env, max_steps=50):
        """Collect single trajectory with tool execution feedback."""
        trajectory = []
        obs = env.reset()
        
        for step in range(max_steps):
            action = agent.act(obs)
            next_obs, reward, done, info = env.step(action)
            
            trajectory.append({
                "observation": obs,
                "action": action,
                "tool_result": next_obs,
                "reward": reward,
                "metadata": info
            })
            
            if done:
                break
            obs = next_obs
            
        return trajectory

Stage two converts trajectories into training datasets. For supervised fine-tuning, successful traces become demonstrations. For online RL, the framework preserves the full context window including tool outputs, enabling the model to learn from execution feedback rather than just syntactic patterns.
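Stage two can be sketched as a filter-then-flatten pass: keep episodes whose cumulative reward indicates a solve, then unroll each into chat-style demonstrations. The message schema, function names, and threshold below are assumptions for illustration, not the framework's actual dataset format; the trajectory fields mirror the trace dicts above:

```python
# Hypothetical sketch of stage two: filter solved episodes and flatten
# them into chat-style SFT demonstrations. Schema and threshold are
# illustrative assumptions, not the framework's actual format.
def trajectory_to_sft_examples(trajectory, system_prompt="You are a CTF agent."):
    messages = [{"role": "system", "content": system_prompt}]
    for step in trajectory:
        # Each observation already carries the previous tool's output,
        # so the model trains on execution feedback, not just actions
        messages.append({"role": "user", "content": step["observation"]})
        messages.append({"role": "assistant", "content": step["action"]})
    return messages

def build_sft_dataset(trajectories, solve_threshold=1.0):
    # Only trajectories that actually solved the challenge become demonstrations
    solved = [t for t in trajectories
              if sum(step["reward"] for step in t) >= solve_threshold]
    return [trajectory_to_sft_examples(t) for t in solved]
```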

Stage three implements three training methods. SFT uses TRL (Transformer Reinforcement Learning) to imitate successful agent traces. The model learns surface patterns—which tools experts invoke, how they parse outputs, typical action sequences. On CyBench CTF challenges, this alone improves solve rates from 12.5% to roughly 20%.

Online RL kicks in next, using a patched SkyRL fork that executes tools during training. Unlike standard RLHF where rewards come from preference labels, here the environment provides rewards: did the SQL injection work? Did the reverse shell connect? The trainer uses RLOO (REINFORCE Leave-One-Out) for advantage estimation:

# Simplified RLOO advantage calculation
import numpy as np

def compute_advantages(trajectories):
    """Compare each trajectory's return against the mean of its peers."""
    advantages = []

    for i, traj in enumerate(trajectories):
        # Leave-one-out baseline: average return of all other trajectories
        other_returns = [t.return_value for j, t in enumerate(trajectories) if j != i]
        baseline = np.mean(other_returns)

        advantage = traj.return_value - baseline
        advantages.append(advantage)

    return advantages

This approach reduces variance compared to single-baseline methods because each trajectory is compared against contemporaneous rollouts rather than a static baseline. Live tool execution is crucial here: the model receives genuine feedback about whether its SQL syntax was valid, not whether a human annotator thought it looked plausible.

The third method, GEPA (Genetic-Pareto prompt evolution), evolves prompts using DSPy rather than updating weights. It maintains a population of prompt variants, evaluates them on agent performance metrics (solve rate, step efficiency), and uses Pareto optimization to balance competing objectives. The repo’s authors report GEPA achieves ~35% solve rates versus Online RL’s ~29%, while requiring 4-35x fewer environment rollouts. This suggests that for domains where base models already possess latent capability, prompt engineering can surface competence more efficiently than gradient updates.
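The Pareto selection step at the heart of this scheme can be sketched as follows — the tuple layout and function name are illustrative assumptions, not GEPA's or DSPy's actual API:

```python
# Hypothetical sketch of Pareto-front selection over prompt variants
# scored on two objectives. Layout and names are illustrative.
def pareto_front(candidates):
    """candidates: list of (prompt, solve_rate, step_efficiency) tuples.

    Keeps every variant that no other variant dominates, i.e. none is
    at least as good on both objectives and strictly better on one.
    """
    front = []
    for i, (_, sr_i, eff_i) in enumerate(candidates):
        dominated = any(
            sr_j >= sr_i and eff_j >= eff_i and (sr_j > sr_i or eff_j > eff_i)
            for j, (_, sr_j, eff_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append(candidates[i])
    return front
```

Keeping the whole front rather than a single best variant is what lets the optimizer trade solve rate against step efficiency instead of collapsing both into one score.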

The production engineering reveals tensions in the RL training ecosystem. The framework maintains a patched SkyRL fork addressing 20 issues: NCCL deadlocks during distributed training, vLLM 0.16 API changes, FSDP2 integration for memory efficiency. These patches aren’t minor—they represent fundamental stability problems when combining vLLM inference servers, Ray distributed execution, and FSDP sharding. The fact that upstream doesn’t handle these cases suggests that agentic RL post-training remains bleeding-edge territory where practitioners need to fork and patch dependencies.

The async Ray-based orchestration parallelizes trajectory collection across multiple environments while maintaining a shared vLLM instance for inference batching. This architecture amortizes model loading costs but introduces complexity around failure handling—if one environment deadlocks, should the entire training run abort or continue with reduced parallelism? The framework opts for resilience, catching exceptions per-episode and continuing, though this means training logs can be misleading about actual compute costs.
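That per-episode resilience policy might look roughly like the sketch below — `collect_episode` and the failure record format are assumptions, and the real framework does this across Ray workers rather than a plain loop:

```python
# Illustrative sketch of per-episode failure isolation: a broken
# environment is logged and skipped instead of aborting the run.
# Names and the failure format are assumptions for illustration.
def collect_batch(collect_episode, episodes):
    trajectories, failures = [], []
    for ep in episodes:
        try:
            trajectories.append(collect_episode(ep))
        except Exception as exc:  # e.g., environment deadlock or timeout
            failures.append((ep, repr(exc)))
    # Returning failures explicitly lets logs reflect real compute
    # spent, not just the episodes that produced usable trajectories
    return trajectories, failures
```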

Gotcha

The resource requirements are substantial. The featured Qwen2.5-Coder-32B model demands 140GB+ VRAM for Online RL training (multiple sharded instances plus rollout buffers). Even the recommended minimum setup needs 24GB+ VRAM and 8+ CPU cores. The framework provides cloud deployment configs, but costs accumulate quickly—expect hundreds to thousands of dollars for meaningful experiments. There’s no gradient checkpointing or quantization-aware training to reduce footprint, positioning this firmly in the research lab rather than hobbyist territory.

The experimental status is honest but limiting. Version 0.1.0 with explicit API instability warnings means downstream code will break. The dependency on patched forks creates a fragile supply chain—if SkyRL upstream diverges significantly, maintaining compatibility becomes a second full-time job. Documentation is sparse beyond the case study, and the single benchmark (CyBench CTF challenges) makes it difficult to assess how techniques generalize to software engineering agents, data analysis workflows, or system administration tasks. The modular architecture theoretically supports these domains, but without reference implementations or validation results, you’re pioneering.

The GEPA performance claims warrant scrutiny. Outperforming Online RL by 6% with 35x efficiency sounds transformative, but the mechanism matters. If GEPA succeeds by better eliciting existing capabilities through prompt engineering, it won’t help models that lack domain knowledge—it’s an optimization technique, not a teaching method. The paper doesn’t clarify whether GEPA’s prompts transfer across model families or require per-model tuning, which would significantly impact practical utility.

Verdict

Use Open Trajectory Gym if you’re a researcher with access to substantial GPU infrastructure (140GB+ VRAM) investigating how trajectory-based post-training improves agentic task performance, especially in domains where tool execution provides clear reward signals. The modular architecture and concrete CTF results offer a legitimate starting point for experiments, and the three-phase training comparison (SFT vs. Online RL vs. GEPA) provides valuable methodology for evaluating different approaches. The framework’s willingness to maintain production patches also benefits researchers who need these capabilities now rather than waiting for upstream stabilization.

Skip it if you need production-ready tooling for deployed agents, lack multi-GPU infrastructure, or require stable APIs and comprehensive documentation. The 0.1.0 experimental status, fork dependencies, single-benchmark validation, and resource demands make this unsuitable for teams building production agent systems or individual developers exploring agentic workflows on consumer hardware. For those use cases, consider OpenHands for software engineering agents or LangGraph for production orchestration with established stability.
