SPORT: Teaching Multimodal Agents to Self-Improve Without Human Labels
Hook
Training multimodal agents typically requires thousands of human-labeled trajectories. SPORT, presented at NeurIPS 2025, shows you can skip that step by having agents grade their own homework.
Context
The bottleneck in building capable AI agents has shifted from model capacity to training data. While large language models can follow instructions reasonably well, teaching them to interact with visual environments and execute multi-step tasks requires trajectory data: sequences of observations, actions, and outcomes. The traditional approach involves humans meticulously labeling successful and failed attempts, an expensive process that doesn't scale.
SPORT (Self-improving multimodal agents through Preference Optimization and Reinforcement learning with Task synthesis) attacks this problem by closing the loop. Instead of waiting for human annotators, the system generates its own tasks, explores multiple solution paths, uses LLMs as critics to evaluate outcomes, and updates its policy through preference learning. It's autonomous curriculum learning for multimodal agents: the system runs its own training camp, proposing exercises, attempting them multiple ways, and learning from what works.
Technical Insight
SPORT's architecture revolves around four interconnected components that form a self-sustaining improvement cycle. First, the task synthesis module leverages the ShareGPT4V dataset—a collection of diverse images with detailed captions and visual embeddings—to generate multimodal tasks. Rather than relying on fixed benchmarks, an LLM proposes questions and challenges based on image content, creating an endless stream of training scenarios that span visual reasoning, tool use, and code generation.
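The task synthesis step described above can be sketched roughly as follows. This is a minimal illustration, not SPORT's actual code: the prompt wording, the `synthesize_tasks` helper, and the `fake_llm` stand-in are all assumptions made so the example runs offline.

```python
import re
from typing import Callable, List

# Hypothetical prompt template; SPORT's real prompts are not shown in this article.
TASK_PROMPT = (
    "You are designing tasks for a multimodal agent.\n"
    "Image caption: {caption}\n"
    "Propose {n} distinct tasks that require inspecting the image and using tools. "
    "Return one task per line."
)

def synthesize_tasks(caption: str, llm: Callable[[str], str], n: int = 3) -> List[str]:
    """Ask an LLM for up to n tasks grounded in one image caption."""
    reply = llm(TASK_PROMPT.format(caption=caption, n=n))
    tasks = []
    for line in reply.splitlines():
        # Drop leading list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            tasks.append(cleaned)
    return tasks[:n]

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption for illustration only).
    return (
        "1. Count the bicycles in the image.\n"
        "2. Estimate the total price of the items shown.\n"
    )

print(synthesize_tasks("A street market with bicycles and fruit stalls.", fake_llm, n=2))
```

Swapping `fake_llm` for a real model client is the only change needed to try this against actual captions; the parsing is deliberately forgiving because LLM output formats vary.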
The step sampling mechanism is where things get interesting. At each decision point during task execution, instead of greedily selecting a single action, SPORT samples multiple candidate actions from the current policy. Think of it as parallel universes: the agent explores 3-5 different approaches simultaneously, executing each to see where it leads. This multi-trajectory exploration is crucial—it provides the comparative data needed for preference learning without requiring ground truth labels.
Here's a conceptual example of how the step sampling might work in code:
# Simplified SPORT step-sampling sketch (illustrative, not the repository's code)
class SPORTAgent:
    def __init__(self, policy, llm_verifier, execute_action, task):
        # Assumed collaborators: a policy exposing .sample(), a verifier exposing
        # .rank_outcomes(), and a callable that executes a single action.
        self.policy = policy
        self.llm_verifier = llm_verifier
        self.execute_action = execute_action
        self.current_task = task
        self.current_trajectory = []

    def explore_step(self, state, num_samples=4):
        # Sample multiple candidate actions from the current policy
        candidate_actions = [
            self.policy.sample(state, temperature=0.8)
            for _ in range(num_samples)
        ]

        # Execute each action and collect outcomes
        outcomes = []
        for action in candidate_actions:
            next_state, result = self.execute_action(state, action)
            outcomes.append({
                'action': action,
                'state': next_state,
                'result': result,
                'trajectory': self.current_trajectory + [action],
            })

        # LLM critic evaluates and ranks outcomes
        rankings = self.llm_verifier.rank_outcomes(
            context=state,
            outcomes=outcomes,
            task_goal=self.current_task,
        )
        return rankings  # Used for preference optimization
The LLM-based verification system acts as the discriminator in this setup. After the agent explores multiple action branches, a separate LLM (or the same model in a different mode) examines the outcomes and ranks them based on task completion, visual grounding accuracy, and action appropriateness. This is significantly cheaper than human annotation while providing surprisingly reliable training signals—the verifier doesn't need to be perfect, just consistent enough to identify relative quality.
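To make the "relative quality" point concrete, here is a small sketch of how a verifier's ranking could be turned into training pairs. It assumes the verifier returns candidate actions as a best-first list; the article does not specify the real output format, and `ranking_to_pairs` is a hypothetical helper.

```python
from itertools import combinations
from typing import List, Tuple

def ranking_to_pairs(ranked_actions: List[str]) -> List[Tuple[str, str]]:
    """Turn a best-first ranking into (chosen, rejected) preference pairs.

    combinations() preserves input order, so the first element of every
    pair is always ranked above the second.
    """
    return list(combinations(ranked_actions, 2))

# Assumed verifier output for one decision point, best candidate first.
ranking = ["crop_then_ocr", "ocr_full_image", "guess_answer"]
for chosen, rejected in ranking_to_pairs(ranking):
    print(f"prefer {chosen!r} over {rejected!r}")
```

Only the ordering matters here, which is why a verifier that is consistent but imperfect can still yield usable pairs.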
The preference tuning module takes these ranked trajectories and updates the agent's policy using techniques likely similar to Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Instead of traditional supervised learning where the agent tries to mimic gold-standard actions, preference learning teaches the agent to favor better trajectories over worse ones. The loss function effectively pulls the policy toward higher-ranked actions while pushing away from lower-ranked alternatives.
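If the update is DPO-style (the article only guesses at this), the per-pair loss can be written down directly. The sketch below uses the standard DPO objective with scalar log-probabilities; the function name, the `beta` default, and the example numbers are illustrative assumptions, not values from SPORT.

```python
import math

def softplus(z: float) -> float:
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin compares how much the
    policy has shifted toward the chosen action, relative to a frozen
    reference model, against how much it shifted toward the rejected one.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals softplus(-x).
    return softplus(-beta * margin)

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy now favors the chosen action: the loss is lower.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

Minimizing this loss raises the chosen action's log-probability relative to the rejected one, which is the "pull toward, push away" behavior described above.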
The integration with TongAgent—the underlying multimodal reasoning core—provides the actual execution environment. TongAgent handles vision-language encoding, tool invocation, and code execution, while SPORT's contribution is the self-improvement loop wrapped around it. This separation of concerns means the exploration and learning components could theoretically be adapted to other multimodal agent frameworks.
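One way to picture that separation of concerns is a narrow executor interface that the exploration loop depends on. Everything below is hypothetical: `Executor`, `EchoExecutor`, and `rollout` are illustrative names, not part of TongAgent or the SPORT repository.

```python
from typing import Any, List, Protocol, Tuple

class Executor(Protocol):
    """Minimal surface the exploration loop would need from an agent core."""

    def execute(self, state: Any, action: str) -> Tuple[Any, str]:
        ...

class EchoExecutor:
    """Toy executor that records actions instead of running tools."""

    def execute(self, state: List[str], action: str) -> Tuple[List[str], str]:
        return state + [action], f"ran {action}"

def rollout(executor: Executor, state: Any, actions: List[str]) -> Tuple[Any, List[str]]:
    # The loop only talks to the Executor interface, so any backend
    # implementing execute() could be substituted.
    results = []
    for action in actions:
        state, result = executor.execute(state, action)
        results.append(result)
    return state, results

print(rollout(EchoExecutor(), [], ["load_image", "run_ocr"]))
```

Under this assumption, porting the self-improvement loop to another framework would mean writing one adapter class rather than touching the sampling or learning code.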
What makes this architecture compelling is the feedback loop's closure: synthetic tasks prevent distribution mismatch, multi-trajectory sampling provides comparative data, LLM verification avoids human bottlenecks, and preference optimization learns from relative quality rather than absolute labels. The system demonstrated 7-8% improvements across metrics on the GTA benchmark, a substantial gain for multimodal agent tasks that typically show incremental progress.
Gotcha
The repository is essentially a research artifact rather than production software. Documentation is sparse—there's no clear guide on computational requirements, expected training duration, or how to reproduce the NeurIPS results. You'll find basic installation instructions but not the detailed configuration needed to actually run the self-exploration loop on your own data. Several components mentioned in the research appear incompletely implemented or require significant setup of external dependencies that aren't clearly documented.
More fundamentally, the self-improvement loop depends critically on LLM verifier quality. If your verification model can't reliably distinguish good from bad outcomes in your specific domain, the entire preference learning process becomes unreliable—garbage rankings produce garbage policy updates. The paper demonstrates this works for general visual question answering and tool use tasks, but specialized domains (medical imaging, industrial robotics, scientific analysis) might need domain-specific verifiers or human-in-the-loop validation. The system also inherits ShareGPT4V's biases and limitations; if your tasks involve visual scenarios poorly represented in that dataset, task synthesis quality will suffer. With only 20 GitHub stars and what appears to be an early-stage release, expect to debug missing pieces yourself.
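Before trusting a verifier in a new domain, one inexpensive check is to compare its rankings against a small human-ranked sample. The sketch below computes pairwise agreement between two best-first rankings of the same items; the helper name and the example rankings are assumptions for illustration, and what counts as an acceptable score is a judgment call.

```python
from itertools import combinations
from typing import List

def pairwise_agreement(verifier_rank: List[str], human_rank: List[str]) -> float:
    """Fraction of item pairs that both best-first rankings order the same way."""
    pos_v = {item: i for i, item in enumerate(verifier_rank)}
    pos_h = {item: i for i, item in enumerate(human_rank)}
    pairs = list(combinations(human_rank, 2))
    if not pairs:
        return 1.0  # Nothing to disagree about.
    agree = sum(
        (pos_v[a] < pos_v[b]) == (pos_h[a] < pos_h[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical rankings of three candidate trajectories.
human = ["traj_a", "traj_b", "traj_c"]
verifier = ["traj_a", "traj_c", "traj_b"]
print(pairwise_agreement(verifier, human))
```

A score near 0.5 means the verifier is close to random on that sample, which would make its preference pairs mostly noise.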
Verdict
Use SPORT if you're a researcher working on autonomous agent training, exploring alternatives to human annotation pipelines, or investigating preference-based learning for multimodal systems. The NeurIPS acceptance and architectural ideas are valuable even if you only implement parts of the approach in your own framework. The self-exploration loop concept is genuinely novel and addresses real scalability bottlenecks. Skip it if you need production-ready agent infrastructure or comprehensive documentation, or if you can't afford to reverse-engineer research code. Also skip it if you're working in specialized visual domains where LLM verification might not transfer well, or if you lack the computational resources to run iterative trajectory exploration (likely requiring significant GPU hours). Practitioners should wait for a more mature release or adapt specific components, such as the multi-trajectory sampling, into existing agent frameworks rather than adopting SPORT wholesale.