SPORT: Teaching Multimodal Agents to Self-Improve Without Human Feedback
Hook
What if AI agents could critique their own work and get better without humans labeling thousands of examples? SPORT demonstrates this isn’t science fiction—it’s a working system appearing at NeurIPS 2025.
Context
Training multimodal agents—systems that reason across images, text, and tool use—traditionally requires massive amounts of human feedback. Someone needs to evaluate whether the agent’s actions made sense, whether it chose the right tool, whether its reasoning was sound. This annotation bottleneck is expensive, slow, and fundamentally limits how quickly we can iterate on agent capabilities.
SPORT (described in the README as enabling ‘Iterative Trajectory Exploration for Multimodal Agents’) tackles this head-on by closing the loop entirely with AI feedback. Instead of humans rating agent trajectories, SPORT uses language models as critics to evaluate action quality, then feeds these AI preferences back into the training loop. The system operates on the GTA benchmark, a multimodal reasoning task requiring agents to answer questions by synthesizing information from images and selecting appropriate tools. The README reports improvements of 7% in answer accuracy (AnsAcc), 8% in tool accuracy (ToolAcc), and 7% in code execution success (CodeExec), all without human annotations.
Technical Insight
SPORT’s architecture implements a four-stage online learning loop. The Task Synthesis stage uses language models to generate diverse multimodal problems, drawing on the ShareGPT4V dataset (which provides image captions and caption embeddings available via Google Drive). According to the README, this creates a continuous stream of training scenarios.
The Step Sampling phase generates multiple candidate trajectories rather than committing to a single action at each decision point. The agent proposes several approaches simultaneously—different tools, different reasoning chains, or different code snippets to accomplish the same sub-goal. This sampling creates the alternatives needed for preference learning.
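The repository doesn't document the sampling mechanics, but temperature-based sampling over a policy's action distribution is the standard way to produce such alternatives. Here is a minimal, self-contained sketch of the idea; the function names and the toy tool list are illustrative, not SPORT's actual API:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature controls diversity."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_candidates(logits, actions, num_samples=4, temperature=0.7, seed=0):
    """Draw several candidate actions instead of committing to the argmax.

    The sampled set supplies the alternatives that preference learning
    later compares against each other.
    """
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return [rng.choices(actions, weights=probs, k=1)[0] for _ in range(num_samples)]

# Toy decision point: which tool should the agent call next?
tools = ["image_captioner", "ocr_reader", "code_executor"]
logits = [2.0, 1.5, 0.3]
candidates = sample_candidates(logits, tools, num_samples=4)
```

With temperature below 1.0 the distribution sharpens toward the highest-scoring tool, while still occasionally surfacing alternatives the critic can compare.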
Step Verification uses an LLM critic to evaluate these alternatives. The README describes this as using ‘LLM as a critic to compare and rank action outcomes.’ The critic evaluates trajectories based on their likelihood of achieving the task goal, generating comparative judgments rather than requiring human ground truth.
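The critic's exact prompt isn't published, so the following is a hypothetical sketch of the pattern: build a pairwise-comparison prompt, send it to an LLM (stubbed out here; a real system would call an LLM API), and parse the verdict into a (chosen, rejected) pair:

```python
def build_critic_prompt(task, traj_a, traj_b):
    """Assemble a pairwise-comparison prompt for an LLM critic.

    The wording is hypothetical; the README only says the critic
    compares and ranks action outcomes.
    """
    return (
        f"Task: {task}\n\n"
        f"Trajectory A:\n{traj_a}\n\n"
        f"Trajectory B:\n{traj_b}\n\n"
        "Which trajectory is more likely to achieve the task goal? "
        "Answer with exactly 'A' or 'B'."
    )

def parse_preference(critic_reply, traj_a, traj_b):
    """Map the critic's verdict to a (chosen, rejected) preference pair."""
    verdict = critic_reply.strip().upper()
    if verdict.startswith("A"):
        return traj_a, traj_b
    if verdict.startswith("B"):
        return traj_b, traj_a
    raise ValueError(f"Unparseable critic reply: {critic_reply!r}")

prompt = build_critic_prompt(
    "Count the red cars in the image",
    "Call object_detector(label='car', color='red'), then count boxes",
    "Call image_captioner() and guess the count from the caption",
)
chosen, rejected = parse_preference("A", "trajectory_a", "trajectory_b")
```

Forcing a single-letter verdict keeps parsing trivial; production systems usually add retries or a tie option for ambiguous comparisons.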
The Preference Tuning stage closes the loop by updating the agent policy using these AI-generated preferences. The README describes this as ‘preference-based optimization’ though the specific algorithm isn’t detailed. Based on the project structure, here’s how the training loop likely works:
```python
# Simplified example based on SPORT's documented architecture;
# class and method names are inferred from the project structure
from tongagent import SPORTAgent
from closed_loop_verifier import LLMCritic
from data_generation import TaskSynthesizer

# Initialize components
agent = SPORTAgent(config='configs/agent_config.yaml')
critic = LLMCritic()
synthesizer = TaskSynthesizer(data_path='data_generation/sharegpt4v')

num_iterations = 100  # length of the self-exploration run

# Self-exploration loop
for iteration in range(num_iterations):
    # Generate multimodal tasks
    tasks = synthesizer.generate_tasks(batch_size=32)

    # For each task, sample multiple candidate trajectories
    for task in tasks:
        trajectories = agent.sample_trajectories(
            task=task,
            num_samples=4,    # candidate action sequences per task
            temperature=0.7,  # sampling temperature for diversity
        )

        # Critic compares and ranks the candidates
        preferences = critic.rank_trajectories(
            task=task,
            trajectories=trajectories,
        )

        # Update the policy with the AI-generated preference feedback
        agent.update_policy(preferences)
```
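The policy-update step is where the undocumented ‘preference-based optimization’ happens. The README doesn't name the algorithm, but if it resembles Direct Preference Optimization (DPO), a common choice for tuning on preference pairs, the per-pair loss would look roughly like this (an assumption for illustration, not SPORT's confirmed method):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) trajectory pair.

    Inputs are log-probabilities of each trajectory under the current
    policy and a frozen reference model; beta scales the implicit
    KL penalty toward the reference.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits)
    return math.log1p(math.exp(-logits))

# The loss falls as the policy favors the chosen trajectory more
# strongly than the reference model does
loose = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # no separation yet
tight = dpo_loss(-2.0, -8.0, -5.0, -5.0)  # clear separation: lower loss
```

Minimizing this loss pushes the policy to assign relatively more probability mass to critic-preferred trajectories without drifting far from the reference model.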
The approach’s scalability advantage comes from the critic evaluating trajectories at machine speed rather than waiting on human annotators. The GTA benchmark results validate this: the reported 7%, 8%, and 7% gains in answer accuracy, tool accuracy, and code execution success, respectively, were achieved with AI feedback alone.
The data preparation pipeline leverages ShareGPT4V caption embeddings stored in the data_generation/sharegpt4v directory (downloaded separately via Google Drive). These embeddings likely enable efficient semantic operations during task synthesis without recomputing vision features repeatedly.
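If that reading is correct, task synthesis reduces to cheap vector operations over the cached embeddings, for example grouping semantically related images into one multi-image task. A sketch of that retrieval step, with toy 3-d vectors standing in for the real cached embeddings (the function names are mine, not the repository's):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_related_captions(query_emb, caption_embs, top_k=2):
    """Rank precomputed caption embeddings against a query embedding.

    This avoids re-running a vision encoder: similarity is computed
    directly on the cached vectors.
    """
    scored = [(cosine_similarity(query_emb, emb), idx)
              for idx, emb in enumerate(caption_embs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Toy embeddings: two car captions and one unrelated caption
captions = ["a red car on a street", "a crimson sedan parked", "a bowl of soup"]
embs = [[0.9, 0.1, 0.0], [0.85, 0.2, 0.05], [0.0, 0.1, 0.95]]
related = retrieve_related_captions([1.0, 0.1, 0.0], embs, top_k=2)
```

At ShareGPT4V scale a real pipeline would use a vectorized library (NumPy, FAISS) rather than pure-Python loops, but the operation is the same.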
Gotcha
SPORT is a research artifact, not a production-ready library, and this shows in several ways. The README provides minimal usage documentation beyond installation commands (pip install -r requirements.txt) and a project structure overview. There are no runnable examples, no clear entry points for training your own agent, and no explanation of what the scripts in the script/ directory actually do. The README notes that ‘The project uses environment variables for configuration’ and requires setting up .env and configs/agent_config.yaml files, but provides no details about what credentials or parameters these should contain.
The external dependencies are particularly challenging. The README directs users to download ‘images and embeddings’ from Google Drive and place them in ‘data_generation/sharegpt4v’, but there’s no indication of file sizes, directory structure, or how to verify correct integration. The requirement for configuration files is mentioned but not explained—what credentials are needed? Which LLM provider does the critic use? What are the expected configuration fields? Additionally, the project’s limited community engagement (20 stars at time of writing) means there’s minimal ecosystem of issues, pull requests, or forum discussions to help troubleshoot problems.
Verdict
Use SPORT if you’re an AI researcher exploring self-improvement mechanisms for multimodal agents and have the patience to reverse-engineer research code from a NeurIPS paper. The core idea, using LLM critics to generate preferences for agent training, addresses a real scalability bottleneck in agent development. If you’re working on similar benchmarks (GTA or comparable multimodal reasoning tasks), SPORT provides a concrete reference implementation for AI-feedback-driven training loops. The architectural pattern of synthesis-sample-verify-tune described in the README is generalizable and could inspire your own agent training pipelines.
Skip SPORT if you need production-ready agent deployment, comprehensive documentation, or a quick way to experiment with self-improving agents in your own domain. The repository is best viewed as a research artifact accompanying an academic paper rather than a maintained open-source library. For practical agent building, you’ll get more mileage from established frameworks like LangChain, AutoGPT, or OpenAI’s Assistants API. Use SPORT for the ideas and architectural insights, not for the code as-is.