LLaVA-CoT: Teaching Vision Models to Show Their Work

Hook

A vision model smaller than GPT-3.5 is outperforming GPT-4o-mini and Gemini-1.5-pro on multimodal reasoning benchmarks—not by being larger, but by learning to explain its thinking step-by-step.

Context

Vision-language models have scaled to hundreds of billions of parameters chasing better performance, but they remain black boxes. Ask GPT-4V to count objects in a complex scene and you get an answer—maybe right, maybe wrong—with no insight into how it arrived there. This opacity becomes critical in domains like medical imaging or scientific diagram interpretation, where the reasoning process matters as much as the final answer.

LLaVA-CoT from PKU’s Yuan Group flips the paradigm. Instead of training models to output answers directly, it teaches them to generate structured reasoning traces first—identifying relevant visual elements, working through logical steps, then arriving at conclusions. Trained on a curated 100k-example dataset with explicit reasoning chains, the model demonstrates that systematic thinking can be taught rather than left to emerge. Accepted at ICCV 2025, it represents a shift from pure scaling to architectural intentionality in multimodal AI.

Technical Insight

[System architecture diagram, auto-generated. Inference process: image + question input → vision encoder → visual embeddings → Llama-3.2 language model → reasoning chain (problem decomposition, visual info extraction, step-by-step logic, final answer) → interpretable answer. Training pipeline: GPT-4V generator → reasoning traces → filter & validate → structured examples → LLaVA-CoT-100k dataset → supervised fine-tuning.]

LLaVA-CoT’s innovation lies in its training methodology. The team created LLaVA-CoT-100k, a dataset where each visual question is paired with structured reasoning traces following a consistent pattern: problem decomposition, visual information extraction, step-by-step logic, and final conclusion. The model appears to build on Llama-3.2-Vision based on the model naming (Llama-3.2V-11B-cot), learning reasoning structure through supervised fine-tuning to make chain-of-thought reasoning spontaneous rather than prompt-engineered.
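
The README’s examples suggest each trace walks through those four stages in a fixed order. As a sketch, the stages can be marked with explicit tags and checked mechanically—the tag names below are illustrative assumptions, not necessarily the exact markup the dataset uses:

```python
import re

# Illustrative stage tags; the actual LLaVA-CoT markup may differ.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def has_ordered_stages(trace: str) -> bool:
    """Return True if all four stages appear exactly once, in order."""
    positions = []
    for stage in STAGES:
        matches = re.findall(rf"<{stage}>.*?</{stage}>", trace, re.DOTALL)
        if len(matches) != 1:
            return False
        positions.append(trace.index(f"<{stage}>"))
    return positions == sorted(positions)

sample = (
    "<SUMMARY>Count the red blocks in the image.</SUMMARY>"
    "<CAPTION>A table holds five blocks: three red, two blue.</CAPTION>"
    "<REASONING>Only the red blocks count, so the total is 3.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
print(has_ordered_stages(sample))  # True
```

A checker like this is the kind of structural constraint that makes the template learnable: every training example exposes the same skeleton.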

The inference process differs markedly from standard vision-language models. When you query LLaVA-CoT, it doesn’t immediately predict an answer token. Instead, it first generates a reasoning chain in natural language. The README demonstrates this with examples showing the model outlining problems, interpreting image information, proceeding step-by-step through reasoning, and reaching supported conclusions.

The training process leverages their open-source dataset generation pipeline available at dataset_generation/generate.py. They created LLaVA-CoT-100k by prompting stronger models (GPT-4V) to generate reasoning traces for existing vision-language datasets, then filtering and validating outputs. This generation script allows researchers to create similar datasets for domain-specific applications—a medical imaging researcher could generate reasoning traces for diagnostic tasks, or a robotics team could create step-by-step visual planning examples.
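
In outline, that pipeline amounts to: prompt a stronger model for a trace, then keep only traces that pass validation. A hedged sketch of the filter-and-validate step—`generate_trace` here is a stand-in for the GPT-4V call in `dataset_generation/generate.py`, whose actual interface isn’t shown in the README:

```python
def generate_trace(question: str, image_path: str) -> str:
    """Stand-in for the GPT-4V call; returns a structured trace."""
    # In the real pipeline this would call a vision API with a prompt
    # asking for the four-stage reasoning format.
    return (
        "<SUMMARY>...</SUMMARY><CAPTION>...</CAPTION>"
        "<REASONING>...</REASONING><CONCLUSION>...</CONCLUSION>"
    )

def build_dataset(samples):
    """Filter & validate: keep only well-formed generated traces."""
    required = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")
    dataset = []
    for question, image_path, answer in samples:
        trace = generate_trace(question, image_path)
        # Reject traces missing any stage; a real filter might also
        # check that the conclusion agrees with the reference answer.
        if all(f"<{s}>" in trace and f"</{s}>" in trace for s in required):
            dataset.append({"question": question, "image": image_path,
                            "trace": trace, "answer": answer})
    return dataset
```

A domain adaptation (medical imaging, robotics planning) would swap in a different generator prompt and tighter validation rules while keeping this loop intact.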

What makes this approach work is the consistency of reasoning structure across training examples. The model learns templates for articulating its reasoning process. The 11B parameter count appears sufficient because the model isn’t learning to reason from scratch—it’s learning to articulate reasoning it already has latent capacity for.
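
Concretely, that consistency means every supervised fine-tuning record pairs the same prompt shape with the same target shape, so the template is baked into the training target rather than requested at inference time. A sketch of how one SFT pair might be assembled—the field names and the `<image>` placeholder are assumptions, not the repo’s actual schema:

```python
def to_sft_example(record: dict) -> dict:
    """Format one dataset record as a supervised fine-tuning pair.

    The model learns to emit the full structured trace as its target,
    so the reasoning template becomes spontaneous, not prompted-for.
    """
    prompt = f"<image>\nQuestion: {record['question']}"
    target = (
        f"<SUMMARY>{record['summary']}</SUMMARY>"
        f"<CAPTION>{record['caption']}</CAPTION>"
        f"<REASONING>{record['reasoning']}</REASONING>"
        f"<CONCLUSION>{record['answer']}</CONCLUSION>"
    )
    return {"prompt": prompt, "target": target}
```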

The repository includes complete training code released on January 8, 2025 (as noted in the README), covering the pipeline from data preprocessing through distributed training. The pretrained weights are available on Hugging Face, and a Gradio demo allows interactive testing of the reasoning capabilities.

Gotcha

The project’s transparency about benchmarking errors is commendable, but the errors themselves are concerning. In January 2025, the team disclosed that they had been evaluating on AI2D_TEST_NO_MASK instead of the standard AI2D_TEST benchmark, discovering this only after publication. The README states: “We discovered that when testing with the AI2D benchmark, we were using AI2D_TEST_NO_MASK, while the VLMEvalKit utilizes AI2D_TEST. We previously overlooked the distinction between the two, and we sincerely apologize for this oversight.” While they acknowledged the mistake publicly and committed to corrections, it raises questions about evaluation rigor across their other reported metrics. The README shows impressive numbers beating GPT-4o-mini and Gemini-1.5-pro, but without independent replication, treat these claims cautiously.

Deployment considerations matter more than the benchmarks suggest. Generating reasoning traces before every answer means higher latency and token costs compared to direct-answer models. If you’re building a production application where users expect sub-second responses, the reasoning overhead becomes prohibitive. The model appears to use Llama-based architecture based on naming conventions—the README shows an Apache 2.0 license for code but doesn’t explicitly detail model licensing terms for commercial applications. Documentation remains sparse on specific training hyperparameters, convergence time, and precise hardware requirements beyond the availability of training code. For researchers working with limited compute budgets, more detailed resource specifications would be helpful.
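
The overhead is easy to ballpark: decode time scales roughly linearly with generated tokens, so a four-stage trace costs a multiple of a direct answer. The numbers below are illustrative assumptions, not measurements of this model:

```python
# Back-of-envelope latency estimate; all figures are assumptions.
direct_answer_tokens = 10
reasoning_trace_tokens = 300   # summary + caption + reasoning + conclusion
tokens_per_second = 40         # assumed decode speed for an 11B model

direct_latency = direct_answer_tokens / tokens_per_second     # 0.25 s
cot_latency = reasoning_trace_tokens / tokens_per_second      # 7.5 s

overhead = cot_latency / direct_latency
print(f"~{overhead:.0f}x more decode time per query")  # ~30x
```

Even with generous decode speeds, multi-second responses are the norm, which is why the reasoning overhead is prohibitive for sub-second user-facing applications.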

Verdict

Use LLaVA-CoT if you’re building applications where reasoning transparency matters more than raw speed—educational tools that need to show students problem-solving steps, scientific analysis systems where experts need to verify model logic, or complex visual QA where answer justification is critical. The 11B parameter size suggests it may be deployable on capable GPU workstations, and the open training code plus dataset generation scripts let you adapt the approach to domain-specific reasoning tasks. It’s particularly valuable if you’re researching how to make vision models more interpretable without scaling to trillion-parameter behemoths. The pretrained weights are available on Hugging Face (Xkev/Llama-3.2V-11B-cot) with an interactive Gradio demo for testing.

Skip it if you need production-grade reliability given the acknowledged benchmarking issues, require minimal-latency inference for user-facing applications (the reasoning traces add overhead), or need explicitly documented commercial licensing terms. Also skip it for straightforward classification or detection tasks where reasoning chains add overhead without value—stick with standard vision-language models for those cases.

The real innovation here isn’t just the benchmark numbers; it’s demonstrating that systematic reasoning can be taught through careful dataset design and training methodology rather than requiring massive scale alone.
