LLaVA-CoT: Teaching Vision-Language Models to Show Their Work

Hook

An 11-billion parameter model just beat GPT-4o-mini and Gemini-1.5-pro on multimodal reasoning benchmarks. The secret? It learned to think out loud before answering.

Context

Vision-language models have gotten remarkably good at answering questions about images—until you ask them to explain their reasoning. GPT-4o can tell you there are three apples in a basket, but the path from pixels to answer remains opaque. This black-box behavior becomes problematic when these models fail: did it miscount, misidentify objects, or misunderstand the question entirely?

The research team at Peking University's Yuan Group recognized that the frontier of multimodal AI isn't just about better answers—it's about trustworthy reasoning. While the industry races toward ever-larger models, LLaVA-CoT takes a different approach: explicitly training models to articulate their reasoning process through structured chain-of-thought traces. Accepted at ICCV 2025, this work demonstrates that systematic reasoning can be learned, not just prompted, creating models that explain their visual analysis before reaching conclusions. It's the difference between a student who shows their work and one who only writes the final answer.

Technical Insight

LLaVA-CoT builds on Llama-3.2-Vision by introducing a training regime focused on four-stage reasoning: problem outline, visual interpretation, step-by-step analysis, and supported conclusion. The architecture itself doesn't radically depart from standard vision-language models—it still uses a vision encoder to process images and a language model to generate text. The innovation lies in what it's trained to generate.

The training dataset, LLaVA-CoT-100k, contains meticulously structured reasoning traces that force the model to decompose complex visual problems. Here's what a typical reasoning trace looks like in practice:

# Example reasoning structure from LLaVA-CoT
{
  "image": "geometry_problem.jpg",
  "question": "What is the area of the shaded region?",
  "reasoning": {
    "outline": "Need to find the area of shaded region by identifying shapes and applying geometric formulas",
    "visual_interpretation": "The image shows a square with side length 10 units containing a circle. The shaded region is the area between the square and the inscribed circle.",
    "steps": [
      "Calculate square area: 10 × 10 = 100 square units",
      "Circle is inscribed, so diameter equals square side: d = 10, r = 5",
      "Calculate circle area: π × 5² = 25π ≈ 78.54 square units",
      "Shaded area = square area - circle area: 100 - 25π"
    ],
    "conclusion": "The shaded region has an area of (100 - 25π) square units, approximately 21.46 square units."
  }
}

This structured approach forces the model to commit to interpretations at each stage, making errors easier to diagnose. If the model incorrectly identifies the circle as circumscribed rather than inscribed, you'll see that mistake in the visual_interpretation stage, not buried in an opaque final answer.

The training process uses a carefully curated dataset where reasoning traces were initially generated by more powerful models, then filtered and refined. The repository provides the complete pipeline:

# Training LLaVA-CoT from the repository
git clone https://github.com/PKU-YuanGroup/LLaVA-CoT.git
cd LLaVA-CoT

# Install dependencies
pip install -r requirements.txt

# Download the LLaVA-CoT-100k dataset
# (Contains structured reasoning examples)
python scripts/download_data.py

# Fine-tune on chain-of-thought data
bash scripts/train_llava_cot.sh \
  --model_name_or_path llama-3.2-vision \
  --data_path ./data/llava_cot_100k.json \
  --output_dir ./checkpoints/llava-cot-11b

What makes this particularly clever is the data generation methodology. Rather than trying to handcraft 100,000 reasoning traces, the team used stronger models to generate initial traces, then applied filtering criteria to ensure quality and consistency. This creates a form of distillation where the systematic reasoning capability of large models gets compressed into an 11B parameter model that can run on more modest hardware.

The model's architecture maintains compatibility with the LLaVA family, meaning you can use standard inference patterns but with enhanced output:

from llava.model import LlavaCotForConditionalGeneration
from transformers import AutoProcessor
from PIL import Image

model = LlavaCotForConditionalGeneration.from_pretrained(
    "PKU-YuanGroup/LLaVA-CoT-11B"
)
processor = AutoProcessor.from_pretrained("PKU-YuanGroup/LLaVA-CoT-11B")

image = Image.open("complex_diagram.jpg")
prompt = "Explain the process shown in this flowchart."

inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)

# Output includes structured reasoning trace
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
# Output:
# **Outline:** The flowchart depicts a data processing pipeline...
# **Visual Interpretation:** I observe five connected boxes...
# **Step-by-step:** First, raw data enters through...
# **Conclusion:** This flowchart represents a typical ETL process...

The performance gains are substantial—on benchmarks like MathVista and AI2D, the 11B model competes with or exceeds models 8x its size. This suggests that explicit reasoning structure acts as a force multiplier, allowing smaller models to punch above their weight class when tackling problems that benefit from systematic decomposition.

Gotcha

The elephant in the room is dataset quality. LLaVA-CoT's reasoning ability is directly limited by the quality of its training examples, which were themselves generated by existing models. This creates a ceiling effect—the model can learn to mimic the reasoning patterns it was trained on, but struggles with novel reasoning structures or domains not well-represented in the 100k examples. If your use case involves highly specialized visual reasoning (say, analyzing medical imaging or satellite data), you'll likely need to generate domain-specific reasoning traces and fine-tune further.

There's also the acknowledged evaluation confusion around the AI2D benchmark, where the team mixed up AI2D_TEST_NO_MASK and AI2D_TEST variants. While they've been transparent about this issue, it raises questions about the rigor of the reported numbers. In production settings, you should validate performance on your specific tasks rather than relying solely on benchmark claims. Additionally, the reasoning trace generation adds noticeable latency—where a standard VLM might respond in 2-3 seconds, LLaVA-CoT might take 5-8 seconds to work through its structured reasoning. For interactive applications where speed matters, this becomes a real constraint. The model is also fundamentally a research artifact; expect rough edges, incomplete documentation in places, and the need to dig into code to understand implementation details.

Verdict

Use if: You're building applications where explainability matters as much as accuracy—educational tools, scientific analysis platforms, or any system where users need to verify the model's reasoning. The structured output makes it excellent for debugging model behavior and building trust with non-technical stakeholders. It's also a strong choice if you're hardware-constrained but need strong reasoning capabilities; the 11B model delivers competitive performance at a fraction of the computational cost of frontier models. Skip if: You need production-ready stability and comprehensive documentation—this is a research project that requires comfort with rough edges. Also skip if your use case is latency-sensitive (the multi-stage reasoning adds overhead) or if you're working in highly specialized domains far from the training distribution. For those cases, you're better off with established options like GPT-4o or investing in domain-specific fine-tuning of standard VLMs. Finally, if you just need good answers without caring about the reasoning process, the added complexity of structured CoT may not justify the effort.

LLaVA-CoT: Teaching Vision-Language Models to Show Their Work

LLaVA-CoT: Teaching Vision-Language Models to Show Their Work

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

LLaVA-CoT: Teaching Vision-Language Models to Show Their Work

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

ds4: The SSD-Streaming Inference Engine That Treats Your Mac's NVMe Like RAM

Nanocoder: The Terminal Coding Agent That Lets You Switch Models Mid-Conversation

Shard: Proving LLM Inference Can Work Across Scattered GPUs and Terrible Internet

Harness-1: Training Search Agents with State Externalization

ds4: The SSD-Streaming Inference Engine That Treats Your Mac's NVMe Like RAM

Nanocoder: The Terminal Coding Agent That Lets You Switch Models Mid-Conversation

Shard: Proving LLM Inference Can Work Across Scattered GPUs and Terrible Internet

// CODEBASE INTELLIGENCE

Best for

Skip when