VQASynth: Teaching Vision Models Spatial Reasoning Without Measuring a Single Distance
Hook
Vision language models can describe what's in an image with uncanny accuracy, yet ask them how far apart two objects are and they'll confidently hallucinate. VQASynth fixes this by creating spatial reasoning datasets from ordinary photos—no LiDAR required.
Context
The spatial reasoning problem in vision language models isn't subtle. Ask GPT-4V or Claude how many feet separate a coffee cup from a laptop in a photo, and you'll get answers that vary wildly between attempts or contradict basic physics. This isn't because these models lack capability—it's because their training data lacks spatial annotations.
Most vision-language pretraining happens on alt-text, captions, and web-scraped image-text pairs. These describe what objects appear in images, not where they are or how they relate spatially. Humans naturally acquire spatial reasoning through embodied experience, but VLMs have no such grounding. SpatialVLM from Google Research demonstrated that fine-tuning on synthetic spatial data could bridge this gap, but the approach remained locked behind closed doors. VQASynth emerges as the open-source answer: a complete pipeline that transforms any image dataset into rich spatial reasoning training data by reconstructing 3D scenes, extracting spatial relationships, and generating chain-of-thought question-answer pairs.
Technical Insight
VQASynth's architecture orchestrates four specialized models into a coherent spatial understanding pipeline. The process begins with metric depth estimation using VGGT (Visual Geometry Grounded Transformer), which predicts per-pixel depth values in actual metric units rather than relative depth orderings. This distinction matters critically—knowing that a pixel is "closer" tells you nothing about whether an object is 2 feet or 20 feet away.
Once depth is established, SAM2 (Segment Anything Model 2) performs semantic segmentation to isolate individual objects. These segments become 3D point clouds when combined with depth maps, allowing the pipeline to calculate actual spatial relationships. Molmo, a point-prompted vision-language model, generates natural language descriptions of each segmented object when given specific pixel coordinates.
Here's what the dataset generation looks like in practice:
from vqasynth import VQASynthPipeline
# Initialize the pipeline with your model choices
pipeline = VQASynthPipeline(
depth_model="vggt",
segmentation_model="sam2",
caption_model="molmo-7B-D",
device="cuda"
)
# Generate spatial QA pairs from an image
results = pipeline.process_image(
image_path="living_room.jpg",
num_qa_pairs=20,
include_cot=True # Include chain-of-thought reasoning
)
# Results contain structured spatial relationships
for qa in results['qa_pairs']:
print(f"Q: {qa['question']}")
print(f"Reasoning: {qa['reasoning']}")
print(f"A: {qa['answer']}\n")
# Example output:
# Q: How far is the couch from the coffee table?
# Reasoning: The couch occupies the region from (x1, y1) to (x2, y2)
# with an average depth of 3.2m. The coffee table center is at
# depth 2.1m. The horizontal separation is approximately 0.8m.
# A: The couch is approximately 1.4 meters from the coffee table.
The chain-of-thought reasoning isn't cosmetic—it's essential for training models that can explain their spatial judgments. Without it, models learn to pattern-match questions to answers without developing actual geometric understanding.
The templating system generates diverse question types from the extracted spatial graph. Distance queries ("How far is X from Y?"), orientation questions ("Is X to the left or right of Y?"), and compositional spatial reasoning ("What is the closest object to X that is also behind Y?") all derive from the same underlying 3D reconstruction. The templates support customization for domain-specific vocabulary:
# Custom templates for robotics scenarios
custom_templates = [
{
"type": "navigation",
"template": "To reach {object_a} from {object_b}, should the robot move forward or backward?",
"reasoning_template": "Object {object_a} has depth {depth_a}m while {object_b} has depth {depth_b}m. Since {comparison}, the robot should move {direction}."
},
{
"type": "manipulation",
"template": "Can the robot grasp {object_a} without moving {object_b}?",
"reasoning_template": "The clearance between {object_a} and {object_b} is {distance}m. Given a typical gripper width of 0.1m, {conclusion}."
}
]
pipeline.add_templates(custom_templates)
The true architectural cleverness lies in how VQASynth handles uncertainty propagation. Depth estimation isn't perfect—edges are fuzzy, occlusions create artifacts, reflective surfaces confuse sensors. Rather than pretending measurements are precise, the pipeline can generate confidence scores and even include uncertainty in the chain-of-thought reasoning ("The depth estimate has high variance in this region, but the object appears to be between 2-3 meters away").
For training VLMs with the generated data, VQASynth integrates with LoRA (Low-Rank Adaptation) fine-tuning workflows. The repository includes example training scripts that demonstrate how to use the synthetic QA pairs to specialize models like LLaVA or Qwen-VL for spatial reasoning tasks. The authors successfully trained SpaceOm and SpaceThinker models, showing measurable improvements on distance estimation benchmarks—not by making the vision encoder better at depth perception, but by teaching the language model to interpret spatial features the vision encoder already extracts.
Gotcha
VQASynth's Achilles heel is the fundamental limitation of monocular depth estimation. Reconstructing 3D geometry from a single 2D image is an ill-posed problem—multiple 3D scenes can project to identical images. While VGGT performs admirably, it still struggles with transparent objects, mirrors, textureless surfaces, and scenes with unusual lighting. If your depth estimates are garbage, all downstream spatial reasoning inherits that garbage.
The templated question generation, while scalable, produces a distinctly synthetic flavor. Questions follow predictable patterns that don't capture the messy, contextual way humans ask about space. "How far is the couch from the table?" appears in the dataset, but you won't find "If I'm sitting on the couch, can I reach my drink on the table without getting up?"—the kind of embodied, purpose-driven spatial query that matters in real applications. Models trained purely on VQASynth data may excel at benchmark distance estimation but fumble when faced with natural language spatial questions that don't match the templates. The solution involves mixing synthetic data with smaller amounts of human-annotated examples, but that reintroduces the annotation cost VQASynth aimed to eliminate.
Verdict
Use if: You're building embodied AI systems (robots, AR applications, navigation assistants) where spatial reasoning is critical and you need training data at scale. VQASynth shines when you have domain-specific images (warehouse floors, surgical environments, home interiors) but lack spatial annotations. It's also excellent for research into spatial reasoning capabilities, letting you control dataset properties and run ablation studies. Skip if: You need millimeter-precise measurements for engineering applications—use actual depth sensors or LiDAR instead. Also skip if your VLM application doesn't involve spatial reasoning (summarization, OCR, general image description) or if you're working with image types where monocular depth estimation fails catastrophically (microscopy, astronomical imagery, abstract art). Finally, if you need production-ready spatial QA with natural language diversity, you'll need to supplement VQASynth's synthetic data with human annotations rather than relying on it exclusively.