
VQASynth: Teaching Vision Models to Think in 3D With Synthetic Spatial Reasoning Data


Hook

Standard vision-language models can tell you there’s a forklift and some boxes, but ask them how far apart those objects are in meters and they’ll confidently hallucinate an answer. The missing ingredient? Training data that teaches spatial reasoning.

Context

Vision-language models have gotten remarkably good at describing what they see—they can identify objects, read text, and even understand complex scenes. But there’s a conspicuous gap in their capabilities: they struggle with spatial reasoning. Ask a model like GPT-4V or Claude how far two objects are from each other in an image, and you’ll get vague responses or wildly inaccurate estimates. This isn’t because the models are fundamentally incapable—it’s because their pretraining data rarely includes metric spatial information.

For embodied AI applications like robotics, warehouse automation, or autonomous navigation, this limitation is critical. A robot needs to know not just that there’s a pallet near a forklift, but exactly how many centimeters separate them. The original SpatialVLM research from proprietary labs demonstrated that you could enhance VLMs with spatial reasoning by fine-tuning them on synthetic datasets that combine 3D scene reconstruction with templated questions. VQASynth is the open-source reproduction of that approach, updated with improved models like VGGT for depth estimation and SAM2 for segmentation. The tool allows you to curate image datasets from HuggingFace Hub and generate rich question-answer pairs grounded in reconstructed 3D geometry.

Technical Insight

Synthesis Pipeline

System architecture (auto-generated):

1. Scene understanding: an RGB image feeds VGGT depth estimation and SAM2 segmentation, producing a depth map and object masks.
2. The depth map and object masks are fused into a 3D scene reconstruction; in parallel, Molmo point prompting yields object-grounded captions.
3. A spatial relationship calculator measures distances and orientations in the reconstructed scene, and a VQA template generator turns them into synthetic QA pairs with chain-of-thought (CoT).
4. The synthetic dataset is used for VLM fine-tuning with LoRA.

Based on the codebase, VQASynth coordinates multiple computer vision models into a scene-understanding pipeline. The process begins with monocular depth estimation using VGGT, which the README notes ‘improves metric depth estimation speed & accuracy by replacing DepthPro’; this produces a depth map with per-pixel distance estimates from a single RGB image. Next, SAM2 performs object detection and segmentation, identifying individual objects and their precise boundaries (according to the README, SAM2 replaced the original SAM in the localization refinement stage). The key innovation is what happens next: Molmo generates object-grounded captions through point prompting, creating semantic descriptions tied to specific pixel locations.
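VQASynth’s exact interfaces aside, the geometric core of turning a metric depth map into 3D scene points is standard pinhole back-projection. A minimal numpy sketch (the camera intrinsics fx, fy, cx, cy are illustrative assumptions, not values from the repo):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to (H, W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy 4x4 depth map, 2 m everywhere; principal point at the image center.
depth = np.full((4, 4), 2.0)
points = backproject_depth(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(points.shape)   # (4, 4, 3)
print(points[2, 2])   # pixel at the principal point -> [0. 0. 2.]
```

Masking this point cloud with SAM2’s per-object masks is what yields per-object 3D geometry downstream.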

The pipeline then appears to reconstruct a 3D scene by combining the depth map with segmentation masks. It calculates spatial relationships between objects—distances, orientations, relative positions—using the depth information and object centroids. These geometric calculations become the ground truth for templated VQA generation. The templates ask questions like “How close is the [object A] from the [object B]?” or “Does [object A] appear on the left side of [object B]?” and generate answers with explicit metric values: “The man in the red hat is approximately 60.13 centimeters from the wooden pallet.”
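Under the same assumptions, the spatial-relationship and templating steps reduce to measuring centroid distances between masked point sets and filling a question template. A hedged sketch (the helper names and phrasing are illustrative; only the quoted template text comes from the repo’s examples):

```python
import numpy as np

def object_centroid(points, mask):
    """Mean 3D position of the pixels selected by a boolean mask."""
    return points[mask].mean(axis=0)

def distance_qa(name_a, name_b, points, mask_a, mask_b):
    """Fill a distance template with a metric answer in centimeters."""
    dist_m = np.linalg.norm(object_centroid(points, mask_a)
                            - object_centroid(points, mask_b))
    question = f"How close is the {name_a} from the {name_b}?"
    answer = (f"The {name_a} is approximately "
              f"{dist_m * 100:.2f} centimeters from the {name_b}.")
    return question, answer

# Two toy "objects": single points 0.6 m apart along the x-axis.
points = np.zeros((1, 2, 3))
points[0, 1] = [0.6, 0.0, 0.0]
mask_a = np.array([[True, False]])
mask_b = np.array([[False, True]])
q, a = distance_qa("man in the red hat", "wooden pallet", points, mask_a, mask_b)
print(q)
print(a)  # ... approximately 60.00 centimeters ...
```

Because the answers are computed from reconstructed geometry rather than annotated by humans, the metric labels are only as good as the depth and segmentation feeding them.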

What makes this approach particularly interesting is the chain-of-thought reasoning structure. Each answer is prefixed with a <think> block that shows the model’s reasoning process:

<think>Alright, let me break this down. The man in the red hat is walking 
in a warehouse aisle, and there's a wooden pallet loaded with boxes right 
behind him. The pallet is attached to a manual pallet jack, which appears 
to have been pulled forward...Taking the average stride length of an adult 
male into account (about 0.75 meters or 75 cm), it looks like the distance 
from the man's heel to the nearest edge of the pallet is slightly shorter 
than a full stride.</think>
<answer>The man in the red hat walking is approximately 60.13 centimeters 
from the wooden pallet with boxes.</answer>

This CoT structure teaches the model to reason about spatial relationships systematically—considering reference points like floors and surfaces, comparing object scales, and applying heuristics like typical human dimensions. The README indicates that VLMs trained using VQASynth can ‘apply CoT “thinking” for more robust reasoning and better estimates’. The synthetic dataset generated can then be used to fine-tune VLMs with LoRA adapters, adding spatial reasoning capabilities to the base model.
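The fine-tuning itself goes through standard LoRA tooling, but the core idea is easy to sketch: freeze the pretrained weight matrix W and train only a low-rank pair (A, B) whose scaled product is added on top. A numpy illustration of the usual LoRA formulation (dimensions, rank, and scale are illustrative, not any VQASynth-specific config):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # illustrative sizes, rank, scale

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero init

# Effective weight during/after LoRA training: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

# With B initialized to zero the adapter starts as a no-op, so the
# adapted model initially matches the base model exactly.
print(np.allclose(W_eff, W))  # True

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
print(r * (d_in + d_out), d_in * d_out)  # 1024 vs 4096
```

This is why LoRA suits the use case: spatial reasoning is bolted onto an existing VLM by training a small fraction of the parameters on the synthetic QA pairs.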

The repository has produced several datasets on HuggingFace: SpaceOm, SpaceThinker, OpenSpaces variants, and vqasynth_spacellava. These contain image-question-answer triplets spanning different scene types. The trained models demonstrate practical capabilities: SpaceThinker-Qwen2.5VL-3B is noted as having ‘the most accurate distance estimates’, while SpaceOm is described as ‘the best overall’. According to the README, these models can ‘describe distances colloquially, convert between common units of measurement’, ‘answer queries about the orientation and spatial relationships between objects’, and ‘base responses on consistent references like floors and surfaces’.

The architecture appears to be modular based on the component descriptions. The pipeline seems designed to work with image datasets, allowing you to generate domain-specific training data. If you’re building a warehouse robot, feed it warehouse images. If you’re working on indoor navigation, use home interior datasets. The semantic understanding from Molmo adapts to the content, while the geometric calculations remain consistent.

Gotcha

A key consideration is monocular depth estimation quality. Single-image depth prediction faces inherent challenges, as the README’s description of the 3D scene reconstruction pipeline suggests. While VGGT improves depth estimation according to the documentation, single-view depth models can struggle with unusual viewpoints, reflective surfaces, or unfamiliar textures. These potential depth uncertainties would propagate into the distance calculations that become training labels.
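To see how depth noise propagates into the distance labels, consider a quick simulation (the 5% relative depth error is an assumed noise level for illustration, not a measured property of VGGT):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two object centroids 0.6 m apart laterally, both about 2 m from the camera.
p_a = np.array([0.0, 0.0, 2.0])
p_b = np.array([0.6, 0.0, 2.0])

# Simulate 5% relative depth error on each centroid's z coordinate.
n = 10_000
za = p_a[2] * (1 + 0.05 * rng.standard_normal(n))
zb = p_b[2] * (1 + 0.05 * rng.standard_normal(n))
dists = np.sqrt((p_b[0] - p_a[0]) ** 2 + (zb - za) ** 2)

true_d = np.linalg.norm(p_b - p_a)
print(f"true distance: {true_d:.3f} m")
print(f"mean estimate: {dists.mean():.3f} m, std: {dists.std():.3f} m")
```

Note that independent depth errors can only add to the measured separation along the viewing axis, so the estimated distances are biased upward, not just noisy; training labels inherit that bias.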

The templated VQA generation, while efficient for producing large datasets, may produce somewhat formulaic questions compared to natural language queries. The vocabulary and phrasing in the examples shown (like “Does the red forklift in warehouse appear on the left side of the brown cardboard boxes stacked?”) might not fully reflect how humans naturally ask spatial questions. The example datasets also appear to emphasize yes/no spatial relationship questions and metric distance queries based on the samples provided.

Compute requirements for running the full pipeline—VGGT depth estimation, SAM2 segmentation, and Molmo captioning across many images—will require GPU resources, though specific benchmarks aren’t provided in the README. This isn’t a pipeline you’ll run on a laptop CPU, as it involves multiple heavy vision models in sequence.

Verdict

Use VQASynth if you’re building embodied AI systems that need metric spatial understanding—robots navigating warehouses, AR applications placing virtual objects in real spaces, or assistive technologies describing physical layouts. It’s especially valuable when you have domain-specific imagery where off-the-shelf VLMs fail at spatial queries, and you need to rapidly generate training data without manual annotation. The synthetic approach works well for instruction-tuning existing VLMs with LoRA adapters, adding spatial reasoning capabilities.

Skip it if you only need qualitative spatial reasoning (“near,” “far,” “left of”) that base VLMs may already handle reasonably, if you lack GPU resources for the multi-model pipeline, or if you’re working on safety-critical applications where depth estimation uncertainties could cause issues—in those cases, invest in human annotation or real depth sensors instead. The synthetic data approach is valuable for capability development but should be complemented with real-world validation when stakes are high.
