Zero-1-to-3: Teaching Stable Diffusion to Understand Camera Geometry

Hook

Text-to-3D pipelines built on image models trained on billions of photos still struggle to generate a head without putting a face on both the front and the back. Zero-1-to-3 tackles this by teaching a diffusion model something surprisingly absent from its training: where the camera is.

Context

The promise of turning single images into 3D models has tantalized computer vision researchers for decades. Early approaches required complex multi-view setups or struggled with ambiguity—a photograph of a chair could represent infinite possible 3D geometries. When diffusion models like Stable Diffusion emerged, researchers immediately attempted to leverage them for 3D generation through techniques like Score Distillation Sampling (SDS), introduced by DreamFusion. The idea was elegant: optimize a 3D representation (like a NeRF) by rendering it from multiple angles and using a 2D diffusion model to judge whether those renderings look realistic.

But there was a critical flaw. Text-to-image models trained on internet photos have never learned explicit geometric relationships between viewpoints. When you prompt "a car" and use it to guide 3D optimization, the model doesn't understand that the front of the car and back of the car should be different. The result is the infamous "Janus problem"—3D models that show the same features (like faces or car headlights) from every angle, as if the object exists in a geometric paradox. Zero-1-to-3, developed by Columbia's Computer Vision Lab and presented at ICCV 2023, addresses this fundamental limitation by explicitly conditioning Stable Diffusion on camera viewpoint transformations, trained on synthetic data where ground-truth geometry is known.

Technical Insight

Zero-1-to-3's core change is deceptively simple: it fine-tunes Stable Diffusion's U-Net to take the input view and a relative camera pose as conditioning, in place of a text prompt. The conditioning enters through two paths. First, the source image is encoded into latent space with Stable Diffusion's VAE encoder, and that 4-channel latent is concatenated channel-wise with the 4-channel noisy latent being denoised, giving the U-Net an 8-channel input. Second, the relative camera transformation from the source viewpoint to the target viewpoint (a rotation and a change in distance) is appended to a CLIP embedding of the source image, and the resulting "posed CLIP" embedding is fed to the U-Net through cross-attention, where the text embedding would normally go.
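To make the camera conditioning concrete, here is a minimal sketch of how a relative pose can be turned into the small vector the model consumes. The four-number layout (polar offset in radians, sine and cosine of the azimuth offset, radius offset) and the 768-wide CLIP embedding are assumptions based on the public implementation, and the function name is made up for illustration.

```python
import math


def relative_camera_embedding(d_polar_deg, d_azimuth_deg, d_radius):
    """Encode a relative camera move as four numbers.

    Assumed layout: [polar offset in radians, sin(azimuth offset),
    cos(azimuth offset), radius offset].
    """
    d_azimuth = math.radians(d_azimuth_deg)
    return [
        math.radians(d_polar_deg),
        math.sin(d_azimuth),
        math.cos(d_azimuth),
        d_radius,
    ]


# Orbit 45 degrees to the side, keeping height and distance unchanged.
embedding = relative_camera_embedding(0.0, 45.0, 0.0)

# Encoding azimuth as (sin, cos) makes 0 and 360 degrees map to the same point.
full_turn = relative_camera_embedding(0.0, 360.0, 0.0)
no_turn = relative_camera_embedding(0.0, 0.0, 0.0)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(full_turn, no_turn))

# Shape bookkeeping for the two conditioning paths (plain integers, no tensors).
CLIP_DIM = 768        # assumed width of the CLIP image embedding
LATENT_CHANNELS = 4   # Stable Diffusion VAE latent channels
posed_clip_dim = CLIP_DIM + len(embedding)  # values fed to the projection layer
unet_in_channels = LATENT_CHANNELS * 2      # noisy target latent + source latent
```

The sine/cosine pair is the design choice worth noticing: a raw azimuth angle would make 359 degrees and 1 degree look far apart to the network even though the two views are nearly identical.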

The training process leverages the Objaverse dataset, a collection of over 800,000 3D models with known geometry. For each training sample, the researchers render the same object from multiple viewpoints with known camera poses. The model learns to predict what an object looks like from viewpoint B given an image from viewpoint A and the camera transformation between them. This is fundamentally different from text-conditioned generation—instead of learning correlations between words and visual features, it learns the geometric transformation rules that govern how appearance changes with viewpoint.
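Written as an objective, this is the usual latent-diffusion denoising loss with the conditioning swapped from text to the source view plus the relative pose. The notation is an assumption chosen for this sketch: x_A and x_B are the source and target renders, (R, T) is the relative camera transformation from A to B, E is the VAE encoder, z_t is the target latent noised to timestep t, and c(.) is the conditioning embedding.

```latex
\min_{\theta} \;
\mathbb{E}_{\,z = \mathcal{E}(x_B),\; t,\; \epsilon \sim \mathcal{N}(0, I)}
\left\| \epsilon - \epsilon_{\theta}\!\bigl(z_t,\; t,\; c(x_A, R, T)\bigr) \right\|_2^2
```

The only supervision is the rendered target view itself; the geometry is learned implicitly from how the denoising target changes as (R, T) changes.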

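Here's a simplified sketch of what novel view synthesis looks like in code. Treat it as illustrative: `Zero123Pipeline`, its arguments, and the model identifier are a hypothetical wrapper, not a confirmed interface of the official repository.

import torch
from PIL import Image

# NOTE: hypothetical wrapper used for illustration only;
# check the repository for its actual entry points.
from zero123 import Zero123Pipeline

# Load the model (the demo uses roughly 22 GB of VRAM)
pipeline = Zero123Pipeline.from_pretrained(
    "cvlab-columbia/zero123-xl",  # illustrative model identifier
    torch_dtype=torch.float16,
).to("cuda")

# Load your input image (an object on a clean background works best)
input_image = Image.open("chair.png").convert("RGB")

# Camera transformation relative to the input view:
# orbit 45 degrees in azimuth, keep elevation and distance unchanged
camera_params = {
    "azimuth": 45.0,   # degrees, relative to the input view
    "elevation": 0.0,  # degrees, relative to the input view
    "radius": 0.0,     # change in camera distance
}

# Generate the novel view
output = pipeline(
    input_image=input_image,
    camera_params=camera_params,
    num_inference_steps=75,
    guidance_scale=3.0,
)

novel_view = output.images[0]
novel_view.save("chair_rotated_45deg.png")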

The real power emerges when combining Zero-1-to-3 with 3D reconstruction techniques. Using Score Distillation Sampling (the paper's own reconstruction uses the closely related Score Jacobian Chaining), you can optimize a NeRF representation by rendering it from random viewpoints and letting Zero-1-to-3, conditioned on the reference image and the relative camera pose, score each rendering. Conceptually, the optimization loop looks like this:

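# Pseudocode for SDS-based 3D reconstruction.
# Helpers (sample_random_pose, get_transform, encode, add_noise) and the
# zero123.unet signature are placeholders, not repository functions.
import random
import torch

for iteration in range(num_iterations):
    optimizer.zero_grad()

    # Sample a random camera viewpoint and its pose relative to the reference view
    camera_pose = sample_random_pose()
    camera_transform = get_transform(reference_pose, camera_pose)

    # Render the NeRF from this viewpoint and encode the render into latent space
    rendered_image = nerf.render(camera_pose)
    latents = encode(rendered_image)

    # Noise the latents at a random diffusion timestep
    timestep = random.randrange(num_timesteps)
    noise = torch.randn_like(latents)
    noisy_latents = add_noise(latents, noise, timestep)

    # Ask Zero-1-to-3 to predict the noise, conditioned on the reference image
    # and the relative camera transform; no gradient flows through the U-Net
    with torch.no_grad():
        noise_pred = zero123.unet(
            latent=noisy_latents,
            timestep=timestep,
            image_cond=reference_image,
            camera_cond=camera_transform,
        )

    # SDS: treat (noise_pred - noise) as the gradient on the latents, so the
    # backward pass reaches the NeRF parameters without differentiating the U-Net
    grad = noise_pred - noise
    loss = (grad * latents).sum()
    loss.backward()
    optimizer.step()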
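The update that this loop approximates can be written as the SDS gradient from DreamFusion, with the text prompt replaced by Zero-1-to-3's image-and-pose conditioning. The symbols are assumptions for this sketch: theta are the NeRF parameters, z is the encoded rendering, z_t its noised version, w(t) a timestep weighting, phi the frozen diffusion model, and c(x_ref, R, T) the conditioning built from the reference image and the relative camera transformation.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{\,t,\; \epsilon}
\left[
  w(t)\,
  \bigl( \epsilon_{\phi}\bigl(z_t;\; t,\; c(x_{\mathrm{ref}}, R, T)\bigr) - \epsilon \bigr)\,
  \frac{\partial z}{\partial \theta}
\right]
```

The U-Net's own Jacobian is deliberately left out of the expression, which is why the noise prediction is computed without gradients and only the rendering path is differentiated.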

The repository provides multiple checkpoints saved at different points in training. The 105k-iteration checkpoint is expected to generalize better to out-of-distribution real images, while the 300k checkpoint (roughly 6,000 A100 GPU hours of training) tends to achieve higher fidelity on objects that resemble Objaverse's training distribution but may overfit. This trade-off matters: synthetic training data provides perfect geometric supervision but introduces a domain gap.

One architectural detail worth noting: Zero-1-to-3 is conditioned on relative transformations, not absolute camera positions. This makes it independent of the arbitrary coordinate systems of different input images. You're not telling the model "show me this object from camera position [X, Y, Z]" but rather "show me this object rotated 30 degrees and viewed from slightly above." This relative conditioning is what enables zero-shot generalization: the model never needs to know the absolute scale or orientation of objects in your input images.
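A small sketch shows why relative conditioning sidesteps the coordinate-system question. It assumes poses are given as hypothetical (elevation in degrees, azimuth in degrees, radius) tuples and that the relative move is a simple difference in those spherical coordinates; the helper names are invented for illustration.

```python
def wrap_degrees(angle):
    """Wrap an angle into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_pose(source, target):
    """Offsets (elevation, azimuth, radius) that move the camera from source to target.

    Each pose is a hypothetical (elevation_deg, azimuth_deg, radius) tuple.
    """
    d_elevation = target[0] - source[0]
    d_azimuth = wrap_degrees(target[1] - source[1])
    d_radius = target[2] - source[2]
    return (d_elevation, d_azimuth, d_radius)


# Two captures with different absolute setups...
pair_a = relative_pose((10.0, 0.0, 1.5), (25.0, 30.0, 1.5))
pair_b = relative_pose((40.0, 350.0, 2.0), (55.0, 20.0, 2.0))

# ...describe the same relative move: 15 degrees up, 30 degrees around.
assert pair_a == pair_b == (15.0, 30.0, 0.0)
```

Both pairs produce identical conditioning even though neither the starting azimuth nor the camera distance match, which is the property the paragraph above relies on.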

Gotcha

The GPU memory requirements are substantial. The Zero-1-to-3 XL demo uses around 22GB of VRAM for inference, which in practice limits you to an RTX 3090, 4090, or a professional GPU. Memory-efficient attention mechanisms help somewhat, but they do not change the overall picture. Running on smaller GPUs requires aggressive quantization or model distillation, which can degrade quality noticeably.

The 3D reconstruction pipeline, while impressive, is finicky and hyperparameter-sensitive. The repository's reconstruction code is more proof-of-concept than production-ready. Parameters like guidance scale, the number of SDS steps per NeRF training iteration, and the sampling strategy for camera viewpoints dramatically affect results. Objects with thin structures (like chair legs) or complex topology often fail to reconstruct properly. The authors acknowledge these hyperparameters aren't extensively tuned, and you'll spend significant time experimenting for each object category.

Additionally, Zero-1-to-3 works best for object-centric scenes with clean backgrounds. Feed it a cluttered indoor scene or landscape, and the model struggles to understand what should be preserved versus rotated. It was trained on isolated objects rendered against neutral backgrounds, and that bias shows.

Verdict

Use if: You're building 3D content pipelines where geometric consistency matters more than creative flexibility, you have access to high-end GPU infrastructure (RTX 3090+ or cloud instances), you're working with object-centric imagery (products, furniture, vehicles, characters), or you need to bootstrap 3D reconstructions from single images where traditional multi-view photogrammetry isn't feasible. Zero-1-to-3 is particularly valuable when you need novel view synthesis as an intermediate step: generating training data for other models, creating product visualizations from catalog photos, or augmenting datasets with synthetic viewpoints.

Skip if: You lack adequate GPU resources (this is non-negotiable), you need real-time inference (generation takes 30-60 seconds per view), you're working with scenes rather than objects, or you want end-to-end 3D reconstruction without manual parameter tuning. For pure text-to-3D creative workflows, tools like Shap-E or Point-E offer faster iteration, though typically at lower fidelity. If you have multiple input views available, traditional NeRF/photogrammetry approaches will produce superior geometry. Zero-1-to-3 shines specifically in the single-image, geometry-critical use case that falls between these extremes.
