
Dora: The 3D Shape VAE That Lets You Choose Your Compression Ratio at Inference Time


Hook

What if you could decide how much to compress your 3D shapes after training, not before? Dora-VAE lets you use 1,000 tokens or 100,000+ at inference time, even if those lengths were never seen during training.

Context

Training diffusion models for 3D generation has a dirty secret: the latent space bottleneck. While variational autoencoders (VAEs) have become standard for compressing 3D shapes before feeding them to diffusion models, existing approaches force you into an uncomfortable trade-off. Volume-based VAEs like XCube-VAE produce excellent reconstructions but require massive latent spaces—so large that you can only fit 2 samples per GPU during diffusion training. This isn’t just an inconvenience; it’s a fundamental constraint that slows convergence and drives up computational costs.

Dora, developed by researchers at HKUST and ByteDance Seed and accepted at CVPR 2025, takes a different approach. Instead of encoding 3D shapes into fixed-size volumetric grids, it uses point query-based encoding. The key innovation isn’t just that the latent space is more compact than XCube-VAE’s (which requires an average of 64,821 dimensions)—it’s that you can choose your compression ratio at inference time. Want better reconstruction? Use more tokens. Need faster downstream training? Use fewer. The decoder exhibits inference-time scalability, a property that the authors suggest volume-based VAEs lack.

Technical Insight

System architecture (summarized from the auto-generated diagram): 3D Input Shape → Point Sampler (32K uniform + 32K edge points) → Vecset Encoder → Latent Space (variable-length latent codes: lengths of 256–4,096 tokens during training, 256–100K+ tokens at inference, with reconstruction quality improving at any length) → Vecset Decoder (point queries) → TSDF/Occupancy Field → marching cubes → Output Mesh.

Dora’s architecture differs from traditional 3D VAEs in several key ways. First, it processes 3D shapes through point queries rather than volumetric grids. During encoding, it samples points from the input shape using a hybrid strategy: 32,768 uniformly distributed points plus 32,768 points concentrated near salient edges (as shown in the v1.1 specifications). This edge-aware sampling appears designed to help the network focus on geometric features that matter for shape identity.
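The hybrid sampling described above can be sketched as a mix of uniform draws and saliency-weighted draws. This is a minimal illustration, not the repository's code: the `edge_score` saliency (e.g. derived from dihedral angles across mesh edges) and the function signature are assumptions.

```python
import numpy as np

def hybrid_sample(points, edge_score, n_uniform=32768, n_edge=32768, seed=0):
    """Hybrid surface sampling: uniform points plus edge-concentrated points.

    Illustrative sketch only; `edge_score` is an assumed per-point
    saliency (higher near salient edges), not the repo's actual API.
    """
    rng = np.random.default_rng(seed)
    # Uniform half: every surface point equally likely
    uniform_idx = rng.choice(len(points), size=n_uniform, replace=True)
    # Edge half: draw proportionally to edge saliency
    p = edge_score / edge_score.sum()
    edge_idx = rng.choice(len(points), size=n_edge, replace=True, p=p)
    return np.concatenate([points[uniform_idx], points[edge_idx]], axis=0)

# Toy usage: 1,000 surface points, higher saliency away from x = 0
pts = np.random.default_rng(1).normal(size=(1000, 3))
saliency = np.abs(pts[:, 0]) + 1e-6
samples = hybrid_sample(pts, saliency, n_uniform=32, n_edge=32)
```

In the real pipeline both halves would be 32,768 points each; the small counts here just keep the example cheap.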

Second, the training strategy uses progressive token length scaling. Rather than training with a fixed latent code length, Dora trains with variable lengths drawn from a probability distribution. Version 1.1, for example, samples token lengths of [256, 512, 768, 1024, 1280, 2048, 4096] with probabilities [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.2]. This teaches the decoder to work with different compression ratios during training, but the key property emerges at inference: you can use any token length you want, even 10,000 or 100,000+ tokens that were never seen during training. The README states that reconstruction quality improves with token count.
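The per-step length sampling is straightforward to reproduce. The lengths and probabilities below are the v1.1 numbers quoted above; drawing one length per training step is an assumption about the scheduling details.

```python
import random

# Token lengths and sampling probabilities reported for Dora-VAE 1.1
lengths = [256, 512, 768, 1024, 1280, 2048, 4096]
probs   = [0.1, 0.1, 0.1, 0.1,  0.1,  0.3,  0.2]

def sample_token_length(rng=random):
    # Draw one latent length per training step (scheduling detail assumed)
    return rng.choices(lengths, weights=probs, k=1)[0]
```

Note the distribution is biased toward the longer lengths (0.3 and 0.2 on 2,048 and 4,096), so the decoder still spends most of its training budget near full capacity.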

Third, the decoder reconstructs TSDF (Truncated Signed Distance Function) or occupancy fields rather than direct geometry. Here’s what the inference pipeline conceptually looks like (note: this is a conceptual flow, not actual code from the repository):

```python
# Conceptual inference flow (not from repo code)

# 1. Encode shape to compact latent codes
latent_codes = encoder(point_queries, point_features)  # Shape: [N, latent_dim]
# N can be 256, 4096, or even 100000+

# 2. Decode to TSDF/occupancy field by querying arbitrary 3D points
query_points = sample_3d_grid(resolution=256)          # [256^3, 3]
tsdf_values = decoder(latent_codes, query_points)      # latent codes are unordered

# 3. Convert TSDF to mesh using marching cubes (zero level set)
mesh = marching_cubes(tsdf_values, threshold=0.0)
```

The unordered nature of the latent codes appears to be critical. Because point queries don’t have inherent spatial ordering, the architecture avoids positional encodings. The authors explicitly warn in their training tips: “avoid adding positional encoding to the latent space as it harms convergence.” This is counterintuitive for anyone familiar with transformers, but the rationale seems to be that positional encodings would introduce spurious spatial biases.
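The consequence of skipping positional encodings is easy to demonstrate: if query points attend to the latent set without any position information, the output cannot depend on token order. The toy single-head cross-attention below is a stand-in for the vecset decoder, not its actual architecture.

```python
import numpy as np

def decode(query_points, latents):
    """Toy single-head cross-attention from queries to latent tokens.

    No positional encoding is added to `latents`, so the result is
    invariant to their ordering (a sketch, not the repo's decoder).
    """
    scores = query_points @ latents.T / np.sqrt(latents.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ latents

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))   # 5 query points, feature dim 16
latents = rng.normal(size=(64, 16))  # 64 unordered latent tokens
shuffled = latents[rng.permutation(64)]
# Shuffling the latent tokens leaves the decoded values unchanged
assert np.allclose(decode(queries, latents), decode(queries, shuffled))
```

Adding a positional encoding would break exactly this invariance, which is one way to read the authors' warning about harmed convergence.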

The data preprocessing pipeline reveals another architectural decision with practical implications. To convert non-watertight meshes (common in real-world 3D data) to watertight meshes, Dora expands the surface by an epsilon parameter. Version 1.1 uses eps=2/256, while version 1.2 uses the finer eps=2/512. Smaller epsilon values create thinner structures that hug the original geometry more closely, but they’re also harder to reconstruct. As the README notes: “It is more challenging for the network to learn these thinner structures. Dora-VAE 1.1… can generalize well. However, when inferring with thinner structures, such as eps = 2/512, the reconstructed surface may have holes.”
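The eps trade-off can be pictured with a one-dimensional slice: dilating the surface by eps and taking everything within that distance yields a closed shell whose thickness scales with eps. This unsigned-distance formulation is a simplification of the repository's watertight-conversion pipeline, used here only to show the effect of halving eps.

```python
import numpy as np

# Epsilon values from the two preprocessing versions
eps_v11 = 2 / 256   # Dora-VAE 1.1
eps_v12 = 2 / 512   # Dora-VAE 1.2: thinner shell, closer to the surface

def shell_occupancy(udf, eps):
    # Occupied wherever the unsigned distance to the surface is below eps
    return udf < eps

# 1D slice through a surface at x = 0; udf is the distance to it
x = np.linspace(-1, 1, 1001)
udf = np.abs(x)
thick_v11 = shell_occupancy(udf, eps_v11).sum()  # wider shell
thick_v12 = shell_occupancy(udf, eps_v12).sum()  # roughly half as wide
```

Halving eps roughly halves the shell thickness, which is precisely why the v1.2 structures hug the geometry better but are harder for the network to reconstruct without holes.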

For downstream diffusion training, Dora enables batch sizes of 128 on the same GPU that can only handle 2 samples with XCube-VAE. The authors emphasize progressive training: start with 256-token latents to warm up the diffusion model, then gradually increase token length and model size. They also recommend bf16 mixed precision over fp16 for stability.
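A progressive schedule of the kind the authors describe might look like the sketch below. The stage boundaries and step counts are illustrative assumptions, not values from the repository; only the ideas (start at 256 tokens, grow length and model size, train in bf16) come from the README.

```python
# Hypothetical progressive schedule for downstream diffusion training.
# Stage lengths/step counts are illustrative, not from the repo.
schedule = [
    {"tokens": 256,  "steps": 50_000},   # warm-up with the smallest latents
    {"tokens": 1024, "steps": 100_000},  # grow token length (and model size)
    {"tokens": 4096, "steps": 200_000},  # final stage at full capacity
]

for stage in schedule:
    # train(diffusion_model, latent_len=stage["tokens"],
    #       steps=stage["steps"], dtype="bfloat16")  # bf16 over fp16
    pass
```

Because the VAE decoder accepts any token length, the diffusion model can be warmed up cheaply at 256 tokens without retraining the VAE between stages.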

Dora-bench, the accompanying benchmark, introduces standardized evaluation protocols with salient edge-aware sampling. The approach concentrates evaluation points near geometric edges where errors are most visible.
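One way to realize such a protocol is a nearest-neighbor error weighted by edge saliency, so mistakes near sharp features count more. This sketch conveys the idea only; the exact metric and weighting in Dora-bench are not specified here and are assumptions.

```python
import numpy as np

def edge_aware_error(pred_pts, gt_pts, gt_edge_score):
    """Saliency-weighted one-sided nearest-neighbor error.

    Sketch of an edge-aware evaluation; the actual Dora-bench metric
    and weighting scheme are assumptions, not the benchmark's spec.
    """
    # Distance from each ground-truth sample to its nearest predicted point
    d = np.linalg.norm(gt_pts[:, None, :] - pred_pts[None, :, :], axis=-1).min(axis=1)
    # Normalize saliency into weights that emphasize edge regions
    w = gt_edge_score / gt_edge_score.sum()
    return float((w * d).sum())

# Sanity check: a perfect reconstruction scores zero
pts = np.random.default_rng(2).normal(size=(50, 3))
score = np.abs(pts[:, 0]) + 1e-6
perfect = edge_aware_error(pts, pts, score)
```

A uniform-weight Chamfer distance would treat a dent in a flat region and a rounded-off edge as equally bad; the edge weighting penalizes the latter more.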

Gotcha

Dora’s preprocessing pipeline requires careful attention. The watertight mesh conversion and epsilon parameter tuning aren’t optional niceties—they directly affect what the network can learn. If you have thin structures like wireframes or intricate lattices, you’ll need to train with eps=2/512 (version 1.2) rather than the default eps=2/256. The version 1.2 weights aren’t released yet (marked as TODO in the README), which means you’ll either need to wait or retrain from scratch.

The repository also contains several unfulfilled promises. Dora-bench at 512 resolution isn’t available yet, and version 1.2 model weights are marked as “to do.” The authors mention failed experiments with normal map supervision in FAQ Q2, suggesting the architecture may have limitations in what auxiliary signals it can exploit. If your use case depends on learning from surface normals or other geometric cues beyond TSDF/occupancy, you may hit architectural constraints. The inference-time scalability is impressive, but you still need to decide on a compression ratio for downstream training, and that decision has consequences for both reconstruction quality and diffusion model convergence speed.

Verdict

Use if:

- You’re training diffusion models for 3D generation and need a compact latent space that won’t bottleneck your GPU memory.
- You want the flexibility to trade reconstruction quality for training efficiency by adjusting token counts at inference time.
- You’re willing to work with research-grade code and handle preprocessing quirks like epsilon tuning.
- You’re evaluating 3D VAE approaches and want a benchmark with edge-aware evaluation.

Skip if:

- You need production-ready code with comprehensive documentation and stable APIs.
- You’re working with extremely thin geometric structures and can’t wait for version 1.2 or retrain from scratch.
- You require traditional volume-based representations for downstream tasks.
- You need auxiliary supervision signals like normal maps that the architecture doesn’t support well.
