3DTrajMaster: Six Degrees of Freedom for Multi-Entity Video Generation
Hook
What if you could choreograph not just where objects move in generated videos, but also how they rotate, occlude each other, and turn 180 degrees—all in true 3D space? That’s the promise of 6-DoF video control.
Context
Text-to-video models have made remarkable progress, but they treat motion as a black box. You describe what you want—‘a robot walking through a forest’—and hope the model interprets movement correctly. But what if you need precise control? What if you’re generating training data for robotics, prototyping animation sequences, or creating educational content where spatial accuracy matters?
3DTrajMaster, accepted to ICLR 2025 by researchers at CUHK, Kuaishou Technology, and Zhejiang University, tackles this gap. Unlike existing trajectory control methods that focus on camera movement or simple 2D paths, 3DTrajMaster provides full six-degree-of-freedom (6-DoF) control over entity motion—three for position (x, y, z) and three for orientation (roll, pitch, yaw). According to the project documentation, you can make a car drive forward while rotating 90 degrees, have a human walk behind a tree and emerge on the other side, or choreograph three entities moving independently through complex 3D paths. Built as a plug-and-play extension to existing text-to-video models, it aims to bring spatial precision to generative video.
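A 6-DoF trajectory is simply a timed sequence of poses. As an illustration of the concept (not the project's actual data format, which is defined in the repo), here is a minimal sketch of the car-driving-while-rotating example:

```python
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """One 6-DoF sample: position (x, y, z) plus orientation (roll, pitch, yaw) in degrees."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

def lerp_pose(a: Pose6DoF, b: Pose6DoF, t: float) -> Pose6DoF:
    """Linearly interpolate between two poses (naive sketch; production
    systems interpolate rotations on SO(3) rather than per-angle)."""
    mix = lambda u, v: u + (v - u) * t
    return Pose6DoF(mix(a.x, b.x), mix(a.y, b.y), mix(a.z, b.z),
                    mix(a.roll, b.roll), mix(a.pitch, b.pitch), mix(a.yaw, b.yaw))

# A car driving forward along +x while yawing 90 degrees over the clip:
start = Pose6DoF(0, 0, 0, 0, 0, 0)
end = Pose6DoF(4, 0, 0, 0, 0, 90)
trajectory = [lerp_pose(start, end, i / 15) for i in range(16)]  # 16 frames
```

Each frame's pose is six numbers; the model consumes an embedding of this sequence rather than the raw values.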
Technical Insight
The architecture centers on what the team calls a ‘3D-motion grounded object injector’—a module that fuses trajectory embeddings with entity text prompts at strategic points in the video generation pipeline. Built on CogVideoX-5B (with a proprietary internal version also available), the system uses a two-stage training approach that separates trajectory learning from entity-trajectory binding.
The first stage fine-tunes LoRA adapters on synthetic trajectory data from the newly introduced 360°-Motion Dataset, which contains camera poses and trajectory annotations across 11 viewpoints. This teaches the base model to understand 3D motion without breaking its existing capabilities. The second stage trains the injector modules themselves—these perform pair-wise fusion of pose embeddings (representing 6-DoF trajectories) with entity text embeddings through gated self-attention mechanisms. The injectors are inserted at configurable intervals in the transformer backbone; per the README, users can set --block_interval to 2, or increase it for a lighter model.
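The README does not spell out the injector math, but gated self-attention fusion (the pattern popularized by GLIGEN-style grounding modules) can be sketched as below; the shapes, identity projections, and zero-initialized gate are illustrative assumptions, not the repo's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    """Toy single-head self-attention with identity Q/K/V projections for brevity."""
    q = k = v = tokens
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

def gated_injection(video_tokens, pose_text_tokens, gate):
    """Fuse entity pose+text tokens into video tokens via gated self-attention.
    With gate == 0 (a typical zero-init), the module is an identity map, so
    the pretrained backbone's behavior is preserved when training begins."""
    d = video_tokens.shape[-1]
    joint = np.concatenate([video_tokens, pose_text_tokens], axis=0)
    fused = self_attention(joint, d)[: len(video_tokens)]
    return video_tokens + np.tanh(gate) * fused

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))      # 8 video tokens, dim 16
grounding = rng.normal(size=(2, 16))  # fused pose + entity-text tokens
out = gated_injection(video, grounding, gate=0.0)  # identity at gate = 0
```

The gating is what makes the design plug-and-play: at initialization the injector contributes nothing, and training gradually opens the gate.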
Here’s how inference works with the open-source implementation:
python 3dtrajmaster_inference.py \
--model_path ../weights/cogvideox-5b \
--ckpt_path ../weights/injector \
--lora_path ../weights/lora \
--lora_scale 0.6 \
--annealed_sample_step 20 \
--seed 24 \
--output_path output_example
The lora_scale parameter (0.0-1.0) controls how strongly trajectory constraints influence generation, while annealed_sample_step (0-50) controls an annealed sampling strategy at inference time: the README notes that ‘a higher LoRA scale and more annealed steps can improve accuracy in prompt generation but may result in lower visual quality.’ This is the central tradeoff in controllable generation: strict constraint adherence versus visual fidelity.
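Mechanically, a LoRA scale is just a multiplier on the low-rank update added to each adapted weight: output = W·x + scale · (B·A·x). A minimal numpy sketch (the real CogVideoX/LoRA plumbing lives in the repo; all names and shapes here are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, scale):
    """Frozen base projection plus a scaled low-rank adaptation.
    scale = 0.0 recovers the unmodified base model; higher values
    enforce the trajectory LoRA more strongly."""
    return x @ W.T + scale * (x @ A.T @ B.T)

rng = np.random.default_rng(24)
x = rng.normal(size=(4, 32))           # 4 tokens, dim 32
W = rng.normal(size=(32, 32))          # frozen base weight
A = rng.normal(size=(8, 32)) * 0.01    # low-rank down-projection (rank 8)
B = rng.normal(size=(32, 8)) * 0.01    # low-rank up-projection

base = lora_forward(x, W, A, B, scale=0.0)   # pure base model
tuned = lora_forward(x, W, A, B, scale=0.6)  # trajectory LoRA at 0.6
```

This is why lowering lora_scale recovers more of the base model's visual quality at the cost of weaker trajectory adherence.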
The modular design is intentional. By injecting trajectory control at specific transformer blocks rather than rebuilding the entire architecture, 3DTrajMaster stays compatible with future base-model improvements. Injectors can be inserted at different intervals depending on computational budget, though the team cautions that changing the interval ‘will require a longer training time.’
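Placing injectors at a fixed block interval reduces to simple index arithmetic over the transformer's block list. The --block_interval name mirrors the repo's flag, but the block count below is illustrative:

```python
def place_injectors(num_blocks: int, block_interval: int) -> list[int]:
    """Return the transformer block indices that receive an injector.
    A smaller interval means more injectors (heavier, finer-grained
    control); a larger interval yields a lighter model."""
    return [i for i in range(num_blocks) if i % block_interval == 0]

# Hypothetical 42-block backbone:
dense = place_injectors(42, block_interval=2)   # injector every other block
light = place_injectors(42, block_interval=6)   # lighter variant
```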
Entity prompts require careful construction—the system expects 15-24 words (approximately 24-40 tokens after T5 encoding) per entity. The README explicitly suggests using GPT to expand short descriptions: ‘Generate a detailed description of approximately 20 words.’ This is a practical constraint that users need to work within.
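A quick sanity check on entity prompts before submitting a generation job; the 15-24 word band comes from the README, while the helper itself is just a convenience sketch:

```python
def entity_prompt_ok(prompt: str, lo: int = 15, hi: int = 24) -> bool:
    """Check that an entity prompt falls in the documented 15-24 word range."""
    return lo <= len(prompt.split()) <= hi

short = "a red car"
expanded = ("a glossy red vintage sports car with chrome trim and round "
            "headlights cruising steadily along a winding coastal road at sunset")

assert not entity_prompt_ok(short)  # too short: expand with GPT first
assert entity_prompt_ok(expanded)   # 21 words, within range
```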
The 360°-Motion Dataset itself is available on Hugging Face. The README notes that for both training stages, they use ‘only 11 camera poses and exclude the last camera pose as the novel pose setting’—a design choice for evaluating generalization to unseen viewpoints.
Gotcha
The best version of 3DTrajMaster isn’t publicly available. Due to company policy, Kuaishou hasn’t released their proprietary internal model. The README includes a comparison video (added 2025/01/15) showing differences between the CogVideoX-5B version and the internal model. To access the internal model, you need to submit requests via a Google Sheet or email, providing entity prompts and trajectory descriptions, then wait for the team to generate videos for you. This isn’t a tool you can iterate with freely—it’s a request-based service.
The open-source version has documented limitations. The README states that ‘Generalizable Robustness’ follows the pattern ‘prompt entity number: 1>2>3’, indicating that performance decreases as entity count increases. Entity prompt length must stay within the 15-24 word range specified in the documentation. The training process requires synthetic trajectory data and two stages of fine-tuning, making it resource-intensive to replicate or adapt.
The system’s capabilities are bounded by its training data. While the README lists diverse entities (humans, animals, robots, cars, abstract concepts like fire and breeze) and backgrounds (city, forest, desert, gym, sunset beach, glacier, hall, night city), these represent what the model learned from the 360°-Motion Dataset. The trajectory templates provide structure, but truly custom 3D paths require either matching existing patterns or describing new ones when requesting generation from the internal model.
Verdict
Use 3DTrajMaster if you’re a researcher exploring controllable video generation, need to prototype complex multi-entity spatial interactions, or require 6-DoF motion control for specialized applications such as robotics visualization or animation previsualization. The architectural ideas, particularly the plug-and-play injector design and the annealed sampling strategy, are worth studying if you’re building controllable generative systems. And if you can secure access to the internal model through the request process, it provides 6-DoF trajectory control that existing text-to-video models don’t offer. Skip it if you need immediate production access without going through request channels, plan to control more than three entities simultaneously (given the documented 1>2>3 performance ordering), or don’t actually need explicit 3D trajectory specification; simpler camera-control methods may suffice for less spatially demanding use cases. The 15-24 word entity-prompt requirement and the degraded multi-entity performance make this a specialist tool rather than a general-purpose video generator, and the gap between the request-based internal model and the open-source CogVideoX-5B implementation should factor into any workflow decision.