Choreographing Multiple Entities in 3D Space: Inside 3DTrajMaster’s Video Generation Architecture
Hook
Most AI video generators can move a camera or animate a single subject, but controlling several entities simultaneously—each with independent 3D positions and orientations, while handling occlusions and 180° turns—requires a fundamentally different architecture.
Context
Text-to-video generation has advanced rapidly, but motion control remains primitive. You can prompt “a person walking” and get reasonable results, but try specifying “two people walking in opposite directions while a dog circles them, with precise spatial relationships” and existing models fall apart. The problem isn’t just prompting—it’s that diffusion models lack explicit spatial reasoning.
Traditional approaches like MotionCtrl and DragNUWA offer trajectory controls, but they’re limited to 2D paths or single-entity scenarios. When you need cinematic control—previsualizing a shot with multiple actors, each following specific 3D paths with controlled orientations—you need something that understands 6 degrees of freedom (3D position plus 3D rotation) for multiple entities simultaneously. 3DTrajMaster, accepted to ICLR 2025 from KlingAI Research, tackles this by introducing a plug-and-play injector architecture that fuses trajectory embeddings directly into video diffusion transformers.
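To make "6 degrees of freedom" concrete: each entity gets a per-frame pose combining 3D position and 3D rotation. The sketch below is purely illustrative (the class and field names are assumptions, not the repository's actual data format):

```python
from dataclasses import dataclass

@dataclass
class PoseKeyframe:
    # 3 DoF: position in world coordinates
    x: float
    y: float
    z: float
    # 3 DoF: orientation as Euler angles in radians
    # (a real system might use rotation matrices or quaternions)
    yaw: float
    pitch: float
    roll: float

# one trajectory per entity: a list of poses, one per video frame
trajectories = {
    "dog": [
        PoseKeyframe(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
        PoseKeyframe(0.5, 0.0, 0.1, 0.3, 0.0, 0.0),
    ],
}
```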
Technical Insight
3DTrajMaster’s core innovation is its 3D-motion grounded object injector, which performs pair-wise fusion of trajectory data and entity descriptions before injecting them into CogVideoX-5B’s transformer blocks. The architecture doesn’t retrain the entire diffusion model—instead, it uses a two-stage training strategy that preserves the base model’s visual quality while adding spatial control.
The first stage fine-tunes LoRA (Low-Rank Adaptation) modules on synthetic trajectory data from the 360°-Motion Dataset, which contains diverse entities (humans, animals, robots, even abstract concepts like fire and breeze) across varied backgrounds (cities, forests, glaciers). This teaches the model basic trajectory-following behavior without destroying its generative capabilities. The second stage trains the injector modules specifically—these learn to embed entity-specific motion control while the LoRA modules remain frozen.
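The two-stage schedule boils down to toggling which parameter groups receive gradients. A minimal sketch of that pattern, using stand-in modules rather than the repository's actual classes:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

# hypothetical stand-ins, not the repo's real module names
base_model = nn.Linear(8, 8)   # stands in for CogVideoX-5B
lora = nn.Linear(8, 8)         # stands in for the LoRA adapters
injector = nn.Linear(8, 8)     # stands in for the injector blocks

# Stage 1: base model frozen; LoRA learns trajectory-following
set_trainable(base_model, False)
set_trainable(lora, True)
set_trainable(injector, False)

# Stage 2: LoRA frozen too; only the injector learns
# entity-specific motion control
set_trainable(lora, False)
set_trainable(injector, True)
```

Only the trainable group's parameters would be handed to the optimizer at each stage, which is what preserves the base model's visual quality.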
The injector itself operates at configurable intervals throughout the transformer. By default, it injects motion-aware representations every 2 blocks via gated self-attention. Here’s what the core mechanism looks like from the repository:
# 1. norm & modulate
norm_hidden_states, norm_empty_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(
    hidden_states, empty_encoder_hidden_states, temb
)
bz, N_visual, dim = norm_hidden_states.shape
max_entity_num = 3
_, entity_num, num_frames, _ = pose_embeds.shape

# 2. pair-wise fusion of trajectory and entity
attn_input = self.attn_null_feature.repeat(bz, max_entity_num, 50, num_frames, 1)
pose_embeds = self.pose_fuse_layer(pose_embeds)
attn_input[:, :entity_num, :, :, :] = pose_embeds.unsqueeze(-3) + prompt_entities_embeds.unsqueeze(-2)
attn_input = torch.cat((
    rearrange(norm_hidden_states, "b (n t) d -> b n t d", n=num_frames),
    rearrange(attn_input, "b n t f d -> b f (n t) d"),
), dim=2).flatten(1, 2)

# 3. gated self-attention
attn_hidden_states, attn_encoder_hidden_states = self.attn1_injector(
    hidden_states=attn_input,
    encoder_hidden_states=norm_empty_encoder_hidden_states,
    image_rotary_emb=image_rotary_emb,
)
attn_hidden_states = attn_hidden_states[:, :N_visual, :]
hidden_states = hidden_states + gate_msa * attn_hidden_states
This design is genuinely clever: by separating trajectory learning (LoRA stage) from entity-specific motion control (injector stage), the system maintains flexibility. You can adjust the LoRA scale at inference time (0-1 float) to balance trajectory accuracy against visual quality. Higher scales (0.8+) produce more accurate paths but can degrade aesthetics; lower scales (0.4-0.6) look better but drift from specified trajectories.
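The LoRA scale works the way LoRA scaling generally does: the adapter's output is added to the base layer's output, weighted by the scale. A minimal sketch of that blend (the function name is illustrative, not the repository's API):

```python
import torch

def lora_forward(base_out: torch.Tensor, lora_out: torch.Tensor,
                 lora_scale: float) -> torch.Tensor:
    """Blend a frozen base layer's output with its LoRA adapter's
    output. 0.0 disables the adapter; 1.0 applies it at full strength."""
    return base_out + lora_scale * lora_out

base = torch.ones(2, 4)
delta = torch.full((2, 4), 0.5)
out = lora_forward(base, delta, lora_scale=0.6)  # 1.0 + 0.6 * 0.5
```

This is why the scale trades trajectory accuracy against visual quality: it directly controls how much the trajectory-tuned weights perturb the base model's activations.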
The inference process also introduces annealed sampling—a configurable step count (0-50, default 20) that gradually reduces trajectory guidance strength. This prevents the rigid, robotic motion you’d get from constant heavy guidance. The repository’s inference script shows this in action:
python 3dtrajmaster_inference.py \
--model_path ../weights/cogvideox-5b \
--ckpt_path ../weights/injector \
--lora_path ../weights/lora \
--lora_scale 0.6 \
--annealed_sample_step 20 \
--seed 24 \
--output_path output_example
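One plausible way to read `--annealed_sample_step 20` is as a guidance-weight schedule over the denoising steps. The sketch below assumes a simple linear decay; the repository's exact schedule may differ:

```python
def trajectory_guidance_weight(step: int, annealed_sample_step: int) -> float:
    """Assumed annealing schedule: full trajectory guidance at step 0,
    linearly decayed to zero by `annealed_sample_step`, then off.
    This is a sketch of the idea, not the repo's implementation."""
    if annealed_sample_step <= 0 or step >= annealed_sample_step:
        return 0.0
    return 1.0 - step / annealed_sample_step

# with the default of 20, guidance fades out over the first 20
# of (typically) 50 denoising steps
weights = [trajectory_guidance_weight(s, 20) for s in range(50)]
```

Tapering the guidance this way lets the early steps lock in the coarse motion while the later steps refine appearance unconstrained, which is what avoids the rigid, robotic look.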
The entity prompt constraints are particularly interesting: descriptions must be 15-24 words (roughly 24-40 tokens once tokenized for the T5 text encoder). This isn’t arbitrary—it’s the sweet spot where the injector can reliably fuse entity semantics with trajectory data. Too short and you lose entity identity; too long and the fusion becomes unstable. The repository actually recommends using GPT to expand short prompts to the appropriate length.
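A quick word-count check is enough to catch out-of-range entity descriptions before a run (this helper is my own, not part of the repository):

```python
def check_entity_prompt(prompt: str,
                        min_words: int = 15, max_words: int = 24) -> bool:
    """Return True if the entity description falls inside the
    recommended 15-24 word window (~24-40 T5 tokens)."""
    return min_words <= len(prompt.split()) <= max_words

check_entity_prompt("A dog")  # too short
check_entity_prompt(
    "A golden retriever with fluffy fur, wearing a red collar, "
    "running energetically through an open field."
)  # within the window
```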
What makes this architecture production-relevant is its plug-and-play nature. The injector modules insert at configurable intervals, meaning you can trade computational cost for control granularity. Injecting every block gives maximum precision but doubles training time; increasing the block interval runs faster but with coarser motion control. The default every-2-blocks strikes a practical balance for most use cases.
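The cost/granularity trade-off reduces to which transformer blocks get an injector attached. A sketch of that selection logic, with illustrative names:

```python
def injection_schedule(num_blocks: int, interval: int) -> list[int]:
    """Indices of transformer blocks that receive a motion injector,
    firing every `interval`-th block (illustrative, not the repo's code)."""
    return [i for i in range(num_blocks) if i % interval == 0]

injection_schedule(8, 2)  # default-style: every other block
injection_schedule(8, 1)  # maximum precision: every block
injection_schedule(8, 4)  # cheaper but coarser control
```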
Gotcha
The repository is transparent about significant limitations. First, the publicly released model is based on CogVideoX-5B, not KlingAI’s proprietary internal model. The comparison videos in the README show a quality gap between the two versions. The company policy preventing public release of their internal model is frustrating for production use cases.
Generalization degrades sharply with entity count. The repository explicitly states robustness follows “1 entity > 2 entities > 3 entities,” with a hard maximum of 3 entities. If your use case requires controlling four or more objects simultaneously, this won’t work. The architecture doesn’t scale beyond three entity-trajectory pairs without degradation.
The prompt engineering constraints are also restrictive. That 15-24 word requirement for entity descriptions (approximately 24-40 tokens) means you can’t just use natural prompts. “A dog” won’t work reliably—you need something like “A golden retriever with fluffy fur, wearing a red collar, running energetically through an open field.” This verbosity requirement adds friction to iteration workflows and makes the system less intuitive than simply typing what you want to see.
Verdict
Use 3DTrajMaster if you’re building research prototypes for trajectory-controllable video generation, especially for multi-entity scenarios like animation previsualization or virtual cinematography where precise 6-DoF control justifies setup complexity. It’s genuinely state-of-the-art for this specific problem, and the CogVideoX-5B version is accessible enough for academic experimentation. The plug-and-play injector design also makes it valuable for understanding how to extend diffusion models with spatial controls without full retraining. Skip it if you need production-ready quality (the internal model isn’t available), require more than 3 simultaneous entities, or want simple 2D motion controls where the 6-DoF overhead isn’t justified. Also skip if you need immediate deployment without custom infrastructure—this is research code that requires environment setup identical to CogVideoX and careful prompt engineering.