3DTrajMaster: Choreographing Multi-Entity Motion in Video Generation with 6-DoF Control
Hook
Most AI video generators struggle to keep a character walking in a straight line—3DTrajMaster can orchestrate three separate entities moving through 3D space with independent rotations, all while maintaining visual coherence.
Context
Text-to-video diffusion models have made remarkable progress in generating visually stunning content, but controlling where objects move and how they orient in 3D space remains frustratingly imprecise. You can prompt "a dog running toward the camera" but you can't specify that the dog should start at coordinates (x: -2, y: 0, z: 5), rotate 45 degrees at the midpoint, and end at (x: 0, y: 0, z: 1). This lack of spatial precision makes AI video generation unsuitable for applications requiring choreographed motion—synthetic training data for robotics, pre-visualization for film, or any scenario where you need deterministic rather than probabilistic control.
Existing trajectory-based approaches like DragNUWA provide 2D control through point dragging, but collapse the depth dimension entirely. Camera control tools like CameraCtrl excel at scene-level dynamics but don't address individual object motion. The fundamental challenge is that video diffusion models learn holistic spatiotemporal patterns; disentangling entity-specific motion from scene context while maintaining visual quality requires architectural intervention. KlingAI's 3DTrajMaster addresses this through a plug-and-play injector architecture that fuses 3D trajectory embeddings with entity prompts at multiple points in the diffusion process, enabling full 6-DoF (six degrees of freedom: three translational, three rotational) control over multiple entities without catastrophic interference to the base model's generation capabilities.
Technical Insight
The architecture rests on two ideas: a two-stage training approach and the injector's insertion strategy. Rather than fine-tuning an entire video diffusion model end-to-end, which is computationally prohibitive and prone to overfitting, 3DTrajMaster first trains lightweight LoRA (Low-Rank Adaptation) modules on synthetic trajectory data, then introduces specialized injector modules that perform pair-wise fusion between trajectory embeddings and entity prompt embeddings.
The trajectory representation itself is simple: each entity's motion is encoded as a sequence of 6-DoF poses (x, y, z coordinates plus roll, pitch, yaw rotations), one per video frame. These poses are embedded through learned projection layers into the same dimensionality as the diffusion model's internal representation. The injector architecture uses gated self-attention mechanisms inserted at regular intervals throughout the CogVideoX transformer backbone. The sketch below simplifies this to a single gated cross-attention between entity features and trajectory embeddings. Here's the conceptual flow:
    # Simplified trajectory injection mechanism (conceptual, not the released code)
    import torch.nn as nn

    class TrajectoryInjector(nn.Module):
        def __init__(self, hidden_dim, num_heads=8):
            super().__init__()
            # hidden_dim must be divisible by num_heads.
            # batch_first=True: inputs are (batch, tokens, hidden_dim).
            self.cross_attn = nn.MultiheadAttention(
                hidden_dim, num_heads, batch_first=True
            )
            # Per-channel gate in (0, 1), computed from the entity features
            self.gate = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.Sigmoid()
            )

        def forward(self, entity_features, trajectory_embeddings):
            # Pair-wise attention between entity and trajectory
            attended, _ = self.cross_attn(
                query=entity_features,
                key=trajectory_embeddings,
                value=trajectory_embeddings
            )
            # Gated fusion to preserve base model behavior
            gate_weight = self.gate(entity_features)
            fused = entity_features + gate_weight * attended
            return fused
The gating mechanism is critical—it allows the model to learn how much trajectory information to incorporate at each layer, preventing the injector from overwhelming the base model's learned priors about realistic motion and appearance. During early diffusion steps (high noise), the gate tends to prioritize trajectory guidance; during later refinement steps, it increasingly defers to the base model's visual quality.
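To see why the gate protects the base model, it helps to run the fusion rule on a few numbers. The following is a minimal pure-Python sketch of `fused = entity + sigmoid(gate_logit) * attended`, not the injector itself: the feature values are made up, and the gate logits are passed in directly rather than computed by a learned linear layer.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(entity, attended, gate_logits):
    """Element-wise residual fusion: entity + sigmoid(logit) * attended."""
    return [e + sigmoid(g) * a for e, a, g in zip(entity, attended, gate_logits)]

# Hypothetical values, for illustration only
entity = [0.5, -1.0, 2.0]    # base-model features for one entity token
attended = [1.0, 1.0, -1.0]  # trajectory-conditioned update from attention

closed = gated_fuse(entity, attended, [-20.0] * 3)  # gate ~0: base features pass through
half = gated_fuse(entity, attended, [0.0] * 3)      # gate 0.5: half of the update is added
opened = gated_fuse(entity, attended, [20.0] * 3)   # gate ~1: the full update is added

print(closed, half, opened)
```

With a strongly negative logit the output is numerically indistinguishable from the input features, which is the sense in which a gated residual can leave the base model's behavior intact wherever trajectory information is not needed.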
The training uses a carefully constructed 360°-Motion Dataset that includes edge cases like full 3D occlusion (entities moving behind objects), rotation in place, and complex turning behaviors across 11 distinct camera poses. This diversity is essential because video diffusion models are notoriously sensitive to distribution shift—training only on simple trajectories would fail catastrophically when users request complex choreography.
Inference introduces an annealed sampling strategy that balances trajectory adherence against visual quality:
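    # Annealed LoRA scale during sampling (conceptual sketch).
    # diffusion_timesteps, the two thresholds, latents, the embeddings, and
    # diffusion_step are placeholders that the surrounding pipeline must define.
    for t in diffusion_timesteps:  # ordered from high noise to low noise
        if t > high_noise_threshold:
            lora_scale = 1.0  # Strong trajectory guidance
        elif t > mid_noise_threshold:
            lora_scale = 0.7  # Gradual transition
        else:
            lora_scale = 0.4  # Prioritize visual quality
        latents = diffusion_step(
            latents,
            prompt_embeds,
            trajectory_embeds,
            lora_scale=lora_scale,
        )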
This temporal annealing resembles scheduling the guidance scale in classifier-free guidance, except that it modulates the trajectory-specific LoRA parameters rather than the text conditioning. Early denoising steps establish spatial layout and motion paths; later steps refine texture, lighting, and fine-grained dynamics.
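The three-level schedule above is one way to anneal; a smooth ramp is another. The sketch below is an assumption for illustration, not the released implementation: `annealed_scale` is a hypothetical helper that linearly interpolates the scale from `start` at the first (noisiest) sampling step to `end` at the last, and the endpoint values simply reuse the 1.0 and 0.4 from the schedule above.

```python
def annealed_scale(step: int, num_steps: int,
                   start: float = 1.0, end: float = 0.4) -> float:
    """Linearly interpolate the scale from `start` (step 0) to `end` (last step)."""
    if num_steps < 2:
        return start
    progress = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return start + (end - start) * progress

# Step 0 is the noisiest step; 50 sampling steps is an arbitrary choice here.
schedule = [annealed_scale(i, 50) for i in range(50)]
print(round(schedule[0], 3), round(schedule[25], 3), round(schedule[-1], 3))
```

A ramp avoids the abrupt jumps at the two thresholds, at the cost of one more shape to tune. Which schedule works better is an empirical question that this sketch does not answer.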
The system generalizes beyond humanoid motion to diverse entity types—animals, vehicles, robots, even abstract concepts like fire and breeze. This emerges from the pair-wise fusion design: because trajectories are bound to entity prompts through attention rather than hardcoded entity categories, the model learns trajectory semantics independent of entity type. A "spiral upward" trajectory applies equally to a bird, a drone, or smoke, with entity-specific motion characteristics (wing flapping, rotor blur, particle dispersion) handled by the base diffusion model's priors.
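As a concrete illustration of a trajectory that is independent of entity type, the sketch below builds a "spiral upward" path as one 6-DoF pose per frame and pairs the same path with several entity labels. Everything here is an assumption made for the example: the `spiral_upward` helper, the y-up axis convention, radians for angles, the yaw convention, and the dictionary layout are hypothetical and do not reflect the input format of the released code.

```python
import math

def spiral_upward(num_frames, radius=1.0, height=2.0, turns=1.0):
    """Return one (x, y, z, roll, pitch, yaw) tuple per frame.

    Assumed conventions: y is up, angles are in radians, and yaw simply
    tracks the angle swept around the vertical axis.
    """
    poses = []
    for i in range(num_frames):
        s = i / (num_frames - 1) if num_frames > 1 else 0.0  # progress in [0, 1]
        angle = 2.0 * math.pi * turns * s
        x = radius * math.cos(angle)
        z = radius * math.sin(angle)
        y = height * s  # climb steadily while circling
        poses.append((x, y, z, 0.0, 0.0, angle))
    return poses

trajectory = spiral_upward(num_frames=16)

# The same pose sequence bound to different entities. The short labels stand
# in for full entity descriptions.
pairs = [{"entity": name, "trajectory": trajectory}
         for name in ("bird", "drone", "smoke")]

print(len(trajectory), trajectory[0], trajectory[-1][1])
```

Each pair here represents a separate single-entity generation that shares one path. A multi-entity scene would instead give every entity its own pose sequence.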
Gotcha
The elephant in the room: the publicly released version runs on CogVideoX-5B, while KlingAI's internal implementation uses a proprietary model with significantly better visual quality. This isn't just marketing spin—the authors explicitly acknowledge the quality gap. If you're evaluating 3DTrajMaster based on the open-source checkpoint, you're seeing a proof-of-concept rather than production-grade output. Accessing the internal model requires approval through their application process, which may not be feasible for many research or commercial use cases.
Generalization degrades noticeably with entity count. Single-entity control works reliably, two entities show occasional trajectory drift, and three entities require careful prompt engineering to maintain coherence. The paper's quantitative results confirm this: average trajectory error increases from 0.12 to 0.31 to 0.47 (normalized units) for 1, 2, and 3 entities respectively. The culprit is attention dilution: with more entities, each injector must split its representational capacity across more trajectory-prompt pairs.
There's also a specific prompt formatting requirement: each entity description must be 15-24 words for optimal performance. Shorter prompts underspecify entity characteristics; longer prompts dilute trajectory binding. This constraint feels arbitrary, but it follows from the training data distribution, so relaxing it would require retraining with variable-length entity descriptions.
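Since these limits are easy to violate by accident, a small pre-flight check can catch problems before a long sampling run. The helper below is a hypothetical sketch, not part of the released code: it counts whitespace-separated words, and its defaults (15-24 words per entity, at most three entities) are taken from the limits described in this article.

```python
def check_entity_prompts(prompts, min_words=15, max_words=24, max_entities=3):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not 1 <= len(prompts) <= max_entities:
        problems.append(f"expected 1-{max_entities} entities, got {len(prompts)}")
    for i, prompt in enumerate(prompts):
        n_words = len(prompt.split())  # naive whitespace word count
        if n_words < min_words:
            problems.append(f"entity {i}: {n_words} words, fewer than {min_words}")
        elif n_words > max_words:
            problems.append(f"entity {i}: {n_words} words, more than {max_words}")
    return problems

good = ("a golden retriever with a fluffy coat and a red collar, "
        "running with its tongue hanging out")  # 17 words
print(check_entity_prompts([good]))     # no problems reported
print(check_entity_prompts(["a dog"]))  # flags the short description
```

The model's text encoder works on tokens rather than words, so a word count is only a rough proxy for the length the model actually sees.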
Verdict
Use if: You need deterministic 3D motion control for video generation: choreographed animations, synthetic training data for robotics or autonomous vehicles, pre-visualization for film where camera moves and object trajectories must be precisely coordinated, or research exploring controllable video synthesis. The plug-and-play architecture makes it straightforward to integrate into existing CogVideoX pipelines, and the 6-DoF control unlocks applications impossible with text prompts alone.
Skip if: You need production-ready visual quality without access to KlingAI's internal model, your use case involves more than 3 simultaneous entities, or simple 2D trajectory control suffices (DragNUWA would be simpler). Also skip if you're looking for real-time performance: the injector layers and annealed sampling add computational overhead on top of standard video diffusion.