4D Gaussian Splatting: How Hexplane Factorization Makes Real-Time Dynamic Scene Rendering Possible
Hook
Training a dynamic neural radiance field used to take 48 hours. 4DGaussians does it in 8 minutes while rendering at 82 FPS—but only if you can capture perfectly synchronized multi-view video.
Context
Neural radiance fields revolutionized 3D reconstruction, but extending them to dynamic scenes hit a fundamental bottleneck: a dense 4D spacetime grid multiplies the memory of a 3D spatial grid by the number of time steps. A 512×512×512×100 grid for a 100-frame sequence has about 13.4 billion cells, so it needs roughly 13GB for occupancy alone at one byte per cell, before storing any actual features. D-NeRF and HyperNeRF sidestepped this with implicit MLPs that compress temporal information into network weights, but those methods require hours of training and seconds per frame for rendering.
3D Gaussian Splatting changed the game for static scenes by replacing volumetric rendering with explicit 3D Gaussians rasterized via differentiable splatting. Real-time rendering suddenly became trivial—but the original formulation had no temporal component. 4DGaussians from HUST Vision Lab extends this to dynamic scenes using hexplane factorization, a clever dimensional reduction technique borrowed from K-Planes. The result: training that completes during a coffee break and rendering that maintains 60+ FPS, making dynamic radiance fields practical for interactive applications like VR telepresence and live event capture.
Technical Insight
The core innovation is representing deformation as a function of features sampled from six 2D planes rather than a full 4D grid. For a dynamic scene spanning spatial dimensions (X, Y, Z) and a temporal dimension T, the system does not store features in a dense 4D tensor. It maintains one feature plane for each pair of axes: three purely spatial (XY, XZ, YZ) and three spatio-temporal (XT, YT, ZT). When computing deformation for a Gaussian at position (x, y, z) and time t, the network samples features from all six planes via bilinear interpolation, then combines them through learned fusion.
The architecture follows a two-stage pipeline. First, you initialize static 3D Gaussians from COLMAP point clouds exactly like the original 3DGS method. Each Gaussian has position μ, covariance Σ (parameterized as rotation and scale), opacity α, and spherical harmonic coefficients for view-dependent color. Then you freeze these base parameters and introduce the deformation network—a small MLP that takes concatenated hexplane features and outputs per-Gaussian deltas:
# Simplified deformation sampling (bilinear_sample, deformation_mlp and
# compose_rotations are helpers assumed to be defined elsewhere)
import torch

def query_deformation(position, timestamp, hexplanes):
    x, y, z = position[0], position[1], position[2]
    # Sample the three purely spatial planes
    xy_feat = bilinear_sample(hexplanes['xy'], x, y)
    xz_feat = bilinear_sample(hexplanes['xz'], x, z)
    yz_feat = bilinear_sample(hexplanes['yz'], y, z)
    # Sample the three space-time planes
    xt_feat = bilinear_sample(hexplanes['xt'], x, timestamp)
    yt_feat = bilinear_sample(hexplanes['yt'], y, timestamp)
    zt_feat = bilinear_sample(hexplanes['zt'], z, timestamp)
    # Concatenate and fuse
    combined = torch.cat([xy_feat, xz_feat, yz_feat,
                          xt_feat, yt_feat, zt_feat], dim=-1)
    # Small MLP predicts deformation
    delta_pos, delta_rot, delta_scale = deformation_mlp(combined)
    return delta_pos, delta_rot, delta_scale

# Apply to base Gaussians
delta_pos, delta_rot, delta_scale = query_deformation(
    base_position, timestamp, hexplanes)
deformed_pos = base_position + delta_pos
deformed_rot = compose_rotations(base_rotation, delta_rot)
deformed_scale = base_scale * delta_scale
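The snippet above calls a bilinear_sample helper without defining it. As a rough illustration of what a lookup on a single 2D feature plane involves, here is a minimal pure-Python sketch. The nested-list plane layout and the normalized [0, 1] coordinates are assumptions made for this example; a real implementation would more likely run a batched GPU operation such as torch.nn.functional.grid_sample.

```python
import math

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a feature vector from one 2D plane.

    plane: nested list indexed as plane[row][col][channel] (assumed layout).
    u, v:  column and row coordinates, normalized to [0, 1].
    """
    rows, cols = len(plane), len(plane[0])
    # Clamp, then map normalized coordinates to continuous grid indices
    gx = min(max(u, 0.0), 1.0) * (cols - 1)
    gy = min(max(v, 0.0), 1.0) * (rows - 1)
    x0, y0 = int(math.floor(gx)), int(math.floor(gy))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    wx, wy = gx - x0, gy - y0
    channels = len(plane[0][0])
    # Weighted sum of the four surrounding cells
    return [
        (1 - wx) * (1 - wy) * plane[y0][x0][c]
        + wx * (1 - wy) * plane[y0][x1][c]
        + (1 - wx) * wy * plane[y1][x0][c]
        + wx * wy * plane[y1][x1][c]
        for c in range(channels)
    ]

# A 2x2 plane with one feature channel per cell
plane = [[[0.0], [1.0]],
         [[2.0], [3.0]]]
center = bilinear_sample(plane, 0.5, 0.5)
print(center)
```

Sampling at the center of the 2×2 plane averages all four cells, while sampling exactly on a grid corner returns that cell's feature unchanged.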
The memory savings are dramatic. A dense 256×256×256×100 grid has about 1.7 billion cells; with 32-dim float32 features, that is roughly 215GB. The equivalent hexplane representation stores only (256×256 + 256×256 + 256×256)×32 values for the three spatial planes plus (256×100 + 256×100 + 256×100)×32 for the three space-time planes, about 8.7 million values or 35MB in float32: a reduction of roughly 6,000×. The factorization forces features to share information across dimensions, which acts as implicit regularization preventing overfitting to training views.
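The comparison is easy to check with a few lines of arithmetic. The sketch below assumes float32 storage (4 bytes per value) and a HexPlane-style layout with one 2D plane per axis pair (XY, XZ, YZ, XT, YT, ZT); the function names are made up for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: features are stored as float32

def dense_grid_bytes(x, y, z, t, channels):
    """Memory for a dense 4D feature grid."""
    return x * y * z * t * channels * BYTES_PER_FLOAT32

def hexplane_bytes(x, y, z, t, channels):
    """Memory for six 2D feature planes, one per pair of axes."""
    spatial = x * y + x * z + y * z
    temporal = x * t + y * t + z * t
    return (spatial + temporal) * channels * BYTES_PER_FLOAT32

dense = dense_grid_bytes(256, 256, 256, 100, 32)
planes = hexplane_bytes(256, 256, 256, 100, 32)
print(f"dense grid: {dense / 1e9:.1f} GB")
print(f"hexplane:   {planes / 1e6:.1f} MB")
print(f"ratio:      {dense / planes:.0f}x")
```

Changing the resolution or channel count in the two calls shows how quickly the gap widens: the dense grid grows with the product of all four axes, the planes only with pairwise products.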
Coarse-to-fine training is essential. The system starts with low-resolution hexplanes (64×64) for the first 5,000 iterations, learning global motion patterns. Then it doubles resolution to 128×128, refining mid-frequency details. Final iterations use 256×256 planes for fine-grained deformations. This progressive approach mirrors wavelet compression—early iterations establish the DC component while later stages add higher-frequency corrections. Without this schedule, the deformation field gets stuck in local minima where Gaussians barely move from their initial positions.
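One way to express such a schedule is a small lookup from iteration count to plane resolution. This is a sketch only: the 5,000-iteration switch to 128×128 comes from the description above, but the 10,000-iteration switch to 256×256 is a placeholder assumption, and plane_resolution is a hypothetical helper rather than anything from the codebase.

```python
# (start_iteration, resolution) pairs. The 10_000 threshold is a placeholder
# assumption; only the 5_000 switch is stated in the text.
DEFAULT_SCHEDULE = ((0, 64), (5_000, 128), (10_000, 256))

def plane_resolution(iteration, schedule=DEFAULT_SCHEDULE):
    """Return the hexplane resolution active at a given training iteration."""
    resolution = schedule[0][1]
    for start, res in schedule:
        if iteration >= start:
            resolution = res
    return resolution

for it in (0, 4_999, 5_000, 12_000):
    print(it, plane_resolution(it))
```

At each switch, the existing planes would typically be upsampled to the new resolution so that the coarse solution carries over instead of being relearned.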
The custom CUDA rasterizer handles depth-sorting and alpha-blending with temporal awareness. Unlike standard 3DGS which can cache tile assignments across frames, 4DGaussians must recompute visibility for each timestamp since Gaussians move. The implementation uses the depth-diff-gaussian-rasterization module, which extends the original rasterizer with depth output needed for proper occlusion handling:
# Rendering loop for dynamic scene
# (optimizer is assumed to hold the hexplane and MLP parameters)
for timestamp in range(num_frames):
    optimizer.zero_grad()
    # Query deformations
    deformed_gaussians = apply_deformations(
        base_gaussians, timestamp, hexplanes, deformation_mlp
    )
    # Rasterize with depth awareness
    rendered_image, rendered_depth = rasterize(
        means3D=deformed_gaussians.positions,
        shs=deformed_gaussians.sh_coeffs,
        scales=deformed_gaussians.scales,
        rotations=deformed_gaussians.rotations,
        opacities=deformed_gaussians.opacities,
        viewmatrix=cameras[timestamp].viewmatrix
    )
    # Compute loss and backprop through entire pipeline
    loss = mse_loss(rendered_image, gt_images[timestamp])
    loss.backward()  # Gradients flow to hexplane features and MLP
    optimizer.step()
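The depth sorting and alpha blending that the rasterizer performs can be illustrated for a single pixel in plain Python. This is a conceptual sketch, not the CUDA implementation: the real rasterizer sorts and blends per tile on the GPU, and the (depth, alpha, rgb) tuple layout and the composite_pixel name are assumptions for this example.

```python
def composite_pixel(splats, min_transmittance=1e-4):
    """Blend (depth, alpha, rgb) contributions for one pixel, nearest first.

    Returns the accumulated color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for depth, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha  # light that reaches this splat
        color = [c + weight * ch for c, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # pixel is effectively opaque, skip what lies behind
    return color, transmittance

# A blue splat behind a red one, both half transparent
splats = [(2.0, 0.5, (0.0, 0.0, 1.0)),
          (1.0, 0.5, (1.0, 0.0, 0.0))]
color, remaining = composite_pixel(splats)
print(color, remaining)
```

Because the blend weight depends on everything in front of a splat, the sort order has to be recomputed whenever Gaussians move, which is why each timestamp needs its own visibility pass.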
One subtle implementation detail: the codebase downsamples COLMAP point clouds to under 40,000 points before initialization. With static scenes, more points generally improve quality. But for dynamic scenes, excessive initialization causes instability—the deformation network must coordinate hundreds of thousands of Gaussians, and small gradient noise gets amplified into chaotic motion. Starting sparse and letting adaptive densification add Gaussians during training produces far more stable convergence.
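A simple way to cap the initial point count is uniform random subsampling, sketched below. Random subsampling is swapped in here purely for illustration; the codebase may use a different strategy (voxel-based downsampling, for example), and downsample_points is a hypothetical helper.

```python
import random

def downsample_points(points, max_points=40_000, seed=0):
    """Return at most max_points points, chosen uniformly at random.

    points: list of (x, y, z) tuples. A fixed seed keeps runs reproducible.
    """
    if len(points) <= max_points:
        return list(points)
    rng = random.Random(seed)
    return rng.sample(points, max_points)

# Synthetic stand-in for a dense COLMAP point cloud
cloud = [(float(i), float(i) * 0.5, 0.0) for i in range(100_000)]
sparse = downsample_points(cloud)
print(len(cloud), "->", len(sparse))
```

Adaptive densification can then add Gaussians back where the reconstruction error demands them.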
The hexplane features themselves are learned end-to-end via backpropagation through the rasterizer. No supervision on motion or correspondence is required. The network discovers deformation patterns purely from photometric reconstruction loss across multiple views and timestamps. This is powerful but also means the representation encodes motion implicitly—you cannot directly manipulate the deformation field without retraining.
Gotcha
The synchronized multi-view requirement is a deal-breaker for many applications. You need at least 4-6 cameras capturing simultaneously with hardware-level frame synchronization. Consumer phone videos won't work. The codebase assumes your camera rig outputs frames with identical timestamps, and even slight desynchronization (>5ms) causes ghosting artifacts because the deformation network tries to explain parallax as motion.
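Before training, it can be worth measuring how far apart the cameras actually fire. The sketch below assumes you can export a per-camera list of frame timestamps in milliseconds; that data layout, the max_desync_ms helper, and the use of the 5ms figure as a threshold are illustrative assumptions, not part of the codebase.

```python
def max_desync_ms(timestamps_ms_per_camera):
    """Largest spread between cameras for any single frame index.

    timestamps_ms_per_camera: one list of timestamps (ms) per camera,
    aligned by frame index.
    """
    num_frames = min(len(ts) for ts in timestamps_ms_per_camera)
    worst = 0.0
    for frame in range(num_frames):
        stamps = [ts[frame] for ts in timestamps_ms_per_camera]
        worst = max(worst, max(stamps) - min(stamps))
    return worst

cameras_ms = [
    [0.0, 33.3, 66.7],  # camera A
    [1.0, 34.0, 74.0],  # camera B drifts on the third frame
]
worst = max_desync_ms(cameras_ms)
print(f"worst desync: {worst:.1f} ms, within 5 ms: {worst <= 5.0}")
```

A rig that fails this kind of check is better fixed at capture time than compensated for afterwards.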
COLMAP preprocessing is where most users hit walls. The provided scripts expect near-perfect structure-from-motion reconstruction. Motion blur, rolling shutter, reflective surfaces, or insufficient texture all cause COLMAP to fail silently—it produces a sparse point cloud that looks reasonable but has subtle scale drift or coordinate system errors that completely break training. The error messages are cryptic ("loss exploded to nan at iteration 3"), and debugging requires manually inspecting COLMAP's camera poses and point cloud in external tools. The repository provides zero guidance on handling common failure modes.
Memory scaling remains problematic despite hexplane efficiency. A 300-frame sequence at 256×256 resolution exhausts 24GB VRAM on an RTX 3090. The codebase provides no temporal batching or streaming mechanisms—you must fit all hexplane features in GPU memory simultaneously. For long captures like dance performances or sports events, you either need A100-class GPUs or must manually split sequences into overlapping chunks and stitch results.
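Splitting a long capture into overlapping chunks is straightforward to sketch. The chunk size and overlap below are arbitrary example values, and chunk_ranges is a hypothetical helper; how the overlapping frames are blended when stitching is left open.

```python
def chunk_ranges(num_frames, chunk_size, overlap):
    """Split [0, num_frames) into overlapping half-open (start, end) ranges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if num_frames <= 0:
        return []
    ranges = []
    start = 0
    step = chunk_size - overlap
    while True:
        end = min(start + chunk_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start += step
    return ranges

chunks = chunk_ranges(num_frames=300, chunk_size=100, overlap=10)
print(chunks)
```

Each chunk would be trained as its own model, with the shared frames giving some room to cross-fade between neighbouring chunks at playback time.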
Topology changes produce visible artifacts. If an object enters the frame halfway through the sequence, there are no Gaussians initialized at that location, and the deformation field cannot conjure them from nothing. You see blurry smearing as nearby Gaussians stretch unnaturally to cover the new region. Similarly, objects leaving the frame leave behind ghost Gaussians that fade slowly rather than disappearing cleanly. The architecture fundamentally assumes a fixed set of Gaussians that deform—it cannot handle creation or destruction of scene elements.
Verdict
Use if: You have a synchronized multi-view camera rig (or can rent time at a capture studio), need interactive rendering for VR/AR applications, and your scenes involve continuous deformation rather than topology changes. The training speed and rendering performance are unmatched for this use case. Also use if you're building on top of Gaussian splatting infrastructure and need to add temporal modeling, since the hexplane approach integrates cleanly.
Skip if: You only have monocular video (use Nerfies or DynIBaR instead), need to handle objects appearing/disappearing (HyperNeRF handles this better), require semantic editing capabilities (consider SC-GS or deformable variants with explicit skeletons), or cannot invest significant time debugging COLMAP preprocessing. This is a research prototype that proves hexplane factorization works brilliantly but needs substantial engineering for production deployment beyond academic datasets.