Dreambooth-Stable-Diffusion: Teaching AI Models Your Face with 5 Photos
Hook
In August 2022, Google researchers showed they could teach a text-to-image model to generate a specific subject from just 3-5 photos, without catastrophically forgetting everything else it knew. The trick? Training on synthetic regularization images while the model learns your face.
Context
Before Dreambooth, personalizing text-to-image models was either impossible or required massive datasets. Google's Imagen could generate photorealistic images from text, but you couldn't teach it about your dog, your product, or your face. The original Dreambooth paper from Google Research demonstrated few-shot personalization on their proprietary Imagen model, but the weights were locked away in a research lab.
When Xavier Xiao released this implementation in September 2022—just weeks after the Dreambooth paper—it was revolutionary. Stable Diffusion had only been open-sourced a month earlier, and suddenly anyone with access to cloud GPUs could personalize a state-of-the-art diffusion model. This repo adapted Dreambooth's core insight to Stable Diffusion's architecture: fine-tune the entire UNet on a handful of subject images while using class-specific regularization images to prevent the model from forgetting its prior knowledge. It was scrappy, built atop the Textual Inversion codebase with minimal modifications, but it worked.
Technical Insight
The architectural philosophy here diverges sharply from Textual Inversion. Instead of optimizing a new token embedding in the text encoder's vocabulary space (a parameter-efficient ~5KB change), Dreambooth fine-tunes all 860 million parameters of Stable Diffusion's UNet diffusion model. This is the nuclear option—maximum subject fidelity at the cost of training time, memory, and overfitting risk.
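To make the size gap concrete, here is a back-of-the-envelope sketch. It assumes fp32 storage and a single 768-dimensional token embedding (the width of the CLIP text encoder used by Stable Diffusion v1.x); the 860M figure is the one quoted above. The function and variable names are illustrative only, and the raw tensor size ignores any file-format overhead.

```python
BYTES_PER_FP32 = 4


def fp32_size_bytes(num_params: int) -> int:
    """Raw storage for num_params float32 values (no serialization overhead)."""
    return num_params * BYTES_PER_FP32


# Textual Inversion: one new token embedding (assumed 768-dim for SD v1.x).
embedding_params = 768
# Dreambooth: every UNet weight is trainable (figure quoted in the text).
unet_params = 860_000_000

embedding_bytes = fp32_size_bytes(embedding_params)
unet_bytes = fp32_size_bytes(unet_params)

print(f"Textual Inversion: {embedding_bytes / 1024:.1f} KiB trainable")
print(f"Dreambooth UNet:   {unet_bytes / 1e9:.2f} GB trainable")
print(f"Ratio: ~{unet_params // embedding_params:,}x more parameters")
```

The comparison covers trainable weights only; gradients and optimizer state multiply the Dreambooth side further during training.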
The core training loop pairs your subject images with generated regularization images of the same class. If you're training on photos of your dog, the model simultaneously trains on synthetic images of random dogs. This class-preservation loss (the paper calls it a prior-preservation loss) guards against language drift, where the model forgets what "dog" means and only knows your specific pet. The implementation uses a rare identifier token (hardcoded as 'sks') to bind the concept: "a photo of sks dog" becomes your training prompt.
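Here's the critical training setup, condensed into Diffusers-style pseudocode (the repo itself sets these values through its PyTorch Lightning configuration rather than in a single script):
# Illustrative summary, not a verbatim excerpt from the repo.
# `model` is assumed to be an already-loaded Stable Diffusion model
# exposing a Diffusers-style `unet` attribute.
from torch.optim import AdamW

# Subject images: 3-5 photos of your dog
subject_prompt = "a photo of sks dog"
learning_rate = 1e-6  # Conservative to limit overfitting
steps = 800

# Regularization images: 200+ dog images generated by the base model
class_prompt = "a photo of a dog"
class_weight = 1.0  # Regularization loss weighted equally with subject loss

# Full UNet fine-tuning with gradient checkpointing
model.unet.enable_gradient_checkpointing()
optimizer = AdamW(model.unet.parameters(), lr=learning_rate)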
The regularization mechanism is elegant but computationally expensive. Before training, you generate 200-1000 images using the base model with just the class prompt ("a photo of a dog"). During training, each batch contains both subject images ("a photo of sks dog") and regularization images ("a photo of a dog"). The model learns to associate 'sks' with your specific subject while the regularization images anchor the general "dog" concept in the latent space.
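The combined objective described here can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the usual noise-prediction formulation (mean squared error between predicted and true noise); the function names, the `reg_weight` parameter, and the toy vectors are hypothetical stand-ins, not code from the repo.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(
    subject_pred: Sequence[float],
    subject_noise: Sequence[float],
    class_pred: Sequence[float],
    class_noise: Sequence[float],
    reg_weight: float = 1.0,
) -> float:
    """Subject denoising loss plus a weighted class (prior) preservation loss."""
    subject_loss = mse(subject_pred, subject_noise)  # "a photo of sks dog"
    prior_loss = mse(class_pred, class_noise)        # "a photo of a dog"
    return subject_loss + reg_weight * prior_loss


# Toy numbers standing in for predicted vs. true noise on each half of a batch.
loss = dreambooth_loss(
    subject_pred=[0.5, 0.0], subject_noise=[0.0, 0.0],
    class_pred=[1.0, 1.0], class_noise=[1.0, 0.0],
)
print(loss)
```

With `reg_weight` at 1.0 the two terms pull equally: lowering it favors subject fidelity, raising it favors keeping the generic class intact.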
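Gradient checkpointing trades computation for memory: instead of storing all activations during the forward pass, it recomputes them during backpropagation. This is what brings training within reach of high-end consumer GPUs, though the 12-16GB range is optimistic for a full fp32 fine-tune, since weights, gradients, and AdamW state for 860 million parameters come to roughly 14GB before any activations are stored. Training still takes 15+ minutes on dual A6000s for 800 steps. Without checkpointing, you'd need 40GB+ VRAM to hold the UNet's activation graph.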
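The store-versus-recompute trade can be shown on a toy model. The sketch below is not the repo's code or PyTorch's implementation: it uses a hypothetical chain of scalar `tanh` layers, keeps only one input per segment during the forward pass, and recomputes each segment's activations when backpropagating. The activation counts are a rough proxy for memory.

```python
import math
from typing import List, Tuple


def forward(x: float, weights: List[float]) -> List[float]:
    """Run a chain of tanh(w * h) layers, returning the input and every output."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts


def backward(acts: List[float], weights: List[float], grad: float) -> float:
    """Propagate grad from the chain's output back to its input."""
    for i in reversed(range(len(weights))):
        out = acts[i + 1]
        grad *= (1.0 - out * out) * weights[i]  # d tanh(w*a)/da
    return grad


def grad_full(x: float, weights: List[float]) -> Tuple[float, int]:
    """Baseline: keep every activation alive until the backward pass."""
    acts = forward(x, weights)
    return backward(acts, weights, 1.0), len(acts)


def grad_checkpointed(x: float, weights: List[float], segment: int) -> Tuple[float, int]:
    """Keep one checkpoint per segment; recompute the rest during backward."""
    starts = list(range(0, len(weights), segment))
    checkpoints = []
    h = x
    for s in starts:
        checkpoints.append(h)                        # store only the segment input
        h = forward(h, weights[s:s + segment])[-1]   # intermediates are discarded
    grad, peak = 1.0, 0
    for s, ckpt in zip(reversed(starts), reversed(checkpoints)):
        seg = weights[s:s + segment]
        acts = forward(ckpt, seg)                    # recompute this segment
        peak = max(peak, len(checkpoints) + len(acts))
        grad = backward(acts, seg, grad)
    return grad, peak


weights = [0.9 + 0.05 * i for i in range(16)]
g_full, stored_full = grad_full(0.3, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, segment=4)
print(g_full, g_ckpt, stored_full, stored_ckpt)
```

Both paths produce the same gradient; the checkpointed one holds fewer activations at once but runs each segment's forward pass a second time.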
The implementation's biggest architectural weakness is its monolithic approach. The code modifies main.py from Textual Inversion with minimal refactoring, leaving dead code paths for embedding optimization that never execute. The rare token 'sks' is hardcoded in multiple places rather than parameterized, and there's no automatic token selection from the vocabulary to ensure true rarity. If you want to use a different identifier, you're grepping through Python files and modifying string literals.
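For contrast, parameterizing the identifier takes very little code. The sketch below is a hypothetical refactor, not something present in the repo: `PromptConfig` and its method names are invented for illustration, and "zwx" is an arbitrary example identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptConfig:
    """Holds the two strings that the training prompts are built from."""

    identifier: str = "sks"  # rare token bound to the subject
    class_name: str = "dog"  # coarse class used for regularization images

    def subject_prompt(self) -> str:
        """Prompt paired with the user's own photos."""
        return f"a photo of {self.identifier} {self.class_name}"

    def class_prompt(self) -> str:
        """Prompt used to generate and train on regularization images."""
        return f"a photo of a {self.class_name}"


default = PromptConfig()
custom = PromptConfig(identifier="zwx", class_name="cat")

print(default.subject_prompt())
print(custom.subject_prompt())
print(custom.class_prompt())
```

Because the class prompt never contains the identifier, swapping the token changes only the subject side of each batch.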
The learning rate (1e-6) is notably conservative—an order of magnitude lower than typical fine-tuning scenarios. This reflects the fundamental tension in Dreambooth: high fidelity to your subject requires aggressive adaptation, but that same aggressiveness destroys the model's ability to compose your subject with novel attributes. In practice, the results show strong subject preservation but weaker editability—changing your dog's color or placing it in unusual contexts often fails or produces artifacts.
Gotcha
The hardcoded 'sks' identifier is more than a code smell; it's a conceptual limitation. The Dreambooth paper emphasizes using rare tokens to minimize collision with existing concept associations in the model's latent space. But 'sks' might not be optimally rare, and if multiple people fine-tune models with the same identifier, you can't easily compose or merge their custom concepts. Changing the token requires editing multiple files and retraining from scratch (the regularization images, generated from the class prompt alone, can be reused), making experimentation tedious.
More fundamentally, full UNet fine-tuning is a blunt instrument. With 860 million parameters updating over 800 steps on just 3-5 images, overfitting is hard to avoid. The model tends to memorize your exact training images rather than learn a generalizable representation of your subject. Results are impressive when prompts closely match training data ("a photo of sks dog in grass"), but degrade rapidly for novel compositions ("a watercolor painting of sks dog wearing a hat"). The regularization images help but don't fully solve this: you're asking a model with nearly a billion parameters to generalize from a handful of examples while not forgetting its original training. The class-preservation loss is a band-aid on a fundamental few-shot learning problem.
Verdict
Use if: You're studying the evolution of diffusion model fine-tuning techniques, need to understand early Dreambooth implementations for research purposes, or are working with the original Stable Diffusion v1.x architecture where modern tooling doesn't apply. This repo is pedagogically valuable—the code is straightforward PyTorch Lightning, and tracing through the training loop teaches you exactly how class-preservation regularization works at a mechanical level.
Skip if: You need production-ready personalization, want memory-efficient training, or value your time. Hugging Face Diffusers offers a cleaner Dreambooth implementation with LoRA support (fine-tuning low-rank adapters instead of full weights, cutting trainable parameters from 860M to roughly 3M), better documentation, and active maintenance. Kohya's sd-scripts provide industrial-grade training with aggressive optimizations like 8-bit Adam and custom schedulers. For non-technical users, the Automatic1111 WebUI with Dreambooth extensions delivers one-click training. This repo's historical significance is undeniable, since it democratized personalized AI art in 2022, but in 2024 it's a museum piece. Learn from it, then use something modern.
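The parameter savings from LoRA follow from simple arithmetic: a rank-r adapter on a d_in x d_out weight matrix trains two thin matrices instead of the full one. The sketch below uses a hypothetical 1280 x 1280 projection and rank 4 purely as an illustration; actual layer shapes and ranks depend on the model and the training configuration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_in x d_out matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` adapter: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 1280, 1280, 4

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tune: {full:,} trainable weights")
print(f"LoRA rank {rank}:   {lora:,} trainable weights")
print(f"reduction:      {full // lora}x")
```

Summed over every adapted layer, this per-layer reduction is what shrinks the trainable set from hundreds of millions of weights to a few million.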