Add-it: Training-Free Object Insertion That Beats Fine-Tuned Models

Hook

A training-free method just beat supervised fine-tuned models at object insertion. Add-it from NVIDIA achieves this by making diffusion models pay attention to three things at once—something they were never explicitly trained to do.

Context

Object insertion is deceptively hard. Anyone can paste a dog onto a photo, but making that dog appear in a plausible location—on the couch, not floating above it—while preserving the original scene’s lighting, perspective, and fine details requires semantic understanding. Existing approaches typically fall into two camps: supervised methods that require expensive task-specific training, or general-purpose editing tools that struggle with natural placement. Paint-by-Example needs reference images. InstructPix2Pix often mangles the original scene. GLIGEN requires manual bounding boxes, forcing users to know exactly where objects should go.

Add-it, developed by researchers at NVIDIA, Tel Aviv University, and Bar-Ilan University and accepted to ICLR 2025, takes a different approach entirely. It’s training-free, working directly with pretrained Stable Diffusion models by extending their attention mechanism to simultaneously consider the source scene, the text prompt, and the image being generated. The result is a method that outperforms supervised approaches on both synthetic and real image benchmarks, winning human preference evaluations over 80% of the time. The team even constructed a new benchmark—the Additing Affordance Benchmark—specifically to test whether inserted objects land in semantically plausible locations, like putting a lamp on a table rather than hovering mid-air.

Technical Insight

The core innovation in Add-it is its weighted extended-attention mechanism that modifies how diffusion models process information during generation. Standard diffusion models condition on text prompts through cross-attention, but Add-it extends this to incorporate three sources: the original scene latents, the text embedding, and the currently-generated latents. This happens without any gradient updates or fine-tuning.
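That combination can be sketched roughly as follows. This is a conceptual illustration under the assumption that the extended (source-scene) branch is up-weighted before the softmax; it is not the actual Add-it code, and the function and variable names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(q, k_self, v_self, k_src, v_src, extended_scale=1.05):
    """Conceptual sketch of weighted extended attention (not Add-it's code).

    The query attends over keys/values from BOTH the image being generated
    (k_self/v_self) and the source scene (k_src/v_src); the source branch
    is up-weighted by extended_scale before the softmax.
    """
    k = np.concatenate([k_self, extended_scale * k_src], axis=0)
    v = np.concatenate([v_self, v_src], axis=0)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v
```

The text prompt enters through the usual cross-attention path; the sketch only shows how the source scene's keys and values are folded into the attention over image tokens.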

The workflow differs depending on whether you’re working with generated or real images. For generated images, Add-it first creates a source image from your base prompt, then uses that as structural guidance when generating the target with the new object. Here’s what the CLI invocation looks like:

python run_CLI_addit_generated.py \
    --prompt_source "A photo of a cat sitting on the couch" \
    --prompt_target "A photo of a cat wearing a red hat sitting on the couch" \
    --subject_token "hat" \
    --seed_src 6311 \
    --seed_obj 1 \
    --extended_scale 1.05 \
    --structure_transfer_step 2 \
    --blend_steps 15

The structure_transfer_step parameter controls when the model stops copying structure from the source and starts generating freely; lower values (like 2 for generated images) allow more flexibility. The extended_scale parameter weights how much the extended attention influences the output; values around 1.05-1.1 provide enough guidance without overwhelming the generation. The blend_steps parameter determines at which denoising steps information from the source image is blended in: [15] blends at a single step, while an empty list lets the whole scene change.
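A toy schedule makes the interaction of these parameters concrete. This is a hypothetical sketch of the control flow, not the repository's implementation:

```python
def denoising_schedule(total_steps=50, structure_transfer_step=2, blend_steps=(15,)):
    """Hypothetical sketch: which action each denoising step takes.

    Before structure_transfer_step, structure is copied from the source;
    at each step listed in blend_steps, source information is blended in;
    an empty blend_steps leaves the scene free to change at every step
    past the structure-transfer phase.
    """
    actions = []
    for t in range(total_steps):
        if t < structure_transfer_step:
            actions.append("copy structure from source")
        elif t in blend_steps:
            actions.append("blend source latents")
        else:
            actions.append("generate freely")
    return actions
```

Raising structure_transfer_step stretches the first phase; emptying blend_steps removes the second entirely, which is why an empty list amounts to allowing full scene changes.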

For real images, the process includes an inversion step to map the photograph into diffusion latent space, then applies the same extended-attention mechanism. The parameters shift slightly—structure_transfer_step increases to 4 and extended_scale to 1.1 for real images because photographs need stronger structural preservation:

python run_CLI_addit_real.py \
    --source_image "images/bed_dark_room.jpg" \
    --prompt_source "A photo of a bed in a dark room" \
    --prompt_target "A photo of a dog lying on a bed in a dark room" \
    --subject_token "dog" \
    --structure_transfer_step 4 \
    --extended_scale 1.1 \
    --blend_steps 18

The localization component determines where objects should appear. Add-it supports multiple backends through the localization_model parameter. The default, attention_points_sam, uses cross-attention maps to identify regions corresponding to the subject token, then refines these with SAM (Segment Anything Model). Alternative options include pure attention-based localization, attention with bounding boxes fed to SAM, or Grounding DINO with SAM for language-grounded detection. This modularity means you can swap localization strategies without changing the core insertion mechanism.
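A dispatcher over these backends might look like the sketch below. Only attention_points_sam is the documented default; the other key strings and the one-line descriptions are illustrative, not the repository's actual identifiers:

```python
def describe_localization(backend="attention_points_sam"):
    """Hypothetical map from localization_model choices to their strategy.

    Key names other than the documented default are illustrative.
    """
    backends = {
        "attention": "peak of the subject token's cross-attention map",
        "attention_points_sam": "attention peaks used as point prompts for SAM",
        "attention_box_sam": "attention-derived bounding box fed to SAM",
        "grounding_sam": "Grounding DINO box fed to SAM",
    }
    if backend not in backends:
        raise ValueError(f"unknown localization backend: {backend!r}")
    return backends[backend]
```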

What makes this training-free is that it operates entirely through inference-time modifications to the attention layers. The weights of Stable Diffusion remain frozen. Instead, Add-it dynamically adjusts attention outputs by combining keys and values from multiple sources under a fixed weighting (the extended_scale parameter); nothing is learned. This is conceptually similar to how classifier-free guidance steers generation by combining conditional and unconditional predictions, but applied to spatial attention rather than to noise prediction.

The prompt engineering requirements are specific: your target prompt should be structurally similar to the source prompt but include the new object. The subject token must be a single word that appears in the target prompt. This constraint exists because Add-it uses attention maps from that specific token to determine placement. If you tokenize “red hat” as two tokens, the localization becomes ambiguous. The implementation expects clear semantic signals.
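A pre-flight check for these constraints is easy to write. This helper is hypothetical, not part of the Add-it CLI:

```python
def validate_subject_token(subject_token, prompt_target):
    """Hypothetical pre-flight check for Add-it's prompt constraints:
    the subject token must be a single word that appears in the target prompt.
    """
    if len(subject_token.split()) != 1:
        raise ValueError("subject_token must be a single word")
    words = {w.strip(".,;:!?").lower() for w in prompt_target.split()}
    if subject_token.lower() not in words:
        raise ValueError("subject_token must appear in prompt_target")
    return True
```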

Gotcha

Add-it’s training-free nature is both its strength and weakness. Success depends heavily on seed values—the seed_obj parameter in particular determines the random noise that generates the new object. Unlike deterministic editing tools, you may need to try multiple seeds before getting satisfactory results. The documentation doesn’t provide guidance on seed ranges or strategies, so expect trial-and-error.
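In practice that means wrapping the CLI in a sweep. A minimal sketch (the helper name is mine; the flags match the invocations above):

```python
def seed_sweep(base_cmd, seeds=range(1, 6)):
    """Generate one Add-it invocation per candidate seed_obj value,
    for manual or scripted trial-and-error over seeds."""
    return [f"{base_cmd} --seed_obj {s}" for s in seeds]

commands = seed_sweep(
    'python run_CLI_addit_generated.py '
    '--prompt_source "A photo of a cat sitting on the couch" '
    '--prompt_target "A photo of a cat wearing a red hat sitting on the couch" '
    '--subject_token "hat"'
)
```

Run the candidates, inspect the outputs, and keep the seed that places the object best.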

Prompt engineering is finicky. Your source and target prompts must be carefully aligned: too different and structural preservation breaks down; too similar and the object may not appear.

The single-token requirement for subject_token is limiting. Want to add "red hat"? You'll need to pick "hat", or try a compound like "redhat" and hope the tokenizer treats it as one unit. Multi-word objects require creative workarounds not documented in the README.

Parameter tuning can be non-intuitive: if your object doesn't appear, decrease structure_transfer_step; if it lands in the wrong location, adjust extended_scale. The defaults differ between generated images (1.05, step 2, blend [15]) and real images (1.1, step 4, blend [18]), suggesting these parameters interact in complex ways with the input type. And while blend_steps defaults to single values like [15] or [18], passing empty brackets changes behavior entirely to "allow changes in the input image", a significant functional difference buried in optional arguments.

Verdict

Use Add-it if you need state-of-the-art object insertion quality and have time to iterate on parameters. It genuinely beats supervised methods without training costs, making it well suited for research, creative tools where users can regenerate with different seeds, or applications where you can build wrapper logic to automate parameter search. The modular localization system is elegant if you want to experiment with different placement strategies, and the Jupyter notebooks provide good starting points for integration into larger pipelines.

Skip it if you need deterministic, one-shot results or want a zero-configuration solution. The seed dependency, prompt constraints, and parameter sensitivity mean this is for developers comfortable reading diffusion papers, not for drag-and-drop user interfaces. If your use case is "add object X to 10,000 images overnight," the manual tuning requirements will frustrate you. For controlled creative applications where quality trumps automation, Add-it delivers results that justify the friction.
