TangoFlux: Generating 30 Seconds of Studio-Quality Audio in 3 Seconds

Hook

While AudioLDM 2 takes approximately 25 seconds to generate 10 seconds of audio, TangoFlux generates 30 seconds of audio in just 3.7 seconds—on the same GPU. The secret isn’t just optimization; it’s a fundamental rethinking of how diffusion models approach audio generation.

Context

Text-to-audio generation has historically been the slower sibling of text-to-image synthesis. While DALL-E and Stable Diffusion can generate images in seconds, audio models have lagged behind—AudioLDM 2 requires 200 diffusion steps and approximately 25 seconds to produce a 10-second clip. This speed barrier has kept text-to-audio out of interactive applications and real-time workflows.

TangoFlux, developed by declare-lab with acknowledged support from Stability AI, tackles this performance wall by replacing traditional diffusion with rectified flow matching. The result is a model with approximately 515 million parameters that generates 44.1kHz stereo audio in roughly one-seventh the time of its predecessors while maintaining, and often exceeding, their quality metrics. It achieves a Fréchet Distance of 75.1 and a CLAP score of 0.480, outperforming larger models such as AudioLDM 2 (712M parameters) and its own predecessor, Tango 2. The paper has been accepted to ICLR 2026, giving the approach peer-reviewed validation. For researchers and developers building audio synthesis pipelines, TangoFlux represents a significant advance in text-to-audio generation speed.

Technical Insight

TangoFlux’s architecture combines Diffusion Transformers (DiT) and Multimodal Diffusion Transformers (MMDiT) in what the team calls FluxTransformer blocks. Unlike traditional diffusion models that learn to denoise Gaussian noise over hundreds of steps, TangoFlux learns rectified flow trajectories—straight-line paths in latent space from noise to target audio. This geometric simplification is why the model needs only 50 inference steps instead of 200, directly translating to faster generation times.
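The rectified-flow idea can be sketched in a few lines: sample a point on the straight line between Gaussian noise and the target latent, and regress a velocity field onto the constant direction between the two endpoints. This is a conceptual toy (using a small NumPy vector in place of a real audio latent), not TangoFlux's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_pair(x1, rng):
    """Build one rectified-flow training example.

    x1: target latent (here a toy vector); x0: Gaussian noise.
    The trajectory is the straight line x_t = (1 - t) * x0 + t * x1,
    so the velocity target is constant along the path: x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint
    t = rng.uniform()                   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # point on the straight path
    v_target = x1 - x0                  # velocity the model must predict
    return t, x_t, v_target

# Toy "latent" standing in for a VAE-encoded audio clip.
x1 = rng.standard_normal(8)
t, x_t, v = rectified_flow_pair(x1, rng)

# Because the path is straight, a single Euler step with the true
# velocity lands exactly on the data endpoint from any x_t:
x_recovered = x_t + (1 - t) * v
assert np.allclose(x_recovered, x1)
```

The straightness is the whole point: curved diffusion trajectories need many small steps to follow accurately, while a near-straight path tolerates large steps, which is why 50 steps suffice where denoising diffusion needed 200.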

The model operates on latent representations encoded by a Variational Autoencoder (VAE) rather than raw waveforms. This VAE compresses audio into a lower-dimensional space, allowing the transformer to work with more manageable representations while targeting 44.1kHz stereo output. The conditioning mechanism is elegantly simple: textual prompts are encoded and fed into the transformer alongside duration embeddings, giving developers explicit control over output length from 1 to 30 seconds.
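One common way to expose duration as a conditioning input is to embed the scalar (much like diffusion timesteps) and feed it alongside the text encoding. The sketch below shows a generic sinusoidal embedding; TangoFlux's exact conditioning layers may differ:

```python
import numpy as np

def sinusoidal_embedding(value, dim=16, max_period=10000.0):
    """Embed a scalar (here, duration in seconds) as a fixed-size
    vector, the same scheme widely used for diffusion timesteps."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Different durations map to distinct vectors, giving the transformer
# an explicit length signal next to the text-prompt embedding.
e10 = sinusoidal_embedding(10.0)
e30 = sinusoidal_embedding(30.0)
assert e10.shape == (16,)
assert not np.allclose(e10, e30)
```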

Using TangoFlux is refreshingly straightforward. The Python API needs only a few lines of code to generate audio:

import torchaudio
from tangoflux import TangoFluxInference

# Load the pretrained checkpoint from the Hugging Face Hub
model = TangoFluxInference(name='declare-lab/TangoFlux')

# Generate 10 seconds of audio with 50 rectified-flow steps
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

# Write the result out at 44.1kHz
torchaudio.save('output.wav', audio, 44100)

Notice the explicit duration parameter—this is a key differentiator. While many text-to-audio models generate fixed-length outputs, TangoFlux treats duration as a first-class conditioning input. This design choice enables variable-length generation without retraining and gives developers precise control over output length, essential for applications where timing matters.

The three-stage training pipeline reveals sophisticated thinking about alignment. Stage one is standard pre-training on audio-caption pairs. Stage two fine-tunes on higher-quality data. But stage three—CRPO, or CLAP-Ranked Preference Optimization (detailed in the research paper)—is where TangoFlux gets interesting. According to the README, CRPO iteratively generates synthetic audio candidates for each caption, ranks them using CLAP (Contrastive Language-Audio Pretraining) scores, and constructs preference pairs. These pairs then train the model using Direct Preference Optimization (DPO) loss adapted for flow matching. This iterative synthetic data generation and preference learning loop is conceptually similar to RLHF in language models but tailored for the continuous nature of audio generation. The team has released the CRPO dataset and generation scripts, allowing researchers to reproduce or extend this preference optimization approach.
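The shape of one CRPO round can be illustrated with stubs: sample several candidates per caption, rank them by a CLAP-style score, and keep the best and worst as a preference pair. Here `generate_candidates` and `clap_score` are hypothetical stand-ins (a toy norm-based score instead of a real CLAP embedding), not functions from the TangoFlux codebase:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_candidates(caption, n=4, dim=8):
    """Stand-in for sampling n audio latents from the model."""
    return [rng.standard_normal(dim) for _ in range(n)]

def clap_score(caption, audio):
    """Stub for CLAP text-audio alignment. A real pipeline would embed
    both caption and audio with CLAP and take cosine similarity; this
    toy score just prefers small-norm latents so ranking is testable."""
    return -float(np.linalg.norm(audio))

def crpo_pairs(captions):
    """One CRPO round: sample candidates per caption, rank by score,
    and keep (chosen, rejected) = (best, worst) as a DPO-style pair."""
    pairs = []
    for caption in captions:
        cands = generate_candidates(caption)
        ranked = sorted(cands, key=lambda a: clap_score(caption, a),
                        reverse=True)
        pairs.append({"caption": caption,
                      "chosen": ranked[0],
                      "reject": ranked[-1]})
    return pairs

pairs = crpo_pairs(["Hammer slowly hitting the wooden table"])
best, worst = pairs[0]["chosen"], pairs[0]["reject"]
assert np.linalg.norm(best) <= np.linalg.norm(worst)
```

Iterating this loop, generate, rank, train on pairs, regenerate with the improved model, is what makes CRPO a self-improvement cycle rather than a one-shot fine-tune.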

For developers wanting to integrate TangoFlux into existing pipelines, the CLI offers a zero-code option: tangoflux "Hammer slowly hitting the wooden table" output.wav --duration 10 --steps 50. There’s also a web interface launched via tangoflux-demo for rapid prototyping. The README notes that inference with 50 steps yields the best results in their evaluation, 25 steps provides similar quality at higher speed, and CFG scales between 3.5 and 4.5 produce comparable outputs. This kind of concrete tuning advice is valuable for production deployments where you’re balancing quality against latency budgets.

The training infrastructure uses Hugging Face’s accelerate for multi-GPU setups. Configuration happens through YAML files that specify model hyperparameters and training file paths. For DPO training specifically, the data format requires four fields: chosen, reject, caption, and duration. This structure directly maps to the preference pair concept—each training example presents a preferred audio sample, a rejected alternative, and the shared caption that generated both. The simplicity of this format makes it straightforward to generate custom preference datasets for domain-specific fine-tuning.
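A single preference record in that four-field format might be assembled like this. The file paths are illustrative placeholders, and the exact on-disk layout (single JSON array vs. JSONL) should be checked against the example configs in the repo:

```python
import json

# One preference example in the four-field DPO format described above.
# Audio paths are hypothetical placeholders, not files from TangoFlux.
record = {
    "chosen": "audio/hammer_best.wav",   # preferred generation
    "reject": "audio/hammer_worst.wav",  # rejected alternative
    "caption": "Hammer slowly hitting the wooden table",
    "duration": 10,
}

# Serialize one record per line for a custom preference dataset.
line = json.dumps(record)
print(line)
```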

Gotcha

The non-commercial, research-only licensing is TangoFlux’s most significant limitation for practical deployment. According to the README, the model operates under UK data copyright exemption and the Stability AI Community License, which the LICENSE section indicates prohibits commercial use without separate licensing arrangements. If you’re building a commercial product—a game audio generator, podcast production tool, or advertising sound effects library—you cannot legally deploy TangoFlux without negotiating commercial terms. This isn’t a minor licensing footnote; it’s a fundamental barrier to production use outside academic and research contexts.

The 30-second generation cap is another hard constraint. While 30 seconds covers many sound effect and ambient audio use cases, it’s insufficient for full music tracks, extended soundscapes, or longer-form audio content. For applications requiring minute-long or longer audio, you’d need to implement chunking strategies or look at alternatives like Stable Audio Open, which the comparison table shows supports 47-second generation despite its slower inference speed.
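A naive chunking strategy stitches fixed-length clips with a linear crossfade. The sketch below operates on mono NumPy arrays (silence stands in for generated audio); note that simple crossfading cannot guarantee semantic continuity between independently generated chunks:

```python
import numpy as np

def crossfade_concat(chunks, sr=44100, fade_s=0.5):
    """Stitch fixed-length clips with a linear crossfade.

    chunks: list of mono float arrays at sample rate sr. Each joint
    overlaps the last fade_s seconds of one clip with the first
    fade_s seconds of the next.
    """
    n_fade = int(sr * fade_s)
    out = chunks[0]
    for nxt in chunks[1:]:
        fade_out = np.linspace(1.0, 0.0, n_fade)
        fade_in = 1.0 - fade_out
        overlap = out[-n_fade:] * fade_out + nxt[:n_fade] * fade_in
        out = np.concatenate([out[:-n_fade], overlap, nxt[n_fade:]])
    return out

# Two 30-second "clips" of silence stand in for generated audio.
sr = 44100
a = np.zeros(30 * sr)
b = np.zeros(30 * sr)
merged = crossfade_concat([a, b], sr=sr, fade_s=0.5)
# Result: 60 s minus the one 0.5 s overlap region.
assert len(merged) == 60 * sr - int(0.5 * sr)
```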

Compute requirements, while better than many alternatives, still demand modern GPU hardware. The README notes that inference times are observed on A40 GPUs, which are professional-grade accelerators. While the README doesn’t specify minimum GPU requirements, the model’s scale suggests substantial VRAM needs. Training requires multi-GPU setups with accelerate, putting it out of reach for hobbyist developers or small teams without access to cloud GPU resources or on-premise hardware.

Verdict

Use TangoFlux if you’re doing research in text-to-audio synthesis, building academic prototypes, or exploring interactive audio generation where speed is critical and commercial deployment isn’t an immediate concern. The approximately 7x speed improvement over AudioLDM 2 (comparing 24.8s vs 3.7s for generation, though for different durations) unlocks new interaction patterns—potentially enabling real-time sound effect generation in creative tools, rapid iteration in audio design workflows, or interactive installations where users expect immediate audio responses. The quality metrics demonstrate it’s not just fast but achieves competitive results, making it a strong choice for non-commercial projects where generation speed matters. Use it if you’re fine-tuning on domain-specific data, since the team provides training infrastructure and CRPO scripts for custom preference optimization.

Skip TangoFlux if you need commercial licensing for production deployment—the research-only restriction is non-negotiable without separate agreements. Skip it if your use case requires audio longer than 30 seconds and you can’t implement effective chunking strategies. Skip it if you’re working on CPU-only infrastructure or consumer-grade hardware; this model needs GPU acceleration. And skip it if you need broad ecosystem support—while there’s a ComfyUI integration (via third-party developer) and a Replicate deployment mentioned in the README, this is a relatively new project (849 stars) without the extensive tooling that more established models might offer. For commercial music generation specifically, you’ll need to explore alternatives with appropriate licensing. For longer-form general audio with commercial-friendly licensing, consider Stable Audio Open. But for fast, high-quality research and prototyping in the 1-30 second range under research-use terms, TangoFlux offers compelling performance.
