Mangio-RVC-Fork: When Voice Conversion Meets Ensemble Pitch Detection
Hook
What if the best way to detect pitch in voice conversion isn't choosing the 'best' algorithm, but asking multiple algorithms to vote? Mangio-RVC-Fork bets everything on this ensemble approach.
Context
Voice conversion technology—the ability to make one person's voice sound like another—hinges on accurate pitch detection. The original Retrieval-based Voice Conversion (RVC) project offered a solid foundation with traditional methods like PyWorld's DIO algorithm. But pitch detection is notoriously difficult: different algorithms excel in different scenarios. DIO handles breathy voices well but struggles with vocal fry. CREPE, a deep learning model, shines with clean vocals but chokes on noisy recordings. RMVPE offers robustness but at computational cost.
Most voice conversion tools force you to pick one algorithm and hope for the best. Mangio-RVC-Fork emerged as an experimental playground asking a different question: what if we could combine the strengths of multiple pitch detectors? Built by a solo developer as a research fork of RVC-WebUI, it introduced a hybrid f0 estimation method that applies nanmedian (a median computed over the non-NaN values only) to the outputs of several algorithms at once. The project also pushed boundaries by attempting CREPE pitch extraction during training as well as inference, and by adding formant shifting for deeper timbre manipulation. It's explicitly unstable, development has stalled, and the creator warns against treating it as 'better' than the original. Yet its architectural experiments reveal fascinating insights about ensemble methods in audio ML.
Technical Insight
The core innovation in Mangio-RVC-Fork lies in its hybrid f0 extraction pipeline. Traditional voice conversion uses a single pitch detector, but this fork runs several algorithms over the same audio and combines their per-frame outputs using NumPy's nanmedian function. The architecture looks roughly like this:
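# Simplified hybrid f0 extraction concept
import numpy as np

# Hypothetical wrappers; each returns one f0 value (Hz) per frame
from pitch_extractors import dio_extract, crepe_extract, rmvpe_extract


def hybrid_f0_extraction(audio, sr, hop_length=160):
    # Run multiple pitch detectors on the same audio
    f0_dio = dio_extract(audio, sr)
    f0_crepe = crepe_extract(audio, sr, hop_length=hop_length)
    f0_rmvpe = rmvpe_extract(audio, sr)

    # Stack results (shape: [n_methods, n_frames]); all tracks must
    # already share the same frame count, otherwise np.vstack raises
    f0_stack = np.vstack([f0_dio, f0_crepe, f0_rmvpe]).astype(float)

    # Many extractors report unvoiced frames as 0 Hz; mark them as NaN
    f0_stack[f0_stack <= 0] = np.nan

    # Median over the detectors that reported a pitch for each frame
    f0_hybrid = np.nanmedian(f0_stack, axis=0)
    return f0_hybrid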
This ensemble approach provides robustness against individual algorithm failures on voiced frames. If DIO fails on a creaky vocal segment while CREPE and RMVPE agree, the median favors their detection, and a single octave error is outvoted the same way. Silence is a weaker case: nanmedian discards NaN entries, so if CREPE hallucinates a pitch while DIO and RMVPE correctly output NaN, CREPE's value is the only one left and passes through unless a separate voicing check masks the frame. The tradeoff is computational cost, since you're running three or more pitch detectors instead of one, but for offline processing the quality improvements can justify the overhead.
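A small synthetic example makes the behavior of nanmedian concrete. This is a minimal sketch, not the fork's code: the function name combine_f0, the min_votes knob, and the toy f0 tracks are all assumptions introduced for illustration. It shows an octave error being outvoted, and a lone detection surviving unless a voicing vote is required.

```python
import numpy as np


def combine_f0(tracks, min_votes=1):
    """Combine equal-length f0 tracks (Hz) with a NaN-aware median.

    A value of 0 or NaN marks an unvoiced frame. min_votes is a
    hypothetical knob: frames where fewer detectors report a pitch
    are returned as NaN (unvoiced).
    """
    stack = np.vstack([np.asarray(t, dtype=float) for t in tracks])
    stack[stack <= 0] = np.nan                 # 0 Hz means "no pitch"
    votes = np.sum(~np.isnan(stack), axis=0)   # detectors voiced per frame

    combined = np.full(stack.shape[1], np.nan)
    voiced = votes >= max(min_votes, 1)
    # Only take the median where at least one detector reported a pitch,
    # which avoids NumPy's all-NaN slice warning
    combined[voiced] = np.nanmedian(stack[:, voiced], axis=0)
    return combined


# Toy tracks: frame 2 has an octave error, frame 3 a lone detection
dio = [220.0, 0.0, 110.0, 0.0]
crepe = [221.0, 0.0, 219.0, 180.0]
rmvpe = [219.0, 0.0, 220.0, 0.0]

plain = combine_f0([dio, crepe, rmvpe])               # lone value survives
voted = combine_f0([dio, crepe, rmvpe], min_votes=2)  # lone value masked
```

With min_votes=1 the last frame keeps the single reported pitch, because the NaN entries are ignored. Requiring two votes turns that frame unvoiced while leaving the outvoted octave error in frame 2 corrected.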
The fork also exposes granular control over CREPE variants. You can choose between 'mangio-crepe' (a customized implementation), 'torchcrepe-tiny' (faster, less accurate), or full CREPE models. The crepe_hop_length parameter, which counts audio samples between successive pitch estimates (as torchcrepe's hop_length does), lets you balance temporal resolution against speed. Lower hop lengths (around 80-100 samples) capture rapid pitch changes in agile vocals but increase processing time; higher values (200 samples and up) work for spoken word where pitch evolves slowly.
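The resolution and cost sides of that tradeoff are simple arithmetic. The sketch below assumes the hop length is counted in samples and that the extractor pads the signal so it emits one frame per hop plus one; both helper names are hypothetical.

```python
def frame_period_ms(hop_length, sample_rate):
    """Time between successive pitch estimates, in milliseconds."""
    return 1000.0 * hop_length / sample_rate


def n_frames(n_samples, hop_length):
    """Frame count under a padded, one-frame-per-hop convention (assumed)."""
    return 1 + n_samples // hop_length


# One second of 16 kHz audio at two hop lengths
fine = (frame_period_ms(80, 16000), n_frames(16000, 80))
coarse = (frame_period_ms(160, 16000), n_frames(16000, 160))
```

Halving the hop length halves the spacing between estimates and roughly doubles the number of frames the model has to evaluate.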
Formant shifting represents another architectural layer. While pitch conversion changes how high or low the voice sounds, formants define the unique resonances that make a voice sound 'chesty' versus 'nasal'. Mangio-RVC-Fork integrates StftPitchShift to manipulate formants independently:
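# Formant shifting during inference
from stftpitchshift import StftPitchShift


def convert_with_formant_shift(audio, pitch_shift_semitones, formant_shift_semitones):
    # Positional arguments: frame size, hop size, sample rate
    shifter = StftPitchShift(2048, 512, 44100)

    # The library works with frequency ratios, so convert from semitones
    pitch_factor = 2.0 ** (pitch_shift_semitones / 12.0)
    formant_factor = 2.0 ** (formant_shift_semitones / 12.0)

    # Keyword names below follow the stftpitchshift package as best
    # understood here; verify them against the installed version
    shifted = shifter.shiftpitch(
        audio,
        factors=pitch_factor,
        quefrency=1e-3,             # seconds; enables formant (envelope) handling
        distortion=formant_factor,  # scales the spectral envelope
    )
    return shifted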
This separation matters for gender conversion or age manipulation. Making a male voice sound female isn't just raising pitch—you need to shift formants upward to match female vocal tract resonances. The fork exposes these controls in its Gradio UI, though documentation remains sparse.
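The semitone controls map to frequency multipliers through the usual equal-temperament relation, ratio = 2^(semitones/12). The sketch below applies that relation to pitch and formants separately; the helper names and the example frequencies are illustrative assumptions, not measurements or the fork's own code.

```python
def semitones_to_ratio(semitones):
    """Convert a shift in semitones to a frequency multiplier."""
    return 2.0 ** (semitones / 12.0)


def shift_voice(f0_hz, formants_hz, pitch_semitones, formant_semitones):
    """Scale the pitch and the formant frequencies independently."""
    new_f0 = f0_hz * semitones_to_ratio(pitch_semitones)
    ratio = semitones_to_ratio(formant_semitones)
    new_formants = [f * ratio for f in formants_hz]
    return new_f0, new_formants


# Illustrative values only: raise pitch one octave, leave formants alone
f0, formants = shift_voice(110.0, [500.0, 1500.0],
                           pitch_semitones=12, formant_semitones=0)
```

Here the pitch doubles while the resonances stay where they were, which is the pitch-only case the paragraph above describes as insufficient; a nonzero formant_semitones moves the resonances as well.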
The experimental CREPE training feature attempted something ambitious: using CREPE's higher-quality pitch extraction during model training, not just inference. Traditional RVC training uses faster methods like PM (Parselmouth) to generate pitch targets. The hypothesis was that feeding the model more accurate pitch ground truth during training would yield better-quality conversions. However, this feature suffered from memory leaks and only functioned on Paperspace GPU instances, never achieving stability on local Windows or Mac systems. The implementation likely involved CREPE running in the data preprocessing pipeline, generating f0 labels that the discriminator and generator would use during adversarial training—but the leaked GPU memory made it impractical for extended training runs.
The fork maintains RVC's retrieval-based architecture where the model fetches k-nearest-neighbor features from the training dataset during inference to reduce 'tone leakage' (when source speaker characteristics bleed through). The addition of multiple f0 methods layers on top of this retrieval mechanism, giving users experimental knobs to turn while the core VITS-based architecture remains intact.
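The retrieval step can be pictured with a brute-force NumPy sketch. It assumes Euclidean distance, inverse-distance weighting, and a blend factor called index_rate here; the function name, the parameters k and index_rate, the weighting scheme, and the toy feature bank are illustrative assumptions. A real system would normally query an approximate nearest-neighbor index instead of building a full distance matrix.

```python
import numpy as np


def retrieve_and_blend(query, bank, k=4, index_rate=0.75):
    """Blend each query frame with its k nearest training features.

    query: [n_frames, dim] features from the source audio
    bank:  [n_bank, dim] features stored from the training set
    index_rate: 0 keeps the query untouched, 1 uses only retrieved features
    """
    # Squared Euclidean distance between every query frame and bank entry
    d2 = ((query[:, None, :] - bank[None, :, :]) ** 2).sum(axis=2)

    # Indices and distances of the k closest bank entries per frame
    idx = np.argsort(d2, axis=1)[:, :k]
    nearest_d2 = np.take_along_axis(d2, idx, axis=1)

    # Closer neighbors get larger weights; epsilon guards exact matches
    weights = 1.0 / (nearest_d2 + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)

    retrieved = (bank[idx] * weights[:, :, None]).sum(axis=1)
    return index_rate * retrieved + (1.0 - index_rate) * query


# Toy bank of three features and one query frame near the first of them
bank = np.eye(3)
query = np.array([[0.9, 0.1, 0.0]])
pulled = retrieve_and_blend(query, bank, k=1, index_rate=1.0)
```

Pushing index_rate toward 1 pulls the output toward features the target speaker actually produced, which is the mechanism credited with reducing tone leakage.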
Gotcha
The developer explicitly labels this project as experimental and unstable, and that warning should be taken seriously. The repository README states features 'have been post-poned' and development appears stalled. This is a one-person research experiment, not a community-maintained project. If you're building production voice conversion systems or need support when things break, this fork will frustrate you.
The CREPE training feature—one of the headline capabilities—has known memory leaks and only works on Paperspace cloud instances. Local users on Windows or Mac cannot use it at all. Even the hybrid f0 methods, while conceptually sound, lack rigorous benchmarking or published comparisons showing they actually outperform single-algorithm approaches. You're essentially beta testing architectural ideas. The fork also dropped Google Colab support that exists in the original RVC-WebUI, making it less accessible for users without local GPU resources or Paperspace subscriptions. For beginners or anyone needing stability, the original RVC-WebUI repository remains the better choice with more active maintenance, broader platform support, and a larger community troubleshooting edge cases.
Verdict
Use if: You're a researcher or advanced practitioner experimenting with ensemble pitch detection methods, need formant shifting capabilities beyond basic pitch conversion, have Paperspace infrastructure and want to test CREPE training (despite instability), or require the CLI for batch processing voice conversion jobs where the Gradio UI becomes limiting. The hybrid nanmedian f0 approach is genuinely novel and worth exploring if you're pushing voice conversion quality boundaries and can tolerate instability.
Skip if: You need production-ready software with active maintenance, are new to voice conversion and want community support and tutorials, require Google Colab compatibility, or expect your tools to work reliably across Windows/Mac/Linux without platform-specific gotchas.
For most users, the original RVC-WebUI offers better stability, documentation, and community support. This fork is a research playground, not a polished product.