Building a Multilingual Audiobook Pipeline with ebook2audiobook: Voice Cloning, 1,158 Languages, and Consumer Hardware
Hook
An audiobook generator that runs on 2GB of RAM, clones voices, and supports more languages than Google Translate? It sounds impossible, but ebook2audiobook’s 18,500+ GitHub stars suggest developers have found something genuinely useful.
Context
The audiobook market has exploded, but production remains expensive and English-centric. Professional narration costs $200-400 per finished hour, and most TTS services support fewer than 50 languages. For researchers working with non-English texts, indie authors self-publishing multilingual content, or accessibility advocates converting academic materials, commercial solutions fall short. Enter ebook2audiobook: a Python pipeline that transforms eBooks into narrated audiobooks using state-of-the-art neural TTS engines, with voice cloning capabilities and support for 1,158 languages. The project targets a specific gap—personal, legal eBook conversion with hardware constraints typical of consumer laptops and workstations. Unlike cloud TTS services that charge per character, this runs entirely on-premise. Unlike professional narration software that assumes enterprise budgets, it requires as little as 1GB VRAM (minimum requirement, 4GB recommended).
Technical Insight
At its core, ebook2audiobook implements a multi-stage pipeline: document ingestion and normalization, TTS synthesis with optional voice cloning, and audio post-processing with metadata injection. The architecture is intentionally modular, supporting eight different TTS engines (XTTSv2, Bark, Fairseq, VITS, Tacotron2, Tortoise, GlowTTS, YourTTS) that can be swapped based on quality-speed-resource tradeoffs.
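The swappable-engine design can be sketched as a registry keyed by the engine name passed on the command line. The names below (`EchoEngine`, `narrate`, `REGISTRY`) are hypothetical illustrations, not the project's actual API:

```python
from typing import Protocol

class TTSEngine(Protocol):
    """Interface each backend implements; real ones would wrap XTTSv2, Bark, VITS, etc."""
    def synthesize(self, text: str) -> bytes: ...

class EchoEngine:
    """Stand-in backend so the sketch runs without model weights."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # placeholder for real audio samples

# Engines register under the names the --engine flag would select.
REGISTRY: dict[str, TTSEngine] = {"echo": EchoEngine()}

def narrate(text: str, engine: str = "echo") -> bytes:
    """Dispatch synthesis to whichever registered engine was selected."""
    return REGISTRY[engine].synthesize(text)
```

Swapping engines then means registering a different backend, which is how a pipeline can trade quality for speed without touching the rest of the code.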
The ingestion stage converts multiple eBook formats—including .epub, .mobi, .pdf, .txt, .html, .rtf, and even image-based formats like .tiff, .png, and .jpg—into processable text. For scanned documents, OCR preprocessing extracts text from images. The system preserves chapter boundaries and paragraph structure during this conversion. A notable feature is support for SML (Speech Markup Language) tags embedded directly in the source text, giving authors granular control over narration:
[break] — silence (random range 0.3–0.6 sec.)
[pause] — silence (random range 1.0–1.6 sec.)
[pause:N] — fixed pause (N sec.)
[voice:/path/to/voice/file]...[/voice] — switch voice
These tags control pause durations and voice switching for multi-character dialogue—features typically reserved for professional audiobook production software.
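A minimal parser for the pause tags could look like this. It is a sketch of the semantics documented above, not the project's actual implementation:

```python
import random
import re

def pause_seconds(tag: str) -> float:
    """Map one markup tag to a silence duration in seconds."""
    if tag == "[break]":
        return random.uniform(0.3, 0.6)   # short random pause
    if tag == "[pause]":
        return random.uniform(1.0, 1.6)   # longer random pause
    m = re.fullmatch(r"\[pause:(\d+(?:\.\d+)?)\]", tag)
    if m:
        return float(m.group(1))          # fixed N-second pause
    raise ValueError(f"unrecognized tag: {tag!r}")
```

A full ingester would scan the chapter text for these tags and insert the corresponding silence between synthesized segments.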
The TTS synthesis stage is where hardware acceleration matters. The default XTTSv2 engine uses a transformer-based architecture built for zero-shot voice cloning: users supply a reference audio clip, and the model conditions its prosody and timbre on that sample rather than retraining. The implementation supports CUDA, ROCm (AMD GPUs), MPS (Apple Silicon), and Intel XPU acceleration:
# CLI usage example
python ebook2audiobook.py \
    --ebook "my_book.epub" \
    --voice "reference_voice.wav" \
    --engine "xtts" \
    --language "en" \
    --device "cuda"  # or "mps", "rocm", "cpu"
CPU-only mode exists, but the README warns that modern TTS engines are ‘very slow on CPU’ and recommends lower-quality engines such as YourTTS or Tacotron2 for CPU-only environments. For longer books, GPU access is practically mandatory.
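That advice can be encoded as a small fallback rule. The availability flags would come from calls like `torch.cuda.is_available()` or `torch.backends.mps.is_available()`; the device-to-engine pairing below is illustrative, not the project's own logic:

```python
def choose_runtime(cuda: bool, mps: bool) -> tuple[str, str]:
    """Pick a --device flag and a sensible default --engine for it."""
    if cuda:
        return "cuda", "xtts"     # full-quality cloning on NVIDIA GPUs
    if mps:
        return "mps", "xtts"      # Apple Silicon acceleration
    return "cpu", "yourtts"       # lighter engine for CPU-only machines
```

This makes the quality-speed tradeoff explicit: the engine choice follows from the hardware rather than being an afterthought.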
The final stage combines audio segments into audiobook format with embedded chapter markers and metadata. The tool supports nine output formats including .m4b (the de facto audiobook standard), lossless .flac, and web-friendly .aac. Chapter timestamps appear to be preserved from the original eBook structure, enabling players to navigate by section.
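A common way to embed chapter markers when muxing .m4b is ffmpeg's FFMETADATA format. The helper below builds such a file from per-chapter durations; it sketches the general technique, not necessarily how ebook2audiobook performs this step:

```python
def ffmetadata_chapters(chapters: list[tuple[str, float]]) -> str:
    """Build an ffmpeg FFMETADATA file with chapter markers.

    `chapters` is a list of (title, duration_in_seconds) pairs. The
    resulting text can be fed to ffmpeg as a second input and mapped
    with -map_metadata when muxing the concatenated audio into .m4b.
    """
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, duration in chapters:
        end_ms = start_ms + int(duration * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",   # timestamps expressed in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={title}",
        ]
        start_ms = end_ms        # chapters are contiguous
    return "\n".join(lines) + "\n"
```

Players that understand .m4b chapters (Apple Books, most audiobook apps) read these markers to enable per-section navigation.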
Deployment options reflect modern DevOps practices: native installation via pip, Docker containers (with GPU passthrough support), and cloud notebooks for Google Colab and Kaggle. The Gradio web interface provides a GUI entry point, while headless CLI mode enables batch processing. The Docker image handles the complexity of CUDA driver compatibility, ROCm installation, and dependencies.
One architectural feature worth noting: support for custom fine-tuned XTTSv2 models. Users can upload a ZIP file containing model weights, enabling specialized voices, and the project mentions fine-tuned preset models for common use cases. Loading a custom model appears to work like this:
python ebook2audiobook.py \
    --ebook "novel.epub" \
    --custom_model "model.zip" \
    --engine "xtts"
This approach enables high-quality narration styles without requiring users to understand transformer training.
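Before pointing `--custom_model` at an archive, it is worth checking that it contains the files an XTTS fine-tune typically ships. The required-file list below is an assumption about typical XTTS layouts, not the project's documented contract:

```python
import zipfile

# Hypothetical required files -- XTTS fine-tunes commonly ship these,
# but the exact layout ebook2audiobook expects may differ.
REQUIRED = {"config.json", "model.pth", "vocab.json"}

def missing_model_files(path: str) -> set[str]:
    """Return the set of required files absent from a custom-model ZIP."""
    with zipfile.ZipFile(path) as zf:
        # Compare on basenames so files nested in a folder still count.
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return REQUIRED - names
```

Catching a malformed archive up front is cheaper than discovering the problem partway through a multi-hour synthesis run.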
Gotcha
The README’s prominent disclaimer—‘This tool is intended for use with non-DRM, legally acquired eBooks only’—reveals the primary limitation: DRM-protected content from Kindle, Kobo, or Apple Books won’t work. You’ll need to work exclusively with DRM-free sources like Project Gutenberg or personal manuscripts.
EPUB structure inconsistencies cause headaches. Many eBooks embed tables of contents, copyright pages, and publisher metadata. The tool doesn’t appear to intelligently filter these, so you may hear narration of front matter and copyright notices unless you manually preprocess the file. The README notes that ‘EPUB format lacks any standard structure like what is a chapter, paragraph, preface etc.’ and recommends ‘you should first remove manually any text you don’t want to be converted in audio.’ This adds significant manual work per book.
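Since the tool leaves this filtering to you, a small preprocessing pass over chapter titles (e.g. extracted with a library like ebooklib) can save repeated manual edits. The keyword list here is illustrative, not exhaustive:

```python
# Titles that usually indicate front/back matter rather than narration-worthy text.
FRONT_MATTER = {
    "title page", "copyright", "contents", "table of contents",
    "dedication", "acknowledgments", "about the author",
}

def keep_chapter(title: str) -> bool:
    """Heuristic: skip sections whose title matches known front matter."""
    return title.strip().lower() not in FRONT_MATTER

chapters = ["Title Page", "Copyright", "Chapter 1", "Chapter 2", "About the Author"]
narrate = [c for c in chapters if keep_chapter(c)]
# narrate -> ["Chapter 1", "Chapter 2"]
```

A heuristic like this will miss unusually titled sections, so a quick manual review of the filtered list is still prudent.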
Voice cloning quality appears inconsistent: the README acknowledges that results depend heavily on the quality of the reference sample and on the TTS engine selected, and the system appears to provide little validation of reference audio. Background noise, compression artifacts, or non-neutral speech in reference clips likely produce suboptimal output, and the README offers no detailed diagnostic guidance.
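Given that lack of feedback, a quick sanity check on the reference clip before a long run can help. The thresholds below are illustrative (XTTS-style cloning generally wants several seconds of clean mono speech), and using only the standard-library `wave` module means this covers .wav references only:

```python
import wave

def check_reference(path: str) -> list[str]:
    """Flag common problems in a voice-cloning reference clip (.wav only)."""
    problems = []
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
        if duration < 6.0:
            problems.append(f"clip is short ({duration:.1f}s); aim for 6s or more")
        if wf.getnchannels() != 1:
            problems.append("stereo clip; mono usually clones better")
        if wf.getframerate() < 22050:
            problems.append("low sample rate; 22.05 kHz or higher recommended")
    return problems
```

This will not catch background noise or compression artifacts, but it rules out the cheap-to-detect failure modes before synthesis starts.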
CPU performance is explicitly noted as a limitation. The README states that ‘Modern TTS engines are very slow on CPU, so use lower quality TTS like YourTTS, Tacotron2 etc.’ This presents a stark tradeoff—you either have GPU access or must accept lower-quality TTS output. There’s no middle ground between quality and speed on CPU.
The README includes an important note: ‘Before to post an install or bug issue search carefully to the opened and closed issues TAB to be sure your issue does not exist already.’ This suggests setup and compatibility issues may be common enough to warrant checking existing issues first.
Verdict
Use ebook2audiobook if you’re converting personal, DRM-free eBooks to audiobooks and have access to a GPU. It’s particularly valuable for non-English content—the 1,158-language support via Fairseq MMS is unmatched in open-source TTS. Researchers working with multilingual corpora, accessibility advocates converting academic papers, or indie authors producing audiobook editions of their own work will find this valuable. The voice cloning capability enables applications like generating audiobooks with specific narrator styles or creating character-distinct voices for dialogue-heavy fiction. If you’re comfortable with Docker and meet the hardware requirements (minimum 2GB RAM, 1GB VRAM; recommended 8GB RAM, 4GB VRAM), setup appears straightforward.
Skip it if you need professional-grade audiobook production with guaranteed consistent narrator performance—voice cloning reliability depends significantly on input quality. Avoid if you’re working exclusively with DRM-protected content from major retailers, as this tool explicitly does not support DRM removal. If you lack GPU access and won’t accept the quality limitations of CPU-optimized TTS engines (YourTTS, Tacotron2), cloud services like Amazon Polly or Google Cloud TTS may be more practical alternatives. Finally, be prepared for manual EPUB preprocessing to remove unwanted front matter and metadata before conversion.