Text-to-speech technology has advanced dramatically in recent years, but most systems still face limitations when cloning voices without extensive training data.
The speaker encoder is jointly trained with the autoregressive model, unlike systems that use pre-trained speaker verification models. This joint training allows the encoder to better capture the specific characteristics needed for high-quality speech synthesis.
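The idea can be sketched in a few lines: a speaker encoder pools untranscribed reference audio into a fixed embedding that conditions each autoregressive decoding step. All names, shapes, and layers below are illustrative assumptions, not MiniMax-Speech's actual architecture; the point is the wiring that lets the synthesis loss backpropagate into the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustration only, not MiniMax-Speech's real config.
N_MEL, D_EMB, D_MODEL, N_TOKENS = 80, 256, 512, 1024

class SpeakerEncoder:
    """Maps untranscribed reference audio (mel frames) to one fixed vector."""
    def __init__(self):
        self.proj = rng.standard_normal((N_MEL, D_EMB)) * 0.01

    def __call__(self, mel):            # mel: (frames, N_MEL)
        frames = np.tanh(mel @ self.proj)
        return frames.mean(axis=0)      # (D_EMB,) -- temporal mean pooling

class ARDecoder:
    """Autoregressive token predictor conditioned on the speaker embedding."""
    def __init__(self):
        self.cond = rng.standard_normal((D_EMB, D_MODEL)) * 0.01
        self.head = rng.standard_normal((D_MODEL, N_TOKENS)) * 0.01

    def step(self, state, spk_emb):     # state: (D_MODEL,) hidden state
        h = np.tanh(state + spk_emb @ self.cond)   # inject speaker identity
        return h @ self.head                       # logits over audio tokens

encoder, decoder = SpeakerEncoder(), ARDecoder()
ref_mel = rng.standard_normal((200, N_MEL))        # untranscribed reference clip
spk = encoder(ref_mel)                             # no transcript required
logits = decoder.step(rng.standard_normal(D_MODEL), spk)

# "Jointly trained" means the synthesis loss would backpropagate into
# encoder.proj as well as the decoder weights, so the embedding is shaped
# for TTS quality rather than for a separate speaker-verification objective.
print(spk.shape, logits.shape)
```

Contrast this with a frozen, pre-trained verification encoder: there the embedding is fixed before TTS training starts, so it cannot adapt to capture the timbre and prosody cues the decoder actually needs.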
This approach offers several advantages:
- No transcription needed for reference audio
- Cross-lingual synthesis capabilities
- More natural prosody, as the model isn't constrained by prompt examples
- Flexible voice cloning across 32 languages
While MiniMax-Speech supports both zero-shot and one-shot cloning, its zero-shot capabilities are what truly distinguish it from other systems like VALL-E, CosyVoice 2, and Seed-TTS, which require paired text-audio samples for speaker conditioning.
Enhancing Audio Quality with Flow-VAE
The second major innovation in MiniMax-Speech is its Flow-VAE architecture, which significantly improves audio quality.