
New TTS Model Learns Voices From Just 3 Seconds Of Audio

Based on research by Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun

Imagine cloning a voice from just three seconds of recorded audio, rather than the hours of sample data many systems demand. That is the promise of Voxtral TTS, a new system from researchers who have engineered a model capable of generating natural-sounding speech in multiple languages from minimal input.

The model relies on a hybrid architecture, combining an auto-regressive model for semantic content with flow-matching for acoustic texture. At its core lies the Voxtral Codec, a speech tokenizer trained with a combination of vector quantization (VQ) and finite scalar quantization (FSQ). This design allows the system to capture fine-grained acoustic details that older tokenizers often miss.
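To give a feel for the finite scalar quantization half of that combination, here is a minimal sketch (not the authors' implementation, and the level counts are illustrative): FSQ bounds each latent dimension, rounds it to one of a small fixed number of levels, and rescales, so the "codebook" is implicit in the rounding grid rather than learned as in VQ.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize a latent vector with finite scalar quantization.

    z: latent values, assumed already bounded to [-1, 1] (e.g. via tanh).
    levels: number of quantization levels per dimension.
    """
    half = (np.asarray(levels) - 1) / 2.0
    # Map [-1, 1] to [0, levels-1], round to the nearest grid point,
    # then map back to [-1, 1].
    q = np.clip(np.round((z + 1.0) * half), 0, np.asarray(levels) - 1)
    return q / half - 1.0

z = np.array([0.37, -0.82, 0.05])
print(fsq_quantize(z, levels=[5, 5, 5]))  # -> [ 0.5 -1.   0. ]
```

In training, the non-differentiable rounding step is typically bypassed with a straight-through estimator; the sketch above shows only the forward quantization.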

In blind listening tests with native speakers, Voxtral TTS won a decisive 68.4% of votes on naturalness and expressiveness against established competitors such as ElevenLabs Flash v2.5. The result suggests that high-quality voice cloning no longer requires extensive per-speaker data, opening personalized audio generation to far wider applications. The researchers have released the model weights under a CC BY-NC license to encourage further exploration in this rapidly evolving field.

Source: Voxtral TTS by Alexander H. Liu et al., https://arxiv.org/abs/2603.25551


This post was generated by staik AI based on the academic publication above.