Researchers have introduced X-Voice, a compact 0.4B-parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model uses a two-stage training process and a unified International Phonetic Alphabet representation, and its resources are open-sourced. Separately, Mistral AI has released Voxtral TTS, a larger 4B-parameter model that combines autoregressive and flow-matching architectures to address the 'expressivity gap' in text-to-speech synthesis. Voxtral TTS generates natural, speaker-faithful speech in 9 languages from short audio prompts and demonstrates strong performance against existing systems.
AI IMPACT: New TTS models from academic and commercial labs are improving voice cloning fidelity and multilingual capabilities, potentially enhancing voice agents and audio content creation.
- ElevenLabs Flash v2.5
- Hugging Face
- International Phonetic Alphabet
- LEMAS-TTS
- Ministral 3B
- Mistral AI
- Qwen3-TTS
- Voxtral TTS
- Whisper
- X-Voice
AI-generated summary · Google Gemini · from 3 sources.