Mistral AI and X-Voice advance multilingual voice cloning with new architectures

By PulseAugur Editorial · [3 sources] · 2026-05-05 21:11

Researchers have introduced X-Voice, a compact 0.4B parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model utilizes a two-stage training process with a unified International Phonetic Alphabet representation and open-sourced resources. Separately, Mistral AI has released Voxtral TTS, a larger 4B parameter model that combines autoregressive and flow-matching architectures to address the 'expressivity gap' in text-to-speech synthesis. Voxtral TTS generates natural, speaker-faithful speech in 9 languages from short audio prompts and demonstrates strong performance against existing systems. AI

IMPACT New TTS models from academic and commercial labs are improving voice cloning fidelity and multilingual capabilities, potentially enhancing voice agents and audio content creation.

RANK_REASON The cluster contains two distinct research papers/releases detailing new text-to-speech models.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Mistral AI and X-Voice advance multilingual voice cloning with new architectures

COVERAGE [3]

arXiv cs.AI TIER_1 English(EN) · Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen · 2026-05-08 04:00

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

arXiv:2605.05611v1 Announce Type: cross Abstract: In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the Internat…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-07 02:57

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified represe…
MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-05 21:11

Closing the ‘Expressivity Gap’: How Mistral’s Voxtral TTS is Redefining Multilingual Voice Cloning with a Hybrid Autoregressive and Flow-Matching Architecture

<p>Voice AI has a dirty secret. Most text-to-speech systems sound fine — until they don’t. They can read a sentence. What they cannot do is mean it. The rhythm is off. The emotion is flat. The speaker sounds like themselves for two seconds, then drifts into generic syntheti…

COVERAGE [3]

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Closing the ‘Expressivity Gap’: How Mistral’s Voxtral TTS is Redefining Multilingual Voice Cloning with a Hybrid Autoregressive and Flow-Matching Architecture

RELATED ENTITIES

RELATED TOPICS