VALL-E is a neural codec language model using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. We also extend VALL-E and train a multi-lingual conditional codec language model. VALL-E X can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker’s voice, emotion, and acoustic environment.

网站域名:www.microsoft.com 更新日期:2024-05-28 网站简称:VALL-E 网站分类:AI配音(文字转语音) 人气指数:218