VALL-E is a neural codec language model that uses discrete codes derived from an off-the-shelf neural audio codec model and treats TTS as a conditional language modeling task rather than as continuous signal regression. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech using only a 3-second enrolled recording of an unseen speaker as a prompt.
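
To make "TTS as conditional language modeling" concrete, here is a minimal toy sketch of the idea, not VALL-E's actual implementation. The text (as phonemes) and the codec tokens of a 3-second speaker prompt form the conditioning context, and new codec tokens are sampled one at a time. Everything named here is an assumption for illustration: the codebook size, the frame rate, the EOS token, and toy_next_token_weights, which stands in for a trained model and returns pseudo-random weights only.

```python
import random

# All constants below are illustrative assumptions, not values from the source.
CODEBOOK_SIZE = 1024      # assumed number of discrete codec tokens
EOS = CODEBOOK_SIZE       # hypothetical end-of-sequence token id
FRAMES_PER_SECOND = 75    # assumed codec frame rate


def toy_next_token_weights(phonemes, context):
    """Hypothetical stand-in for a trained model.

    Returns unnormalized weights over all codec tokens plus EOS, derived
    deterministically from the text and the token context so far.
    """
    last = context[-1] if context else -1
    rng = random.Random(f"{'-'.join(phonemes)}|{len(context)}|{last}")
    weights = [rng.random() for _ in range(CODEBOOK_SIZE)]
    weights.append(CODEBOOK_SIZE * 0.02)  # weight for stopping (EOS)
    return weights


def synthesize_tokens(phonemes, prompt_tokens, max_new_tokens=200, seed=0):
    """Autoregressively sample codec tokens conditioned on text and prompt."""
    sampler = random.Random(seed)
    context = list(prompt_tokens)  # the speaker prompt acts as a prefix
    generated = []
    for _ in range(max_new_tokens):
        weights = toy_next_token_weights(phonemes, context)
        token = sampler.choices(range(CODEBOOK_SIZE + 1), weights=weights, k=1)[0]
        if token == EOS:
            break
        generated.append(token)
        context.append(token)
    return generated


# A fake 3-second prompt: in a real system these tokens would come from
# encoding the enrolled recording with the audio codec.
prompt_rng = random.Random(1)
prompt = [prompt_rng.randrange(CODEBOOK_SIZE) for _ in range(3 * FRAMES_PER_SECOND)]

tokens = synthesize_tokens(["HH", "AH", "L", "OW"], prompt)
print(f"prompt frames: {len(prompt)}, generated frames: {len(tokens)}")
```

In a real system the generated tokens would be passed to the codec's decoder to produce a waveform; that step is omitted here because it depends on the specific codec.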

Website domain: www.microsoft.com  Updated: 2024-05-28  Site short name: VALL-E  Category: AI voice cloning  Popularity index: 146