This paper introduces the VALL-E model, which frames speech synthesis as a language-modelling task: a sequence of audio codec codes is generated conditionally, given a preceding text sequence and a short speech prompt from the target speaker (sketched below).
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei. "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv preprint arXiv:2301.02111, 2023. DOI: 10.48550/arXiv.2301.02111
Note: this is a preprint, meaning it has not been peer reviewed and may therefore contain errors. As of this writing, the authors have not submitted it for formal peer review and publication.
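To make the framing concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a single causal Transformer treats the concatenation [text tokens ; codec tokens] as one sequence and predicts each codec token from everything before it, so the speech prompt is simply the leading codec tokens. All names and sizes here (CodecLM, TEXT_VOCAB, CODEC_VOCAB, D_MODEL) are illustrative assumptions; VALL-E itself pairs an autoregressive model for the first EnCodec quantizer with a non-autoregressive model for the remaining quantizers, which this sketch omits.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 256    # assumed phoneme/text vocabulary size (illustrative)
CODEC_VOCAB = 1024  # assumed codec codebook size (EnCodec uses 1024 entries per codebook)
D_MODEL = 256       # illustrative model width

class CodecLM(nn.Module):
    """Decoder-only LM over the sequence [text ; prompt codes ; target codes]."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.code_emb = nn.Embedding(CODEC_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, CODEC_VOCAB)

    def forward(self, text_ids, code_ids):
        # Embed text and codec tokens into one sequence; a causal mask lets
        # each position attend only to earlier positions, so codec tokens see
        # the text prefix and all preceding codec tokens (prompt included).
        x = torch.cat([self.text_emb(text_ids), self.code_emb(code_ids)], dim=1)
        length = x.size(1)
        mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=mask)
        # Logits at the codec positions; in training these would be shifted
        # to predict the following codec token.
        return self.head(h[:, text_ids.size(1):, :])

model = CodecLM()
text = torch.randint(0, TEXT_VOCAB, (1, 12))    # fake phoneme sequence
codes = torch.randint(0, CODEC_VOCAB, (1, 30))  # prompt + target codec tokens
logits = model(text, codes)
print(logits.shape)  # torch.Size([1, 30, 1024])
```

Framing synthesis this way is what lets the paper inherit standard language-model machinery, prompting, sampling, and in-context learning over discrete tokens, rather than regressing continuous acoustic features as conventional TTS pipelines do.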