speech synthesis module 10

Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task: a sequence of audio codec codes is generated conditionally, given a preceding sequence of text and a speech prompt.
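To make the "language modelling" framing concrete, here is a minimal sketch of autoregressive generation over codec codes. It is not the VALL-E implementation: `toy_model` is a hypothetical stand-in for a trained network, and the codebook size, the `EOS` id and the single stream of codes are simplifying assumptions (the published model handles several codebooks per frame, which this sketch ignores).

```python
import random

CODEBOOK_SIZE = 1024   # assumed number of distinct codec codes
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id, one past the last code


def toy_model(context, rng):
    """Hypothetical stand-in for a trained network.

    Returns unnormalised weights over every codec code plus EOS,
    given the whole context (text + prompt codes + codes so far).
    """
    weights = [rng.random() for _ in range(CODEBOOK_SIZE)]
    # Make EOS more likely as the context grows, so generation stops.
    weights.append(0.05 * len(context))
    return weights


def generate_codes(text_tokens, prompt_codes, max_len=50, seed=0):
    """Sample codec codes one at a time, each conditioned on all that precede it."""
    rng = random.Random(seed)
    generated = []
    for _ in range(max_len):
        # The conditioning context: text first, then the speech prompt's codes,
        # then whatever has been generated so far.
        context = list(text_tokens) + list(prompt_codes) + generated
        weights = toy_model(context, rng)
        code = rng.choices(range(CODEBOOK_SIZE + 1), weights=weights, k=1)[0]
        if code == EOS:
            break
        generated.append(code)
    return generated


codes = generate_codes(text_tokens=[1, 2, 3], prompt_codes=[10, 20])
print(len(codes), codes[:5])
```

In a real system the generated codes would then be passed to the codec's decoder to produce a waveform; that step is omitted here.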

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec

There are several similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first, and this journal paper gives the most complete description.
