Readings

For the class on 2024-03-26, you need to reread the FastPitch paper from last week, and also read the SoundStream paper. Bring copies of both papers with you to class.

For the class on 2024-04-02, read the paper “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” (which is about the VALL-E model).

FastPitch

Focus on:

  • the architecture in Figure 1
  • the pitch predictor
    • what does it predict? how is that used by the rest of the model? how is it trained?
  • the duration predictor
    • the same questions: what does it predict? how is that used by the rest of the model? how is it trained?
  • how the model is trained as a whole
  • how inference is performed

You don’t need to fully understand:

  • any details of how a Transformer works
  • how the WaveGlow vocoder works: just assume it can generate a waveform from a mel spectrogram
  • the details of the evaluation

SoundStream

Focus on:

  • the big idea, expressed in Figure 2
    • and some understanding of the architecture in Figure 3
  • the introduction in Section I
  • the core idea of converting a waveform into a sequence of symbols using Vector Quantisation
    • and some understanding of the more advanced idea of Residual Vector Quantisation in Section III.C
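
To make the two quantisation ideas above concrete, here is a minimal sketch in plain Python. It is a toy illustration, not SoundStream's implementation: the function names and the tiny hand-written codebooks are assumptions made for this example, whereas in the paper the codebooks are learned and the vectors being quantised come from the encoder network. Plain Vector Quantisation replaces a vector with the index of its nearest codebook entry; Residual Vector Quantisation repeats this with a stack of codebooks, each one coding the error left by the previous stage.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantise vec with a stack of codebooks.

    Each stage quantises the residual (the error) left by the previous stage,
    so the result is one symbol per codebook, ordered from coarse to fine.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (hypothetical values): a coarse stage and a finer stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes = rvq_encode(toy_codebooks, [1.2, 0.8])
print(codes)                             # [1, 1]
print(rvq_decode(toy_codebooks, codes))  # [1.25, 0.75]
```

Using only the first codebook is ordinary Vector Quantisation and reconstructs [1.0, 1.0]; adding the second stage brings the reconstruction closer to the input, which is the sense in which later codebooks carry finer detail.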

You don’t need to fully understand:

  • traditional audio codecs in Section II
  • denoising
  • training the model using a discriminator

VALL-E

Focus on:

  • the architecture in Figure 1
  • the general concept of framing speech generation as language modelling

You don’t need to understand:

  • Section 3 – just assume the audio codec is the same as SoundStream (although VALL-E uses EnCodec)
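  • Figure 3, which shows how the audio codec codes are generated in a specific order (first, all the coarsest ones are generated one at a time using an autoregressive model; then the remaining, finer ones are generated by a non-autoregressive model, one codebook level at a time, with all the codes in a level produced at once)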
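
For anyone curious about the generation order anyway, the following is a rough sketch of the control flow only. Everything in it is an assumption made for illustration: the two "model" functions are hypothetical stand-ins that return random codes, the number of frames is fixed in advance rather than decided by the model, and the constants reflect the commonly described setup of 8 codebooks with 1024 entries each. The point is simply that the coarsest codes are produced frame by frame, while each finer level is produced for all frames in one pass.

```python
import random

NUM_LEVELS = 8        # assumed number of residual codebook levels
CODEBOOK_SIZE = 1024  # assumed number of entries per codebook


def ar_next_code(text, prompt, codes_so_far, rng):
    """Hypothetical stand-in for the autoregressive model: one coarsest-level code."""
    return rng.randrange(CODEBOOK_SIZE)


def nar_level(text, prompt, coarser_levels, rng):
    """Hypothetical stand-in for the non-autoregressive model: a whole level in one pass."""
    num_frames = len(coarser_levels[0])
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(num_frames)]


def generate(text, prompt, num_frames, seed=0):
    """Return NUM_LEVELS lists of codes, ordered from coarsest to finest."""
    rng = random.Random(seed)

    # Stage 1: coarsest level, one frame at a time, each step seeing the codes so far.
    first = []
    for _ in range(num_frames):
        first.append(ar_next_code(text, prompt, first, rng))

    # Stage 2: each finer level is generated for all frames at once,
    # conditioned on every coarser level generated before it.
    levels = [first]
    for _ in range(1, NUM_LEVELS):
        levels.append(nar_level(text, prompt, levels, rng))
    return levels


levels = generate("hello world", prompt=[], num_frames=5)
print(len(levels), len(levels[0]))
```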

Reading

Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction

FastPitch is very similar to FastSpeech 2, but has the advantage of an official open-source implementation by the author (at NVIDIA).

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec

There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.

Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).