The state of the art

The content of this part of the course is updated each year. We will cover the latest developments.


Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class.

For 2023-24, there will be two classes devoted to the state of the art. Please check the “Readings” and “Class” tabs to see what we’ll cover in each of them.

 

For the class on 2024-03-26, you need to reread the FastPitch paper from last week, and also read the SoundStream paper. Bring copies of both papers with you to class.

For the class on 2024-04-02, read the paper “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” (which is about the VALL-E model).

FastPitch

Focus on:

  • the architecture in Figure 1
  • the pitch predictor
    • what does it predict? how is that used by the rest of the model? how is it trained?
  • the duration predictor
    • the same questions: what does it predict? how is that used by the rest of the model? how is it trained?
  • how the model is trained as a whole
  • how inference is performed
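
To make the roles of the two predictors concrete, here is a minimal sketch of how their outputs are used at inference time. It assumes one pitch value and one duration (in frames) per input symbol; the function names, the toy numbers and the scalar-times-vector "pitch embedding" are illustrative stand-ins, not the layers from the paper or from the NVIDIA implementation.

```python
def add_pitch(encoder_out, pitch, pitch_weights):
    """Add a pitch embedding to each per-symbol vector.

    The embedding here is just pitch * pitch_weights, a hypothetical
    stand-in for the learned projection in the real model.
    """
    return [
        [h + p * w for h, w in zip(vec, pitch_weights)]
        for vec, p in zip(encoder_out, pitch)
    ]


def length_regulate(vectors, durations):
    """Repeat each per-symbol vector durations[i] times.

    This moves the sequence from symbol rate to frame rate, so that a
    decoder can produce one mel spectrogram frame per output vector.
    """
    frames = []
    for vec, dur in zip(vectors, durations):
        frames.extend(list(vec) for _ in range(dur))
    return frames


if __name__ == "__main__":
    # Toy values: 3 input symbols, 2-dimensional encoder outputs.
    encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    predicted_pitch = [0.5, -1.0, 0.0]   # one value per symbol
    predicted_durations = [2, 1, 3]      # frames per symbol
    conditioned = add_pitch(encoder_out, predicted_pitch, [1.0, 2.0])
    frames = length_regulate(conditioned, predicted_durations)
    print(len(frames), "frames would be passed to the decoder")
```

During training, the reference pitch and durations take the place of the predicted values in these two steps, and each predictor is trained to match its reference.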

You don’t need to fully understand:

  • any details of how a Transformer works
  • how the WaveGlow vocoder works: just assume it can generate a waveform from a mel spectrogram
  • the details of the evaluation

SoundStream

Focus on:

  • the big idea, expressed in Figure 2
    • and some understanding of the architecture in Figure 3
  • the introduction in Section I
  • the core idea of converting a waveform into a sequence of symbols using Vector Quantisation
    • and some understanding of the more advanced idea of Residual Vector Quantisation in Section III.C
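
The two quantisation ideas above can be sketched in a few lines. This is a toy illustration only: the codebooks are random rather than learned, their sizes are arbitrary, and every function name is made up for this example. Each toy codebook also includes an all-zero entry, which is a convenience (it guarantees that adding a stage never increases the error) and not something taken from the paper.

```python
import random


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Residual VQ: each stage quantises what the previous stages left over."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)   # plain VQ on the residual
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entry from each stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_toy_codebooks(num_stages, size, dim, seed=0):
    """Random codebooks whose entries shrink at each stage (coarse to fine)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        entries = [[0.0] * dim]               # toy choice: include a zero entry
        entries += [[rng.uniform(-scale, scale) for _ in range(dim)]
                    for _ in range(size - 1)]
        codebooks.append(entries)
    return codebooks


if __name__ == "__main__":
    codebooks = make_toy_codebooks(num_stages=4, size=16, dim=3)
    x = [0.3, -0.7, 0.2]
    codes = rvq_encode(x, codebooks)
    for k in range(len(codes) + 1):
        print(k, "stages, error", sq_error(x, rvq_decode(codes[:k], codebooks)))
```

With a single codebook this is ordinary Vector Quantisation: one vector becomes one symbol. With several stages, one vector becomes a short list of symbols, ordered from coarsest to finest, and a decoder can use only the first few of them at the cost of a larger error.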

You don’t need to fully understand:

  • traditional audio codecs in Section II
  • denoising
  • training the model using a discriminator

VALL-E

Focus on:

  • the architecture in Figure 1
  • the general concept of framing speech generation as language modelling

You don’t need to understand:

  • Section 3 – just assume the audio codec is the same as SoundStream (although VALL-E uses EnCodec)
  • Figure 3, which shows how the audio codec codes are generated in a specific order (first, all the coarsest ones are generated one at a time using an autoregressive model, then the remaining ones are generated by a non-autoregressive model, all time steps at once)
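
Although you don't need the details, the generation order is easy to picture as code. The sketch below only illustrates the order: the two "models" are hypothetical stand-ins that return random codes, the number of frames is fixed in advance (a real model decides for itself when to stop), and the default of 8 levels with 1024 entries each is an assumption about the codec configuration.

```python
import random


def toy_first_level_step(text, prompt, codes_so_far, rng, codebook_size):
    """Stand-in for the first model: predict the NEXT coarsest-level code."""
    return rng.randrange(codebook_size)


def toy_remaining_level(text, prompt, levels_so_far, rng, codebook_size):
    """Stand-in for the second model: predict one WHOLE level in one go."""
    num_frames = len(levels_so_far[0])
    return [rng.randrange(codebook_size) for _ in range(num_frames)]


def generate_codes(text, prompt, num_frames, num_levels=8,
                   codebook_size=1024, seed=0):
    """Return a list of levels; each level holds one code per frame."""
    rng = random.Random(seed)

    # Step 1: coarsest level, one code at a time; each step can see
    # the text, the speech prompt, and the codes generated so far.
    first_level = []
    for _ in range(num_frames):
        first_level.append(
            toy_first_level_step(text, prompt, first_level, rng, codebook_size))

    # Step 2: every remaining level, all frames at once, given the
    # levels that have already been generated.
    levels = [first_level]
    for _ in range(1, num_levels):
        levels.append(
            toy_remaining_level(text, prompt, levels, rng, codebook_size))
    return levels


if __name__ == "__main__":
    levels = generate_codes(text="hello", prompt=[], num_frames=5)
    print(len(levels), "levels of", len(levels[0]), "codes each")
```

The resulting grid of codes (levels by frames) is what the codec's decoder would turn back into a waveform.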

Reading

Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction

FastPitch is very similar to FastSpeech 2, and has the advantage of an official open-source implementation by the author (at NVIDIA).

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec

There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.

Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).

Here are some more talks that discuss the state of the art. There is some overlap in the material, but each talk comes from a different angle and was given to a different audience.

You now need to explore the literature for yourself, to find out what the current state of the art is. Before you do, you are strongly recommended to develop a good understanding of the approach covered in the course up to and including module 8, so that your understanding has a solid foundation.

Here are the key places to start looking for good papers:

Conferences

Journals