Module status: readings and slides are ready for both classes: 2025-03-18 and 2025-03-25
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class.
For 2024-25, there will be two classes devoted to the state-of-the-art. Please check the “Readings” and “Class” tabs to see what we’ll cover in each of them.
As a gentle warm-up before the readings, you should watch Simon’s keynote talk from ACM CUI 2024. The audience for that talk was mainly academic and industry researchers working on Conversational User Interfaces: in other words, users of speech synthesis rather than developers.
For the class on 2025-03-18, you need to:
- re-read the FastPitch paper – we will use this model as our example for understanding model training
- read the SoundStream paper – an example of a neural audio codec
- read “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” (which is about the VALL-E model) – an example of a Large Speech Language Model
- bring copies of the above papers with you to class
For the class on 2025-03-25, you need to:
- re-read the VALL-E paper
- relax! The following optional readings are to be done after class:
- L. Sun, K. Li, H. Wang, S. Kang and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in Proc. 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA. doi: 10.1109/ICME.2016.7552917
- Tobing, P.L., Wu, Y.-C., Hayashi, T., Kobayashi, K., Toda, T. (2019) “Non-Parallel Voice Conversion with Cyclic Variational Autoencoder” in Proc. Interspeech 2019, Graz, Austria. doi:10.21437/Interspeech.2019-2307
- MORE COMING SOON
FastPitch
Focus on:
- the architecture in Figure 1
- the pitch predictor
  - what does it predict? how is that used by the rest of the model? how is it trained?
- the duration predictor
  - the same questions as for the pitch predictor
- how the model is trained as a whole
- how inference is performed (there is a short illustrative sketch after this section)
You don’t need to fully understand:
- any details of how a Transformer works
- how the WaveGlow vocoder works: just assume it can generate a waveform from a mel spectrogram
- the details of the evaluation
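To make the pitch predictor, duration predictor and inference bullets concrete, here is a minimal PyTorch sketch of the inference path. This is not the official NVIDIA implementation: the layer sizes, the simple nn.Linear pitch embedding and all names are illustrative assumptions, and the Transformer encoder/decoder around these pieces is omitted.

```python
import torch
import torch.nn as nn

class TokenPredictor(nn.Module):
    """Stand-in for FastPitch's pitch or duration predictor: a small
    convolutional stack mapping each encoder output vector to one scalar
    per input token (average pitch, or log-duration)."""
    def __init__(self, d_model=384, hidden=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel, padding=kernel // 2),
        )

    def forward(self, h):                                 # h: (batch, tokens, d_model)
        return self.net(h.transpose(1, 2)).squeeze(1)     # (batch, tokens)

def inference_step(encoder_out, pitch_predictor, duration_predictor, pitch_embedding):
    """Predict pitch, add its embedding back onto the encoder outputs,
    predict durations, then repeat each token's vector that many times
    ('length regulation') so the decoder can generate a mel spectrogram."""
    pitch = pitch_predictor(encoder_out)                       # one pitch value per input token
    h = encoder_out + pitch_embedding(pitch.unsqueeze(-1))     # condition on predicted pitch
    log_dur = duration_predictor(encoder_out)                  # one log-duration per input token
    durations = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
    upsampled = [h[b].repeat_interleave(durations[b], dim=0)   # token i copied durations[i] times
                 for b in range(h.size(0))]
    return upsampled    # a decoder turns each frame sequence into a mel spectrogram

# usage: a batch of 2 utterances, 11 input tokens each, 384-dim encoder outputs
enc = torch.randn(2, 11, 384)
frames = inference_step(enc, TokenPredictor(), TokenPredictor(), nn.Linear(1, 384))
```

During training, the same two predictors are regressed against ground-truth pitch and duration targets, and those ground-truth values (not the predictions) are fed onwards, while the whole model is trained jointly with a mel-spectrogram reconstruction loss.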
SoundStream
Focus on:
- the big idea, expressed in Figure 2
  - and some understanding of the architecture in Figure 3
- the introduction in Section I
- the core idea of converting a waveform into a sequence of symbols using Vector Quantisation
  - and some understanding of the more advanced idea of Residual Vector Quantisation in Section III.C (a toy code sketch follows this section)
You don’t need to fully understand:
- traditional audio codecs in Section II
- denoising
- training the model using a discriminator
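The two quantisation bullets above are easier to grasp with a toy example. The sketch below is purely illustrative (random codebooks, one frame's embedding at a time, plain nearest-neighbour search); in SoundStream the codebooks are learned jointly with the encoder and decoder, but the residual structure is the same idea.

```python
import torch

def residual_vector_quantise(frame, codebooks):
    """Toy Residual Vector Quantisation: quantise an embedding with the first
    codebook, then quantise the remaining error with the next codebook, and so
    on. Each stage adds detail on top of the previous ones."""
    residual = frame                              # (dim,) embedding of one audio frame
    codes, quantised = [], torch.zeros_like(frame)
    for codebook in codebooks:                    # codebook: (num_entries, dim)
        distances = torch.cdist(residual.unsqueeze(0), codebook)   # distance to every entry
        index = distances.argmin().item()         # nearest entry = one discrete symbol
        codes.append(index)
        quantised = quantised + codebook[index]   # running reconstruction
        residual = residual - codebook[index]     # what is left for the next stage
    return codes, quantised                       # symbols to transmit, and the reconstruction

# usage: 4 codebooks of 1024 entries each, over a 128-dim frame embedding
codebooks = [torch.randn(1024, 128) for _ in range(4)]
codes, approx = residual_vector_quantise(torch.randn(128), codebooks)
```

Because each stage only encodes what the previous stages missed, transmitting fewer codebooks per frame still yields a usable (if coarser) reconstruction, which is what makes the codec scalable across bitrates.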
VALL-E
Focus on:
- the architecture in Figure 1
- the general concept of framing speech generation as language modelling (a control-flow sketch follows this section)
You don’t need to understand:
- Section 3 – just assume the audio codec is the same as SoundStream (although VALL-E uses EnCodec)
- Figure 3, which shows how the audio codec codes are generated in a specific order:
  - first, all the coarsest ones are generated using an autoregressive model
  - then all the remaining ones are generated all at once
  - why doesn't the model generate the set of codes for a single timestep all at once?
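To connect the "speech generation as language modelling" idea to the two-stage generation order sketched in Figure 3, here is a control-flow sketch. It is not the released VALL-E code: ar_model and nar_model are random stand-ins so that the snippet runs, and every name, size and signature here is an assumption for illustration only.

```python
import torch

NUM_CODES = 1024       # codebook size per quantiser; one extra id is used for end-of-sequence
EOS_ID = NUM_CODES

# Hypothetical stand-ins for the paper's Transformers: they return random values,
# so the snippet runs end-to-end and shows only the control flow.
def ar_model(text_tokens, codes_so_far):
    """Stage-1 stand-in: logits over the next first-quantiser code."""
    return torch.randn(NUM_CODES + 1)

def nar_model(text_tokens, codes_lower_levels, level):
    """Stage-2 stand-in: one code per frame for quantiser `level`, all frames at once."""
    return torch.randint(0, NUM_CODES, (codes_lower_levels.shape[1],))

def generate_speech_codes(text_tokens, prompt_codes, num_quantisers=8, max_frames=50):
    # Stage 1: exactly like a text language model, but the "words" are audio codec
    # codes from the first (coarsest) quantiser: given the text and the acoustic
    # prompt, predict the next code, append it, and repeat until end-of-sequence.
    codes_q1 = prompt_codes[0].tolist()                 # first-quantiser codes of the prompt
    for _ in range(max_frames):
        logits = ar_model(text_tokens, torch.tensor(codes_q1))
        next_code = torch.multinomial(logits.softmax(-1), 1).item()   # sample, like an LM
        if next_code == EOS_ID:
            break
        codes_q1.append(next_code)

    # Stage 2: the remaining quantisers' codes are filled in non-autoregressively,
    # one quantiser level at a time, each level covering all frames in one step.
    generated = [torch.tensor(codes_q1)]
    for level in range(2, num_quantisers + 1):
        generated.append(nar_model(text_tokens, torch.stack(generated), level))
    return torch.stack(generated)        # (num_quantisers, num_frames) -> audio codec decoder

# usage with made-up text token ids and an 8-quantiser, 20-frame acoustic prompt
codes = generate_speech_codes(torch.tensor([5, 17, 3]),
                              torch.randint(0, NUM_CODES, (8, 20)))
```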
Reading
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
FastPitch is very similar to FastSpeech 2, but has the advantage of an official open-source implementation by the author (at NVIDIA).
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes are generated conditionally, given a preceding sequence of text (and a speech prompt).
2025-03-18 class format:
- FastPitch – case study: model training
- SoundStream – learning to encode speech
- VALL-E – a Large Speech Language Model
2025-03-25 class format:
- VALL-E – a Large Speech Language Model (continued)
- Tasks beyond TTS, including Voice Conversion
Demo pages:
- Example audio codec: SoundStream
- Example Large Speech Language Models:
- Example speech editing model: VoiceCraft
- Example Voice Conversion models:
Download the slides for the class on 2025-03-18
Download the slides for the class on 2025-03-25
Here are some more talks from around 2017-2018 that will help you understand how we arrived at the current state-of-the-art. There is some overlap in the material, but each talk comes from a different angle and was aimed at a different audience.
- Speech Synthesis: Where did the signal processing go?
- Speaking naturally? It depends who is listening…
- Does ‘end-to-end’ speech synthesis make any sense?
You now need to explore the literature for yourself, to find out what the current state-of-the-art is. But you are strongly recommended first to develop a good understanding of the approaches covered earlier in the course, so that your exploration rests on a solid foundation.
Here are the key places to start looking for good papers:
Conferences
- Interspeech (the Annual Conference of the International Speech Communication Association) – start with the most recent year and work backwards
- IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Speech Synthesis Workshop – type “SSW” in the search box
- Blizzard Challenge workshops