Module status: readings and slides are ready for both classes: 2025-03-18 and 2025-03-25
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class.
For 2024-25, there will be two classes devoted to the state-of-the-art. Please check the “Readings” and “Class” tabs to see what we’ll cover in each of them.
As a gentle warm-up before the readings, you should watch Simon’s keynote talk from ACM CUI 2024. The audience for that talk was mainly academic and industry researchers working on Conversational User Interfaces: in other words, users of speech synthesis rather than developers.
For the class on 2025-03-18, you need to:
- re-read the FastPitch paper – we will use this model as our example for understanding model training
- read the SoundStream paper – an example of a neural audio codec
- read “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” (which is about the VALL-E model) – an example of a Large Speech Language Model
- bring copies of the above papers with you to class
For the class on 2025-03-25, you need to:
- re-read the VALL-E paper
- relax! The following optional readings are to be done after class:
- L. Sun, K. Li, H. Wang, S. Kang and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in Proc. 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA. doi: 10.1109/ICME.2016.7552917
- Tobing, P.L., Wu, Y.-C., Hayashi, T., Kobayashi, K., Toda, T. (2019) “Non-Parallel Voice Conversion with Cyclic Variational Autoencoder” in Proc. Interspeech 2019, Graz, Austria. doi:10.21437/Interspeech.2019-2307
- MORE COMING SOON
FastPitch
Focus on:
- the architecture in Figure 1
- the pitch predictor
  - what does it predict? how is that used by the rest of the model? how is it trained?
- the duration predictor
  - the same questions as for the pitch predictor
- how the model is trained as a whole
- how inference is performed (there is a short illustrative sketch after this section)
You don’t need to fully understand:
- any details of how a Transformer works
- how the WaveGlow vocoder works: just assume it can generate a waveform from a mel spectrogram
- the details of the evaluation
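To make the pitch predictor, duration predictor and inference bullets concrete, here is a minimal PyTorch sketch of the inference path. This is not the official NVIDIA implementation: the layer sizes, the simple nn.Linear pitch embedding and all names are illustrative assumptions, and the Transformer encoder/decoder around these pieces is omitted.

```python
import torch
import torch.nn as nn

class TokenPredictor(nn.Module):
    """Stand-in for FastPitch's pitch or duration predictor: a small
    convolutional stack mapping each encoder output vector to one scalar
    per input token (average pitch, or log-duration)."""
    def __init__(self, d_model=384, hidden=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel, padding=kernel // 2),
        )

    def forward(self, h):                                 # h: (batch, tokens, d_model)
        return self.net(h.transpose(1, 2)).squeeze(1)     # (batch, tokens)

def inference_step(encoder_out, pitch_predictor, duration_predictor, pitch_embedding):
    """Predict pitch, add its embedding back onto the encoder outputs,
    predict durations, then repeat each token's vector that many times
    ('length regulation') so the decoder can generate a mel spectrogram."""
    pitch = pitch_predictor(encoder_out)                       # one pitch value per input token
    h = encoder_out + pitch_embedding(pitch.unsqueeze(-1))     # condition on predicted pitch
    log_dur = duration_predictor(encoder_out)                  # one log-duration per input token
    durations = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
    upsampled = [h[b].repeat_interleave(durations[b], dim=0)   # token i copied durations[i] times
                 for b in range(h.size(0))]
    return upsampled    # a decoder turns each frame sequence into a mel spectrogram

# usage: a batch of 2 utterances, 11 input tokens each, 384-dim encoder outputs
enc = torch.randn(2, 11, 384)
frames = inference_step(enc, TokenPredictor(), TokenPredictor(), nn.Linear(1, 384))
```

During training, the same two predictors are regressed against ground-truth pitch and duration targets, and those ground-truth values (not the predictions) are fed onwards, while the whole model is trained jointly with a mel-spectrogram reconstruction loss.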
SoundStream
Focus on:
- the big idea, expressed in Figure 2
  - and some understanding of the architecture in Figure 3
- the introduction in Section I
- the core idea of converting a waveform into a sequence of symbols using Vector Quantisation
  - and some understanding of the more advanced idea of Residual Vector Quantisation in Section III.C (a toy code sketch follows this section)
You don’t need to fully understand:
- traditional audio codecs in Section II
- denoising
- training the model using a discriminator
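The two quantisation bullets above are easier to grasp with a toy example. The sketch below is purely illustrative (random codebooks, one frame's embedding at a time, plain nearest-neighbour search); in SoundStream the codebooks are learned jointly with the encoder and decoder, but the residual structure is the same idea.

```python
import torch

def residual_vector_quantise(frame, codebooks):
    """Toy Residual Vector Quantisation: quantise an embedding with the first
    codebook, then quantise the remaining error with the next codebook, and so
    on. Each stage adds detail on top of the previous ones."""
    residual = frame                              # (dim,) embedding of one audio frame
    codes, quantised = [], torch.zeros_like(frame)
    for codebook in codebooks:                    # codebook: (num_entries, dim)
        distances = torch.cdist(residual.unsqueeze(0), codebook)   # distance to every entry
        index = distances.argmin().item()         # nearest entry = one discrete symbol
        codes.append(index)
        quantised = quantised + codebook[index]   # running reconstruction
        residual = residual - codebook[index]     # what is left for the next stage
    return codes, quantised                       # symbols to transmit, and the reconstruction

# usage: 4 codebooks of 1024 entries each, over a 128-dim frame embedding
codebooks = [torch.randn(1024, 128) for _ in range(4)]
codes, approx = residual_vector_quantise(torch.randn(128), codebooks)
```

Because each stage only encodes what the previous stages missed, transmitting fewer codebooks per frame still yields a usable (if coarser) reconstruction, which is what makes the codec scalable across bitrates.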
VALL-E
Focus on:
- the architecture in Figure 1
- the general concept of framing speech generation as language modelling (a control-flow sketch follows this section)
You don’t need to understand:
- Section 3 – just assume the audio codec is the same as SoundStream (although VALL-E uses EnCodec)
- Figure 3, which shows how the audio codec codes are generated in a specific order:
  - first, all the coarsest ones are generated using an autoregressive model
  - then all the remaining ones are generated all at once
  - why doesn't the model generate the set of codes for a single timestep all at once?
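To connect the "speech generation as language modelling" idea to the two-stage generation order sketched in Figure 3, here is a control-flow sketch. It is not the released VALL-E code: ar_model and nar_model are random stand-ins so that the snippet runs, and every name, size and signature here is an assumption for illustration only.

```python
import torch

NUM_CODES = 1024       # codebook size per quantiser; one extra id is used for end-of-sequence
EOS_ID = NUM_CODES

# Hypothetical stand-ins for the paper's Transformers: they return random values,
# so the snippet runs end-to-end and shows only the control flow.
def ar_model(text_tokens, codes_so_far):
    """Stage-1 stand-in: logits over the next first-quantiser code."""
    return torch.randn(NUM_CODES + 1)

def nar_model(text_tokens, codes_lower_levels, level):
    """Stage-2 stand-in: one code per frame for quantiser `level`, all frames at once."""
    return torch.randint(0, NUM_CODES, (codes_lower_levels.shape[1],))

def generate_speech_codes(text_tokens, prompt_codes, num_quantisers=8, max_frames=50):
    # Stage 1: exactly like a text language model, but the "words" are audio codec
    # codes from the first (coarsest) quantiser: given the text and the acoustic
    # prompt, predict the next code, append it, and repeat until end-of-sequence.
    codes_q1 = prompt_codes[0].tolist()                 # first-quantiser codes of the prompt
    for _ in range(max_frames):
        logits = ar_model(text_tokens, torch.tensor(codes_q1))
        next_code = torch.multinomial(logits.softmax(-1), 1).item()   # sample, like an LM
        if next_code == EOS_ID:
            break
        codes_q1.append(next_code)

    # Stage 2: the remaining quantisers' codes are filled in non-autoregressively,
    # one quantiser level at a time, each level covering all frames in one step.
    generated = [torch.tensor(codes_q1)]
    for level in range(2, num_quantisers + 1):
        generated.append(nar_model(text_tokens, torch.stack(generated), level))
    return torch.stack(generated)        # (num_quantisers, num_frames) -> audio codec decoder

# usage with made-up text token ids and an 8-quantiser, 20-frame acoustic prompt
codes = generate_speech_codes(torch.tensor([5, 17, 3]),
                              torch.randint(0, NUM_CODES, (8, 20)))
```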
Reading
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
FastPitch is very similar to FastSpeech 2, but has the advantage of an official open-source implementation by the author (at NVIDIA).
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes are generated conditionally, given a preceding sequence of text (and a speech prompt).
2025-03-18 class format:
- FastPitch – case study: model training
- SoundStream – learning to encode speech
- VALL-E – a Large Speech Language Model
2025-03-25 class format:
- VALL-E – a Large Speech Language Model (continued)
- Tasks beyond TTS, including Voice Conversion
Demo pages:
- Example audio codec: SoundStream
- Example Large Speech Language Models:
- Example speech editing model: VoiceCraft
- Example Voice Conversion models:
Download the slides for the class on 2025-03-18
Download the slides for the class on 2025-03-25
Here are some more talks from around 2017-2018 that will help you understand how we arrived at the current state-of-the-art. There is some overlap in the material, but each talk comes from a different angle and was aimed at a different audience.
- Speech Synthesis: Where did the signal processing go?
- Speaking naturally? It depends who is listening…
- Does ‘end-to-end’ speech synthesis make any sense?
You now need to explore the literature for yourself, to find out what the current state-of-the-art is. But you are strongly recommended first to develop a good understanding of the approaches covered earlier in the course, so that your exploration rests on a solid foundation.
Here are the key places to start looking for good papers:
Conferences
- Interspeech (the Annual Conference of the International Speech Communication Association) – start with the most recent year and work backwards
- IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Speech Synthesis Workshop – type “SSW” in the search box
- Blizzard Challenge workshops