Module status: see individual classes below
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class.
- 2026-03-10 Neural speech processing (vocoders; audio codecs; representation learning)
- Status: ready
- we need to revisit representations of both text and speech; the key advance will be to find a discrete representation of speech
- 2026-03-17 Large Speech Language Models
- Status: ready
- a discrete representation of speech will enable us to use models that can only generate discrete representations: language models
- 2026-03-24 Beyond Text-to-Speech (cloning, conversion, anonymisation,…)
- Status: not ready
- yes, there is more to life than TTS! We don’t have to limit ourselves to textual input!
Status: see individual classes below
As a gentle warm-up before the readings, you could watch Simon’s keynote talk from ACM CUI 2024. The audience for that talk was mainly academic and industry researchers working on Conversational User Interfaces: in other words, users of speech synthesis rather than developers.
Each of the first two classes has one Essential reading. There are some reading tips below. You should bring a copy of the paper to class.
2026-03-10 Neural speech processing (vocoders; audio codecs; representation learning)
- Status: reading is ready
- read the SoundStream paper – an example of a neural audio codec
- Focus on:
- the big idea, expressed in Figure 2
- some understanding of the architecture in Figure 3
- the introduction in Section I
- the core idea of converting a waveform into a sequence of symbols using Vector Quantisation
- some understanding of the more advanced idea of Residual Vector Quantisation in Section III.C
- You don’t need to fully understand:
- traditional audio codecs in Section II
- denoising
- training the model using a discriminator
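The Residual Vector Quantisation idea in Section III.C can be sketched in a few lines: each stage picks the nearest codebook vector, and the next stage quantises whatever error remains. Below is a toy NumPy sketch with small random codebooks (SoundStream learns its codebooks jointly with the encoder and decoder; nothing here is trained), purely to show the encode/decode mechanics:

```python
import numpy as np

def nearest(codebook, x):
    """Index of the codebook vector closest to x (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantises the residual left by the previous one."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = residual - cb[i]   # pass what remains to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the chosen vectors from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
# Two toy codebooks of 8 entries each, for 4-dimensional "frames".
# The second codebook includes a zero vector so the refinement stage
# can never make the reconstruction worse in this toy setup.
codebooks = [rng.normal(size=(8, 4)),
             np.vstack([np.zeros(4), 0.1 * rng.normal(size=(7, 4))])]
x = rng.normal(size=4)

idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
err1 = np.linalg.norm(x - codebooks[0][idx[0]])  # error after stage 1 only
err2 = np.linalg.norm(x - x_hat)                 # error after both stages
```

Later codebooks only refine the output of earlier ones, which is why the first quantiser's codes carry the most information (a point that matters again for VALL-E next week).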
2026-03-17 Large Speech Language Models
- Status: reading is ready
- read “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” (which is about the VALL-E model) – an example of a Large Speech Language Model
- Focus on:
- the architecture in Figure 1
- the general concept of framing speech generation as language modelling
- the idea in Section 4.2.1 of auto-regressively generating audio codes, concentrating only on generating the codes for the first quantiser
- for the experiments reported in Section 5, just get an overall idea of what evaluation methods are used and what Tables 2 and 3 mean
- listening to lots of sample output provided on the demo page (only for VALL-E, not the many other models on that page)
- You don’t need to understand:
- Section 3 – just assume the audio codec is the same as SoundStream (although VALL-E uses EnCodec)
- Figure 3 and Section 4.2.2 which explain how the audio codec codes are generated in a specific order; this is necessary (but rather inconvenient) because the audio codec uses Residual Vector Quantisation.
- The ablation study within Section 5.2
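The "speech generation as language modelling" framing in Section 4.2.1 boils down to a simple loop: condition on the text tokens plus the codes of an acoustic prompt, then emit one first-quantiser code per frame, autoregressively. The sketch below is emphatically not VALL-E (which uses a Transformer and sampling, whereas this toy uses a hash-based stand-in predictor and argmax); it only shows the shape of that loop:

```python
import numpy as np

def toy_model(text_ids, history, vocab):
    """Stand-in for p(next code | text, codes so far).

    A real model would be a Transformer; here we just derive deterministic
    pseudo-logits from a hash of the (truncated) context.
    """
    seed = hash((tuple(text_ids), tuple(history[-4:]))) % 2**32
    return np.random.default_rng(seed).normal(size=vocab)

def generate_codes(text_ids, prompt_codes, n_frames, vocab=1024):
    """Autoregressive loop: append one first-quantiser code per frame."""
    codes = list(prompt_codes)          # the acoustic prompt seeds the history
    for _ in range(n_frames):
        logits = toy_model(text_ids, codes, vocab)
        codes.append(int(np.argmax(logits)))   # greedy here; VALL-E samples
    return codes[len(prompt_codes):]    # return only the new frames

# Hypothetical inputs: 3 text tokens, a 2-frame prompt, 8 frames to generate.
new_codes = generate_codes([7, 3, 9], [12, 500], n_frames=8)
```

Everything downstream of this loop (turning the codes back into a waveform) is the audio codec's decoder, which is why a good discrete representation of speech was the prerequisite.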
2026-03-24 Beyond Text-to-Speech (cloning, conversion, anonymisation,…)
- Status: not ready
- There are no essential readings for this class, so nothing to prepare in advance!
- A list of suggested follow-up reading will be posted here after class. None of this is “Essential” or examinable.
- Last year’s list:
- L. Sun, K. Li, H. Wang, S. Kang and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in Proc. 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA. doi: 10.1109/ICME.2016.7552917
- Tobing, P.L., Wu, Y.-C., Hayashi, T., Kobayashi, K., Toda, T. (2019) “Non-Parallel Voice Conversion with Cyclic Variational Autoencoder” in Proc. Interspeech 2019, Graz, Austria. doi:10.21437/Interspeech.2019-2307
Reading
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
Status: see individual classes below
Slides
- 2026-03-10 Neural speech processing (vocoders; audio codecs; representation learning)
- Download the slides for the class on 2026-03-10 (post-class version)
- 2026-03-17 Large Speech Language Models
- Download the slides for the class on 2026-03-17 (pre-class version)
- 2026-03-24 Beyond Text-to-Speech (cloning, conversion, anonymisation,…)
- Status: slides not ready
Demo pages
- Example audio codec: SoundStream
- Example Large Speech Language Models:
- Shannon Text Generator
- This is just a character N-gram trained on a small amount of text, not an LLM!
- Try training it on natural language from different domains, or even on Python code.
- Example speech editing model: VoiceCraft
- Example Voice Conversion models:
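The Shannon Text Generator demo above is a character N-gram, not an LLM. A minimal sketch of the same idea, using a made-up toy corpus (this is not the demo's own code): count which character follows each fixed-length context in the training text, then repeatedly sample the next character in proportion to those counts.

```python
from collections import Counter, defaultdict
import random

def train_char_ngram(text, order=3):
    """For each `order`-character context, count which character follows it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, seed_context, length=30, rng=None):
    """Shannon-style generation: sample each next character in proportion
    to how often it followed the current context in the training text.
    `seed_context` must be `order` characters long."""
    rng = rng or random.Random(0)
    out = seed_context
    for _ in range(length):
        options = counts.get(out[-len(seed_context):])
        if not options:
            break                      # unseen context: stop early
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# Hypothetical tiny corpus; real runs would use much more text.
corpus = "the theory of the thing is that the thing theorises " * 4
model = train_char_ngram(corpus, order=3)
sample = generate(model, "the", length=30)
```

Training it on text from different domains simply changes the counts, which is why the demo's output style tracks its training data so closely.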
Here are some more talks from around 2017-2018 that will help you understand how we arrived at the current state of the art. There is some overlap in the material, but each talk comes from a different angle and was given to a different audience.
- Speech Synthesis: Where did the signal processing go?
- Speaking naturally? It depends who is listening…
- Does ‘end-to-end’ speech synthesis make any sense?
To go beyond this course, you need to explore the literature for yourself to find out what the current state of the art is. But you are strongly recommended to first develop a good understanding of the approaches covered up to this point in the course, so that your understanding has a solid foundation.
Here are the key places to start looking for good papers:
Conferences
- Interspeech (the Annual Conference of the International Speech Communication Association) – start with the most recent year and work backwards
- IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Speech Synthesis Workshop – type “SSW” in the search box
- Blizzard Challenge workshops