Module status: ready
Before you start this module, you should complete the Essential readings for all the previous modules. So, if you haven’t yet done all of them, now would be a great time to catch up. You need to build that foundation before moving on to the more advanced material in this module.
Also make sure you have watched all the videos up to this point. You may find it helpful to watch the optional video “What is ‘end-to-end’ speech synthesis?” in the second video tab of Module 8 before moving on.
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class. It is therefore doubly important to do the Essential readings beforehand.
In this module: sequence-to-sequence models (encoder-decoder architecture)
A widely-used approach is to treat text-to-speech as a sequence-to-sequence problem, and the most common choice of model is a neural network with an encoder-decoder architecture. To understand encoder-decoder models, it is helpful to think in terms of solving three problems:
- regression from input to output
- alignment of the input and output sequences, during training
- duration prediction of the output sequence, during inference (synthesis)
In an encoder-decoder architecture, the encoder accepts the input sequence and transforms it (i.e., performs regression) into a learned internal representation. The decoder accepts this representation and transforms it (i.e., performs regression) into the output sequence.
The model designer must decide on the representations of the input and output sequences, but the model learns the internal representation. Problem 1 is solved by the encoder and the decoder: input → learned internal representation → output
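To make Problem 1 concrete, here is a minimal numpy sketch of the regression chain input → learned internal representation → output. It is purely illustrative and not taken from any of the models in the readings: the dimensions are arbitrary choices, and the weights are random here, whereas in a real model they would be learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from any real model):
# 5 input positions (e.g. phone embeddings of dim 8), a 16-dim internal
# representation, and an 80-bin mel spectrogram vector per output position.
x = rng.standard_normal((5, 8))        # input sequence

W_enc = rng.standard_normal((8, 16))   # encoder weights (learned, in practice)
W_dec = rng.standard_normal((16, 80))  # decoder weights (learned, in practice)

h = np.tanh(x @ W_enc)   # encoder: regression to the learned internal representation
y = h @ W_dec            # decoder: regression to the output sequence

print(h.shape, y.shape)  # (5, 16) (5, 80)
```

Note that this sketch produces exactly one output vector per input position; it says nothing yet about the difference in length between the input and output sequences, which is Problems 2 and 3.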
The difference in timescales between the input and output sequences is taken care of in between the encoder and decoder. One class of models (e.g., Tacotron 2) treats Problems 2 and 3 as the same thing and solves them using a single neural mechanism called attention. Another class of models (e.g., FastPitch) uses separate mechanisms for Problems 2 and 3. Whilst it appears elegant to solve two problems with a single mechanism, the problem of alignment is actually different from the problem of duration prediction.
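In the second class of models, the mechanism that bridges the timescales at inference is an explicit upsampling step (called a “length regulator” in FastSpeech/FastPitch): each encoder output is simply repeated for its predicted number of output frames. Here is a minimal numpy sketch of that idea; the values are made up for illustration, and in a real model the durations would come from a learned duration predictor.

```python
import numpy as np

def length_regulate(h, durations):
    """Upsample encoder outputs by repeating each one for its predicted
    number of output frames (a sketch of the 'length regulator' idea)."""
    return np.repeat(h, durations, axis=0)

h = np.arange(3 * 2, dtype=float).reshape(3, 2)  # 3 encoder outputs, dim 2
durations = np.array([2, 1, 3])                   # predicted frames per input

out = length_regulate(h, durations)
print(out.shape)  # (6, 2): total output frames = sum of durations
```

During training, the durations must instead come from an alignment between the input and output sequences, which is exactly why Problem 2 (alignment, at training time) and Problem 3 (duration prediction, at inference time) are different problems.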
Reading
Shen et al. Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.
Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
This week’s class will develop sequence-to-sequence models by improving on the simple DNN from the previous module. After covering the necessary neural building blocks, we will do two case studies based on the Essential readings. Make sure you have read both papers beforehand, and bring copies with you to class.
Download the slides for the class on 2026-03-03
The slides are now the post-class version. The case study on FastPitch was not covered in class, so it is set as homework. We will quickly go over this in the next class.
You need to search for example model output on your own, as part of exploring the literature. You should have found the official Google Tacotron 2 samples page. You can listen to samples comparing FastPitch and Tacotron at https://fastpitch.github.io/ — for the state-of-the-art models coming up in the remainder of the course, first read the paper, then search for samples to listen to (sometimes the paper will have a link to a demo page).
The remainder of the course will cover the state of the art. We will be reading recent research papers, so make sure to allow plenty of time to do that before the class: these are challenging papers and will require multiple read-throughs.
Roadmap:
- Neural speech processing (vocoders; audio codecs; representation learning)
  - we need to revisit representations of both text and speech; the key advance will be to find a discrete representation of speech
- Large Speech Language Models
  - a discrete representation of speech will enable us to use models that can only generate discrete representations: language models
- Beyond Text-to-Speech (cloning, conversion, anonymisation,…)
  - yes, there is more to life than TTS! We don’t have to limit ourselves to textual input!