Module status: not ready
Before you do this module, you should complete the Essential readings for all the previous modules. So, if you haven’t yet done all of them, now would be a great time to catch up. You need to build that foundation before moving up to the more advanced material in this module.
Also make sure you have watched all the videos up to this point. You may find it helpful to watch the optional video What is “end-to-end” speech synthesis? in the second video tab of Module 8 before moving on.
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class. It is therefore doubly important to do the Essential readings beforehand.
The most successful approach at the moment is to use sequence-to-sequence neural networks (sometimes called encoder-decoder networks because they almost always use that neural architecture). These models need to solve three problems:
1. regression from input to output
2. alignment during training
3. duration prediction during inference (synthesis)
All models solve problem 1 in essentially the same way, using an encoder-decoder architecture. Of course there are many, many choices to make in the architectures of the encoder and decoder, but the main conceptual difference between models is in how they solve problems 2 and 3.
One class of models (e.g., Tacotron 2) attempts to solve problems 2 and 3 jointly using a single neural architecture. Another class of models (e.g., FastPitch) uses separate mechanisms for problems 2 and 3.
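To make the second class of models more concrete, here is a minimal sketch of what is often called length regulation: each encoder output is repeated for as many frames as its duration, so that the decoder receives an input sequence of the same length as the output. This is an illustration under stated assumptions, not the FastPitch implementation: the function name `length_regulate` and the toy values are hypothetical, and real systems operate on tensors rather than Python lists.

```python
from typing import List, Sequence


def length_regulate(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoder state for its duration (measured in frames).

    Hypothetical helper for illustration only: one encoder state per
    input symbol (e.g., per phone), one integer duration per state.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")

    frames: List[List[float]] = []
    for state, duration in zip(encoder_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per output frame; a duration of 0 skips it.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example: 3 input symbols, each with a 2-dimensional encoder state.
toy_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_durations = [2, 1, 3]  # at synthesis time these would be predicted
upsampled = length_regulate(toy_states, toy_durations)
print(len(upsampled))  # 6 frames in total (2 + 1 + 3)
```

During training the durations would come from an alignment of the training data, and during synthesis from a duration predictor, which is why this class of models can treat problems 2 and 3 separately.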
Whilst it appears elegant to solve two problems with a single architecture, we know that the problem of alignment is actually very different from the problem of duration prediction. Alignment is very similar to Automatic Speech Recognition (ASR), so we might want to take advantage of the best available ASR models to do that. In contrast, duration prediction is a straightforward regression task.
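The link between the two problems can be sketched as follows: once an external aligner has assigned each acoustic frame to an input symbol, the duration of each symbol is just a frame count, and those counts become the targets for a regression model. This is a minimal sketch under assumptions: the function names are hypothetical, the alignment is assumed to be hard and monotonic, and the log-domain target `log(d + 1)` is one possible choice made here for illustration rather than something prescribed by the readings.

```python
import math
from typing import List


def durations_from_alignment(frame_to_phone: List[int], num_phones: int) -> List[int]:
    """Count how many frames are aligned to each phone.

    frame_to_phone[t] is the index of the phone aligned to frame t,
    for example as produced by a forced aligner (assumed, not shown).
    """
    durations = [0] * num_phones
    previous = 0
    for phone_index in frame_to_phone:
        if not 0 <= phone_index < num_phones:
            raise ValueError("phone index out of range")
        if phone_index < previous:
            raise ValueError("alignment must be monotonic")
        durations[phone_index] += 1
        previous = phone_index
    return durations


def log_duration_targets(durations: List[int]) -> List[float]:
    """Map frame counts to log-domain regression targets (assumed choice)."""
    return [math.log(d + 1) for d in durations]


# Toy alignment: 6 frames aligned to 3 phones.
toy_alignment = [0, 0, 1, 2, 2, 2]
toy_counts = durations_from_alignment(toy_alignment, num_phones=3)
print(toy_counts)  # [2, 1, 3]
print(log_duration_targets(toy_counts))
```

The alignment step needs the audio and is only used in training; the duration predictor is then trained to produce these targets from the text alone, which is all that is available at synthesis time.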
Reading
Shen et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster to train because it uses non-recurrent (convolutional) architectures for the encoder and decoder.
Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
You can listen to samples comparing FastPitch and Tacotron at https://fastpitch.github.io/
In the remainder of the course we will look at some state-of-the-art models that are currently in use. These will be sequence-to-sequence models of the type introduced in this module. We will be reading recent research papers about these models, so make sure to allow plenty of time to do that before the class: these are challenging papers and will require multiple read-throughs.