Module status: ready
Before you do this module, you should complete the Essential readings for all the previous modules. So, if you haven’t yet done all of them, now would be a great time to catch up. You need to build that foundation before moving up to the more advanced material in this module.
Also make sure you have watched all the videos up to this point. You may find it helpful to watch the optional video "What is 'end-to-end' speech synthesis?" in the second video tab of Module 8 before moving on.
Because we are now covering very recent developments, which change every year, there are no videos for this module. We'll cover everything in class. It is therefore doubly important to do the Essential readings beforehand.
The most successful current approach uses sequence-to-sequence neural networks (often called encoder-decoder networks, after the architecture they almost always employ). These models need to solve three problems:
1. regression from input to output
2. alignment during training
3. duration prediction during inference (synthesis)
All models solve problem 1 in essentially the same way: with an encoder-decoder architecture. There are, of course, many choices to make within the encoder and the decoder, but the main conceptual difference between models is in how they solve problems 2 and 3.
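The shape of problem 1 (regression from input to output) can be sketched in a few lines of toy numpy. All dimensions, weights, and the one-frame-per-phone simplification here are invented for illustration; no real model is this simple:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (invented for illustration).
N_PHONES = 40   # input vocabulary: phone identities
D_ENC = 8       # encoder hidden size
N_MELS = 80     # output: mel-spectrogram bins per frame

# Encoder: embed each input phone, then apply a linear layer.
embed = rng.normal(size=(N_PHONES, D_ENC))
W_enc = rng.normal(size=(D_ENC, D_ENC))

def encode(phone_ids):
    """Map a phone sequence to a sequence of hidden vectors."""
    return np.tanh(embed[phone_ids] @ W_enc)

# Decoder: regress each hidden vector to one frame of mel features.
W_dec = rng.normal(size=(D_ENC, N_MELS))

def decode(hidden):
    """Map hidden vectors to mel-spectrogram frames (a regression)."""
    return hidden @ W_dec

phones = np.array([3, 17, 5])      # a toy input phone sequence
mel = decode(encode(phones))
print(mel.shape)                   # (3, 80): one frame per input step
```

Note that this toy regressor emits exactly one output frame per input phone. Real speech has many frames per phone, and deciding how many is precisely where problems 2 and 3 come in.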
One class of models (e.g., Tacotron 2) attempts to solve problems 2 and 3 jointly, with a single neural architecture. Another class (e.g., FastPitch) uses separate mechanisms for each.
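To make the second approach concrete, here is a minimal sketch of the length-regulation idea used by FastPitch-style models: a separately predicted integer duration for each input symbol is used to repeat that symbol's encoder output, so that at synthesis time problem 3 reduces to simple upsampling. The function name and toy values are mine, not taken from any actual codebase:

```python
import numpy as np

def length_regulate(encoder_out, durations):
    """Upsample encoder outputs by repeating each hidden vector
    according to its predicted duration (in frames)."""
    return np.repeat(encoder_out, durations, axis=0)

# Toy encoder outputs: 3 phones, each a 4-dimensional hidden vector.
enc = np.arange(12, dtype=float).reshape(3, 4)
durs = np.array([2, 1, 3])   # predicted frames per phone

frames = length_regulate(enc, durs)
print(frames.shape)          # (6, 4): total frames = sum of durations
```

A separate decoder then regresses these upsampled vectors to acoustic frames, with no attention mechanism needed at synthesis time.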
Whilst it appears elegant to solve two problems with a single architecture, alignment is actually a very different problem from duration prediction. Alignment is closely related to Automatic Speech Recognition (ASR), so we might wish to take advantage of the best available ASR models for it. Duration prediction, in contrast, is a straightforward regression task.
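One way to see why duration prediction is plain regression: given an alignment that labels each acoustic frame with a phone (e.g., from an external aligner or an ASR-style model), per-phone durations are just run lengths, and those integers become the regression targets for a duration model. A toy sketch, with invented frame labels:

```python
from itertools import groupby

# Frame-level phone labels, as an aligner might produce them
# (invented data: each entry is the phone occupying one frame).
frame_labels = ["sil", "sil", "k", "k", "k", "ae", "ae", "t", "sil"]

# Run-length encode: one (phone, duration-in-frames) pair per segment.
durations = [(phone, len(list(run))) for phone, run in groupby(frame_labels)]
print(durations)
# [('sil', 2), ('k', 3), ('ae', 2), ('t', 1), ('sil', 1)]
```

At training time, a duration model simply learns to predict these integers from linguistic features of each phone, using an ordinary regression loss.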