Module status: ready
Before you start this module, you should complete the Essential readings for all the previous modules. So, if you haven’t yet done all of them, now would be a great time to catch up. You need to build that foundation before moving up to the more advanced material in this module.
Also make sure you have watched all the videos up to this point. You may find it helpful to watch the optional video “What is ‘end-to-end’ speech synthesis?” in the second video tab of Module 8 before moving on.
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class. It is therefore doubly important to do the Essential readings beforehand.
In this module: sequence-to-sequence models (encoder-decoder architecture)
A widely used approach is to treat text-to-speech as a sequence-to-sequence problem, and the most common choice of model is a neural network with an encoder-decoder architecture. To understand encoder-decoder models, it is helpful to think in terms of solving three problems:
- regression from input to output
- alignment of the input and output sequences, during training
- duration prediction of the output sequence, during inference (synthesis)
In an encoder-decoder architecture, the encoder accepts the input sequence and transforms it (i.e., performs regression) into a learned internal representation. The decoder accepts this representation and transforms it (i.e., performs regression) into the output sequence.
The model designer must decide on the representations of the input and output sequences, but the model learns the internal representation. Problem 1 is solved by the encoder and the decoder: input → learned internal representation → output
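To make Problem 1 concrete, here is a minimal sketch in numpy. All names, dimensions, and the use of single linear layers are illustrative assumptions, not any particular system: the point is only that the encoder and the decoder are each a learned regression, joined by an internal representation that the designer never specifies by hand.

```python
# Sketch of Problem 1: input -> learned internal representation -> output.
# Shapes and layer choices are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

IN_DIM, HIDDEN_DIM, OUT_DIM = 8, 16, 80  # e.g. symbol embeddings in, mel bins out

# In a real model these weights are learned; here they are fixed random values.
W_enc = rng.normal(size=(IN_DIM, HIDDEN_DIM))   # encoder regression weights
W_dec = rng.normal(size=(HIDDEN_DIM, OUT_DIM))  # decoder regression weights

def encode(x):
    """Regress the input sequence onto the learned internal representation."""
    return np.tanh(x @ W_enc)   # shape (num_input_steps, HIDDEN_DIM)

def decode(h):
    """Regress the internal representation onto the output sequence."""
    return h @ W_dec            # shape (num_steps, OUT_DIM)

x = rng.normal(size=(5, IN_DIM))  # a sequence of 5 input symbols (e.g. phones)
y = decode(encode(x))
print(y.shape)                    # (5, 80)
```

Notice that, as written, the output has exactly one frame per input symbol: nothing here changes the timescale. That is precisely what Problems 2 and 3 are about.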
The difference in timescales between the input and output sequences is handled between the encoder and the decoder. One class of models (e.g., Tacotron 2) treats Problems 2 and 3 as a single problem and solves them with one neural mechanism called attention. Another class of models (e.g., FastPitch) uses separate mechanisms: one for aligning the sequences during training, and an explicit duration predictor during inference. Whilst it appears elegant to solve two problems with a single mechanism, alignment is actually a different problem from duration prediction.
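The two classes of timescale mechanism can be sketched side by side. These are toy illustrations, not the Tacotron 2 or FastPitch implementations; all names and shapes are assumptions. In (a), each decoder step computes a soft alignment (a weighted sum) over the encoder output; in (b), a per-symbol duration simply repeats each encoder output the predicted number of times, often called a length regulator.

```python
# Two sketches of the mechanism between encoder and decoder (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 16))   # encoder output: 5 input symbols, 16-dim each

# (a) Attention: at each decoder step, a soft alignment over all input symbols.
def attention_context(h, query):
    scores = h @ query                    # one score per input symbol
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax -> soft alignment weights
    return weights @ h                    # weighted sum = context vector

query = rng.normal(size=16)               # in practice, from the decoder state
context = attention_context(h, query)     # shape (16,)

# (b) Explicit durations: repeat each encoder output for its predicted number
# of output frames, upsampling to the output timescale.
durations = np.array([3, 5, 2, 4, 1])     # frames per symbol (here hand-picked)
upsampled = np.repeat(h, durations, axis=0)
print(context.shape, upsampled.shape)     # (16,) (15, 16)
```

The contrast makes the final point of this section visible: (a) produces a soft alignment afresh at every decoder step, whereas (b) commits to hard durations up front, so the output length (here 15 frames) is known before decoding begins.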