Module status: ready
Before you start this module, you should complete the Essential readings for all the previous modules. So, if you haven’t yet done all of them, now would be a great time to catch up. You need to build that foundation before moving on to the more advanced material in this module.
Also make sure you have watched all the videos up to this point. You may find it helpful to watch the optional video “What is ‘end-to-end’ speech synthesis?” in the second video tab of Module 8 before moving on.
Because we are now covering very recent developments, which change every year, there are no videos for this module. We’ll cover everything in class. It is therefore doubly important to do the Essential readings beforehand.
In this module: sequence-to-sequence models (encoder-decoder architecture)
A widely-used approach is to treat text-to-speech as a sequence-to-sequence problem, and the most common choice of model is a neural network with an encoder-decoder architecture. To understand encoder-decoder models, it is helpful to think in terms of solving three problems:
- regression from input to output
- alignment of the input and output sequences, during training
- duration prediction of the output sequence, during inference (synthesis)
In an encoder-decoder architecture, the encoder accepts the input sequence and transforms it (i.e., performs regression) into a learned internal representation. The decoder accepts this representation and transforms it (i.e., performs regression) into the output sequence.
The model designer must decide on the representations of the input and output sequences, but the model learns the internal representation. Problem 1 is solved by the encoder and the decoder: input → learned internal representation → output
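To make Problem 1 concrete, here is a minimal numpy sketch of the regression chain input → learned internal representation → output. It is purely illustrative and not taken from any of the models in the readings: the dimensions are arbitrary choices, and the weights are random here, whereas in a real model they would be learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from any real model):
# 5 input positions (e.g. phone embeddings of dim 8), a 16-dim internal
# representation, and an 80-bin mel spectrogram vector per output position.
x = rng.standard_normal((5, 8))        # input sequence

W_enc = rng.standard_normal((8, 16))   # encoder weights (learned, in practice)
W_dec = rng.standard_normal((16, 80))  # decoder weights (learned, in practice)

h = np.tanh(x @ W_enc)   # encoder: regression to the learned internal representation
y = h @ W_dec            # decoder: regression to the output sequence

print(h.shape, y.shape)  # (5, 16) (5, 80)
```

Note that this sketch produces exactly one output vector per input position; it says nothing yet about the difference in length between the input and output sequences, which is Problems 2 and 3.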
The difference in timescales between the input and output sequences is taken care of in between the encoder and decoder. One class of models (e.g., Tacotron 2) treats Problems 2 and 3 as the same thing and solves them using a single neural mechanism called attention. Another class of models (e.g., FastPitch) uses separate mechanisms for Problems 2 and 3. Whilst it appears elegant to solve two problems with a single mechanism, the problem of alignment is actually different from the problem of duration prediction.
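In the second class of models, the mechanism that bridges the timescales at inference is an explicit upsampling step (called a “length regulator” in FastSpeech/FastPitch): each encoder output is simply repeated for its predicted number of output frames. Here is a minimal numpy sketch of that idea; the values are made up for illustration, and in a real model the durations would come from a learned duration predictor.

```python
import numpy as np

def length_regulate(h, durations):
    """Upsample encoder outputs by repeating each one for its predicted
    number of output frames (a sketch of the 'length regulator' idea)."""
    return np.repeat(h, durations, axis=0)

h = np.arange(3 * 2, dtype=float).reshape(3, 2)  # 3 encoder outputs, dim 2
durations = np.array([2, 1, 3])                   # predicted frames per input

out = length_regulate(h, durations)
print(out.shape)  # (6, 2): total output frames = sum of durations
```

During training, the durations must instead come from an alignment between the input and output sequences, which is exactly why Problem 2 (alignment, at training time) and Problem 3 (duration prediction, at inference time) are different problems.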
Reading
Shen et al. Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.
Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
This week’s class will develop sequence-to-sequence models by improving on the simple DNN from the previous module. After covering the necessary neural building blocks, we will do two case studies based on the Essential readings. Make sure you have read both papers beforehand, and bring copies with you to class.
Download the slides for the class on 2026-03-03
The slides are now the post-class version. The case study on FastPitch was not covered in class, so it is set as homework. We will quickly go over this in the next class.
You need to search for example model output on your own, as part of exploring the literature. You should have found the official Google Tacotron 2 samples page. You can listen to samples comparing FastPitch and Tacotron at https://fastpitch.github.io/ — for the state-of-the-art models coming up in the remainder of the course, first read the paper, then search for samples to listen to (sometimes the paper will have a link to a demo page).
The remainder of the course will cover the state of the art. We will be reading recent research papers, so make sure to allow plenty of time to do that before the class: these are challenging papers and will require multiple read-throughs.
Roadmap:
- Neural speech processing (vocoders; audio codecs; representation learning)
  - we need to revisit representations of both text and speech; the key advance will be to find a discrete representation of speech
- Large Speech Language Models
  - a discrete representation of speech will enable us to use models that can only generate discrete representations: language models
- Beyond Text-to-Speech (cloning, conversion, anonymisation,…)
  - yes, there is more to life than TTS! We don’t have to limit ourselves to textual input!