Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.

Wu et al: Deep neural networks employing Multi-Task Learning…

Some straightforward but effective techniques for improving the performance of speech synthesis using simple feedforward networks.

Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?

A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences between HMM-based and DNN-based synthesis: the regression model, state vs. frame predictions, and separate vs. combined stream predictions.

Taylor – Section 12.7 – Pitch and epoch detection

Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.

Taylor – Section 12.4 – Linear-Prediction Analysis

An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
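As a rough illustration of what linear-prediction analysis computes, the sketch below estimates the coefficients of an all-pole filter from a single frame using the autocorrelation method and the Levinson-Durbin recursion. It is a minimal sketch and is not taken from Taylor's text: the function name `lpc`, the sign convention A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p, and the toy second-order filter are all assumptions made for this example, and a real analysis would normally window and pre-emphasise the frame first.

```python
import numpy as np

def lpc(frame, order):
    """Estimate linear-prediction coefficients (autocorrelation method).

    Returns (a, err): a[0] == 1 and a[1:] are the coefficients of the
    prediction-error filter A(z); err is the residual energy.
    Assumes the frame is not silent (non-zero energy).
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # Levinson-Durbin update of the lower-order coefficients
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Toy check: impulse response of the (hypothetical) all-pole filter
# 1 / (1 - 1.3 z^-1 + 0.6 z^-2), then recover its coefficients.
h = np.zeros(200)
h[0] = 1.0
h[1] = 1.3 * h[0]
for t in range(2, len(h)):
    h[t] = 1.3 * h[t - 1] - 0.6 * h[t - 2]

coeffs, residual_energy = lpc(h, order=2)
```

The recovered `coeffs` describe the vocal-tract-like filter, and what remains after inverse filtering (the residual) plays the role of the source signal.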

Taylor – Chapter 15 – Hidden-Markov-model synthesis

Written from a traditional “starting from automatic speech recognition” viewpoint, so you will need to make the connections for yourself to the more general concept of text-to-speech as a regression problem.

Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.
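The idea of "tiling" a trajectory can be sketched with a toy example. This is not the method from the paper: it assumes fixed-length segments, a plain mean-squared distance, and no concatenation cost or cross-fading, and the names `tile_trajectory` and `fragments` are hypothetical. It shows only the core step of covering a model-generated trajectory with the closest-matching waveform fragments from a database.

```python
import numpy as np

def tile_trajectory(trajectory, fragments, seg_len):
    """Cover a predicted feature trajectory with waveform fragments.

    trajectory: array of shape (frames, dims) generated by the model.
    fragments:  list of (features, waveform) pairs, where features has
                shape (seg_len, dims) and waveform is a 1-D array.
    Returns the chosen fragment indices and the concatenated waveform.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    chosen = []
    for start in range(0, len(trajectory) - seg_len + 1, seg_len):
        target = trajectory[start:start + seg_len]
        # Distance between this stretch of trajectory and each fragment
        dists = [np.mean((np.asarray(feats, dtype=float) - target) ** 2)
                 for feats, _ in fragments]
        chosen.append(int(np.argmin(dists)))
    waveform = np.concatenate([np.asarray(fragments[i][1], dtype=float)
                               for i in chosen])
    return chosen, waveform

# Toy database of two fragments with one-dimensional features.
fragments = [
    (np.array([[1.0], [1.0]]), np.array([1.0, 1.0, 1.0])),
    (np.array([[0.0], [0.1]]), np.array([0.0, 0.0])),
]
trajectory = np.array([[0.0], [0.0], [1.0], [1.0]])
indices, waveform = tile_trajectory(trajectory, fragments, seg_len=2)
```

The trajectory is used only to guide selection. The output is made of natural waveform fragments rather than vocoded speech.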

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.

Norrenbrock et al: Quality prediction of synthesised speech…

Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.