Wu et al: Deep neural networks employing Multi-Task Learning…

Some straightforward but effective techniques for improving the performance of speech synthesis with simple feedforward networks.
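
To make the idea concrete, here is a minimal sketch of the multi-task idea: a feedforward acoustic model with a shared trunk and two output heads, so that the secondary task regularises the shared hidden layers. This is written in PyTorch with illustrative dimensions and an invented secondary task, not Wu et al.'s exact architecture.

    import torch
    import torch.nn as nn

    class MultiTaskDNN(nn.Module):
        def __init__(self, in_dim=600, hidden=1024, main_dim=187, aux_dim=60):
            super().__init__()
            # shared trunk: the layers that both tasks train
            self.trunk = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.main_head = nn.Linear(hidden, main_dim)  # main task: vocoder parameters
            self.aux_head = nn.Linear(hidden, aux_dim)    # secondary task (illustrative)

        def forward(self, x):
            h = self.trunk(x)
            return self.main_head(h), self.aux_head(h)

    model = MultiTaskDNN()
    x = torch.randn(8, 600)                     # a batch of linguistic feature vectors
    y_main, y_aux = model(x)
    loss = nn.functional.mse_loss(y_main, torch.randn_like(y_main)) \
         + 0.5 * nn.functional.mse_loss(y_aux, torch.randn_like(y_aux))
    loss.backward()                             # gradients from both tasks reach the trunk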

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences between HMM- and DNN-based synthesis: the regression model, state-level vs. frame-level predictions, and separate vs. combined stream predictions.

Taylor – Section 12.7 – Pitch and epoch detection

Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.
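
As a flavour of the simplest approach, here is a toy autocorrelation F0 estimator for a single voiced frame. A real pitch tracker also needs a voicing decision, octave-error handling and smoothing across frames, which is exactly why the task is harder than it looks.

    import numpy as np

    def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
        frame = frame - frame.mean()
        # autocorrelation for non-negative lags
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / f0_max), int(fs / f0_min)
        lag = lo + np.argmax(ac[lo:hi])   # strongest peak in the plausible lag range
        return fs / lag

    fs = 16000
    t = np.arange(int(0.04 * fs)) / fs    # one 40 ms frame
    frame = np.sin(2 * np.pi * 120 * t)   # synthetic "voiced" frame at 120 Hz
    print(estimate_f0(frame, fs))         # ~120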

Taylor – Section 12.4 – Linear-Prediction Analysis

An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
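
As a minimal sketch of the autocorrelation method: each sample is predicted as a weighted sum of the previous p samples, and the predictor coefficients, which define the all-pole vocal tract filter, come from solving the Toeplitz normal equations. The order and test signal below are illustrative.

    import numpy as np

    def lpc(frame, p=16):
        frame = frame * np.hamming(len(frame))
        # biased autocorrelation at lags 0..p
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
        R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
        R += 1e-6 * r[0] * np.eye(p)           # tiny regularisation for numerical safety
        return np.linalg.solve(R, r[1:p + 1])  # s[n] ~ sum_k a[k] * s[n-k-1]

    fs = 16000
    t = np.arange(400) / fs               # one 25 ms frame
    frame = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)
    print(lpc(frame, p=8))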

Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable in quality to Tacotron, but faster to train because it uses non-recurrent (fully convolutional) architectures for the encoder and decoder.
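
The "guided attention" part is a simple penalty on attention weights that lie far off the diagonal, exploiting the fact that the text-to-speech alignment is roughly monotonic. A sketch of that loss in numpy (g = 0.2 is the value suggested in the paper):

    import numpy as np

    def guided_attention_loss(A, g=0.2):
        N, T = A.shape                    # A[n, t]: weight on text position n at frame t
        n = np.arange(N)[:, None] / N
        t = np.arange(T)[None, :] / T
        W = 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))  # large far from the diagonal
        return np.mean(A * W)

    A = np.full((50, 200), 1.0 / 50)      # a uniform (badly aligned) attention matrix
    print(guided_attention_loss(A))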

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves on FastSpeech by dropping the complicated teacher-student training regime and instead training directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
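
At the heart of FastSpeech-style models is a length regulator: each encoder output is repeated according to a predicted phone duration, replacing attention-based alignment. A toy numpy sketch (the duration predictor itself, a small network trained on externally extracted durations, is omitted):

    import numpy as np

    def length_regulate(encoder_out, durations):
        # encoder_out: (num_phones, dim); durations: frames per phone
        return np.repeat(encoder_out, durations, axis=0)

    enc = np.random.randn(4, 8)           # 4 phones, 8-dim hidden states
    dur = np.array([3, 5, 2, 6])          # predicted durations, in frames
    frames = length_regulate(enc, dur)
    print(frames.shape)                   # (16, 8): one hidden state per output frame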

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.
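
A hedged sketch of the tiling idea, not Qian et al.'s exact search: given a guide trajectory from the statistical model, choose one waveform tile per segment by dynamic programming over a target cost (distance to the guide) plus a join cost between consecutive tiles.

    import numpy as np

    def tile_search(guide, candidates, join_cost, w=1.0):
        # guide: one target feature vector per segment
        # candidates[i]: array (K_i, dim) of tile features for segment i
        cost = [np.linalg.norm(candidates[0] - guide[0], axis=1)]
        back = []
        for i in range(1, len(guide)):
            target = np.linalg.norm(candidates[i] - guide[i], axis=1)
            # total[j, k]: best cost of reaching tile k via previous tile j
            total = cost[-1][:, None] + w * join_cost(candidates[i - 1], candidates[i])
            back.append(np.argmin(total, axis=0))
            cost.append(target + np.min(total, axis=0))
        path = [int(np.argmin(cost[-1]))]
        for b in reversed(back):          # trace the best path backwards
            path.append(int(b[path[-1]]))
        return path[::-1]

    def euclid_join(prev, cur):           # toy join cost: feature mismatch at the boundary
        return np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)

    guide = [np.zeros(3), np.ones(3), 2 * np.ones(3)]
    cands = [np.random.randn(5, 3) + g for g in guide]
    print(tile_search(guide, cands, euclid_join))  # one tile index per segment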

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and statistical parametric speech synthesis (SPSS) is to alternate between waveform fragments and vocoder-generated waveforms.

Norrenbrock et al: Quality prediction of synthesised speech…

Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.

Mayo et al: Multidimensional scaling of listener responses to synthetic speech

Multidimensional scaling is a way to uncover the different perceptual dimensions that listeners use when rating synthetic speech.
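
A brief sketch of how such a study proceeds, using scikit-learn; the dissimilarity matrix here is random, whereas in the paper it derives from listener judgements. Pairwise dissimilarities between stimuli are embedded in a low-dimensional space whose axes can then be interpreted as perceptual dimensions.

    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(0)
    d = rng.random((10, 10))              # toy dissimilarities between 10 stimuli
    d = (d + d.T) / 2                     # a dissimilarity matrix must be symmetric...
    np.fill_diagonal(d, 0.0)              # ...with zero self-dissimilarity

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(d)         # 2-D "perceptual space"
    print(coords.shape)                   # (10, 2)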

King: Measuring a decade of progress in Text-to-Speech

A distillation of the key findings of the first 10 years of the Blizzard Challenge.

King et al: Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest: an example of a system using a heterogeneous inventory of unit types, developed shortly before Hunt & Black published their influential unit-selection paper.