Shen et al: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences between HMM-based and DNN-based synthesis: the regression model, state vs. frame predictions, and separate vs. combined stream predictions.

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.

Wu et al: Deep neural networks employing Multi-Task Learning…

Some straightforward but effective techniques for improving the performance of speech synthesis using simple feedforward networks.

Zen et al: Statistical parametric speech synthesis using deep neural networks

The first paper to re-introduce the use of (deep) neural networks in speech synthesis.

Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation

A key review article.

Gurney: An introduction to neural networks

Somewhat old, but may help clarify some of the basic concepts if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.

Nielsen: Neural Networks and Deep Learning

A great introduction. Relatively light on maths, and with some interactive explanations.

Zen, Black & Tokuda: Statistical parametric speech synthesis

A review article that makes some useful connections between HMM-based speech synthesis and unit selection.

Taylor – Chapter 15 – Hidden-Markov-model synthesis

Written from a traditional “starting from automatic speech recognition” viewpoint, so you will need to make the connections for yourself to the more general concept of text-to-speech as a regression problem.

King: A beginners’ guide to statistical parametric speech synthesis

A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully chosen bibliography is worth following up.