Zen, Black & Tokuda: Statistical parametric speech synthesis

A review article that makes some useful connections between HMM-based speech synthesis and unit selection.

Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It takes a typical frame-by-frame approach, pre-dating sequence-to-sequence models.

Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?

A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.

Taylor – Chapter 15 – Hidden-Markov-model synthesis

This chapter is written from a traditional “starting from automatic speech recognition” viewpoint, so you will need to make your own connections to the more general concept of text-to-speech as a regression problem.

Nielsen: Neural Networks and Deep Learning

A great introduction. Relatively light on maths, and with some interactive explanations.

Kawahara et al: Restructuring speech representations…

The key paper about the STRAIGHT vocoder, which was originally intended for manipulating recorded natural speech.

Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis

A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).

Gurney: An introduction to neural networks

Somewhat old, but might be helpful in getting some of the basic concepts clear, if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.

Fitt & Isard: Synthesis of regional English using a keyword lexicon

An extension and practical application of Wells’ keyvowels idea, which enables efficient generation of a pronunciation dictionary tailored to a specific accent or speaker.

Clark et al: Statistical analysis of the Blizzard Challenge 2007 listening test results

Explains the types of statistical tests that are employed in the Blizzard Challenge. These are deliberately quite conservative. For example, MOS data is correctly treated as ordinal. Also includes a section on Multi-Dimensional Scaling (MDS), a type of analysis that is not as widely used as the others.

Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser

Discusses some of the design choices made when writing Festival’s unit selection engine (Multisyn) and the tools for building new voices.

Benoît et al: The SUS test

A method for evaluating the intelligibility of synthetic speech using Semantically Unpredictable Sentences (SUS), which avoids the ceiling effect.