Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It takes a typical frame-by-frame approach, pre-dating sequence-to-sequence models.
Watts et al: Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Gurney: An introduction to neural networks
Somewhat old, but might be helpful in getting some of the basic concepts clear, if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.
Nielsen: Neural Networks and Deep Learning
A great introduction. Relatively light on maths, and with some interactive explanations.
Zen, Black & Tokuda: Statistical parametric speech synthesis
A review article that makes some useful connections between HMM-based speech synthesis and unit selection.
Taylor – Chapter 15 – Hidden-Markov-model synthesis
Written from a traditional “starting from automatic speech recognition” viewpoint, so you will need to make the connections for yourself to the more general concept of text-to-speech as a regression problem.
Kawahara et al: Restructuring speech representations…
The key paper about the STRAIGHT vocoder, which was originally intended for manipulating recorded natural speech.
Clark et al: Statistical analysis of the Blizzard Challenge 2007 listening test results
Explains the types of statistical tests that are employed in the Blizzard Challenge. These are deliberately quite conservative. For example, MOS data is correctly treated as ordinal. Also includes a section on Multi-Dimensional Scaling (MDS), a type of analysis that is not as widely used as the others.
Benoît et al: The SUS test
A method for evaluating the intelligibility of synthetic speech, which avoids the ceiling effect.
Fitt & Isard: Synthesis of regional English using a keyword lexicon
An extension and practical application of Wells’ keyvowels idea, which enables efficient generation of a pronunciation dictionary tailored to a specific accent or speaker.
Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis
A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).
Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser
Discusses some of the design choices made when writing Festival’s unit selection engine (Multisyn) and the tools for building new voices.