A key review article.
Gurney: An introduction to neural networks
Somewhat old, but might help to clarify some of the basic concepts if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.
Nielsen: Neural Networks and Deep Learning
A great introduction. Relatively light on maths, and with some interactive explanations.
Taylor – Chapter 15 – Hidden-Markov-model synthesis
Written from a traditional “starting from automatic speech recognition” viewpoint. You will need to make the connections yourself to the more general concept of text-to-speech as a regression problem.
Taylor – Section 12.7 – Pitch and epoch detection
Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.
King: Measuring a decade of progress in Text-to-Speech
A distillation of the key findings of the first 10 years of the Blizzard Challenge.
Norrenbrock et al: Quality prediction of synthesised speech…
Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.
Mayo et al: Multidimensional scaling of listener responses to synthetic speech
Multidimensional scaling is a way to uncover the different perceptual dimensions that listeners use when rating synthetic speech.
Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system
A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing.



