A great introduction. Relatively light on maths, and with some interactive explanations.
Mayo et al: Multidimensional scaling of listener responses to synthetic speech
Multidimensional scaling is a way to uncover the different perceptual dimensions that listeners use when rating synthetic speech.
Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation
A key review article.
King: Measuring a decade of progress in Text-to-Speech
A distillation of the key findings of the first 10 years of the Blizzard Challenge.
Stöber et al: Speech synthesis using non-uniform units in the Verbmobil project
Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 6-7)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 6-7.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including ‘Fourier Analysis’).
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing.
Gurney: An introduction to neural networks
Somewhat old, but might be helpful in getting some of the basic concepts clear, if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.
Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system
A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.

