Nielsen: Neural Networks and Deep Learning

A great introduction: relatively light on maths, with some interactive explanations.

Mayo et al: Multidimensional scaling of listener responses to synthetic speech

Multidimensional scaling is a way to uncover the different perceptual dimensions that listeners use when rating synthetic speech.

Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation

A key review article.

King: Measuring a decade of progress in Text-to-Speech

A distillation of the key findings of the first 10 years of the Blizzard Challenge.

King et al: Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 6-7)

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 6-7.

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including ‘Fourier Analysis’).

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing.

Gurney: An introduction to neural networks

Somewhat old, but may help to make some of the basic concepts clear if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.

Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system

A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.