Somewhat dated, but may help clarify some of the basic concepts if you find Nielsen’s “Neural Networks and Deep Learning” too difficult to start with.
Fitt & Isard: Synthesis of regional English using a keyword lexicon
An extension and practical application of Wells’ keyvowels idea, which enables efficient generation of a pronunciation dictionary tailored to a specific accent or speaker.
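To make the keyword idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes a lexicon that stores each vowel as a keyword (Wells' lexical sets, such as TRAP and BATH) instead of a phone, plus a small table per accent that maps keywords to phones. The names `KEYWORD_LEXICON`, `ACCENT_VOWELS`, `pronounce` and `build_lexicon`, the three toy entries and the phone symbols are all hypothetical.

```python
# Hypothetical accent-independent lexicon: consonants are plain phones,
# vowels are keywords (Wells' lexical sets), written in upper case.
KEYWORD_LEXICON = {
    "trap": ["t", "r", "TRAP", "p"],
    "bath": ["b", "BATH", "th"],
    "palm": ["p", "PALM", "m"],
}

# Hypothetical per-accent tables mapping each keyword to a toy phone symbol.
# In RP the BATH vowel patterns with PALM; in General American it patterns
# with TRAP.
ACCENT_VOWELS = {
    "rp": {"TRAP": "a", "BATH": "aa", "PALM": "aa"},
    "general_american": {"TRAP": "a", "BATH": "a", "PALM": "aa"},
}


def pronounce(word, accent):
    """Replace each keyword with the accent's phone; keep other symbols."""
    vowels = ACCENT_VOWELS[accent]
    return [vowels.get(symbol, symbol) for symbol in KEYWORD_LEXICON[word]]


def build_lexicon(accent):
    """Generate a full accent-specific dictionary from the keyword lexicon."""
    return {word: pronounce(word, accent) for word in KEYWORD_LEXICON}


if __name__ == "__main__":
    for accent in ACCENT_VOWELS:
        print(accent, build_lexicon(accent))
```

Under these assumptions, supporting a new accent means writing one small keyword table, not re-transcribing every word.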
Clark et al: Statistical analysis of the Blizzard Challenge 2007 listening test results
Explains the types of statistical tests that are employed in the Blizzard Challenge. These are deliberately quite conservative. For example, MOS data is correctly treated as ordinal. Also includes a section on Multi-Dimensional Scaling (MDS), a technique that is not as widely used as the other types of analysis.
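As a rough illustration of what treating opinion scores as ordinal can look like, the sketch below summarises each system by its median and compares two systems with an exact paired sign test, which uses only the direction of each listener's preference. The sign test is a stand-in chosen because it needs only the standard library; it is not claimed to be the test used in the paper. The scores, the function name `sign_test_p`, the number of comparisons and the Bonferroni correction are all assumptions made for the example.

```python
from math import comb
from statistics import median


def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired ordinal scores."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # ties carry no information and are dropped
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up 1-5 opinion scores from the same ten listeners for two systems.
system_a = [4, 4, 5, 3, 4, 5, 4, 4, 3, 5]
system_b = [3, 3, 4, 3, 3, 4, 3, 4, 2, 4]

p = sign_test_p(system_a, system_b)
n_comparisons = 3              # hypothetical number of system pairs compared
alpha = 0.05 / n_comparisons   # Bonferroni correction (assumed here)

if __name__ == "__main__":
    print("medians:", median(system_a), median(system_b))
    print("p-value:", p, "significant:", p < alpha)
```

Medians and rank-based or sign-based tests avoid assuming that the gap between scores of 2 and 3 is the same size as the gap between 4 and 5, which is what taking a mean would imply.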
Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system
A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.
Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser
Discusses some of the design choices made when writing Festival’s unit selection engine (Multisyn) and the tools for building new voices.
Benoît et al: The SUS test
A method for evaluating the intelligibility of synthetic speech using Semantically Unpredictable Sentences (SUS), which avoids the ceiling effect.
Bennett: Large Scale Evaluation of Corpus-based Synthesisers
An analysis of the first Blizzard Challenge, which is an evaluation of speech synthesisers using a common database.