Paul Taylor “Text-to-speech synthesis”, 2009, Cambridge University Press, Cambridge, ISBN 0521899273
Taylor - Chapter 3 - The text-to-speech problem
Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.
Taylor - Chapter 6 - Prosody prediction from text
Predicting phrasing, prominence, intonation and tune, from text input.
Taylor - Chapter 8 - Pronunciation
Including how the lexicon is stored, letter-to-sound, and compressing the lexicon.
Taylor - Chapter 10 - Signals and filters
Focus on the concepts and diagrams, and don't worry too much about understanding the maths.
Taylor - Section 10.1 - Analogue signals
It's easier to start by understanding physical signals - which are analogue - before we then approximate them digitally.
Taylor - Section 10.2 - Digital signals
Going digital involves approximations in the way an original analogue signal is represented.
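As a rough illustration of the two approximations involved (sampling in time and quantisation in amplitude), here is a minimal sketch in plain Python. It is not taken from the book: the function name, the 100 Hz sine wave, the 16 kHz sampling rate and the 8-bit depth are all arbitrary choices made for this example.

```python
import math

def sample_and_quantise(freq_hz, sample_rate_hz, duration_s, n_bits):
    """Sample a unit-amplitude sine wave and round each sample to n_bits.

    Hypothetical helper for illustration only.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    levels = 2 ** (n_bits - 1) - 1           # largest positive integer code
    exact, quantised = [], []
    for n in range(n_samples):
        t = n / sample_rate_hz               # sampling: values exist only at multiples of 1/fs
        x = math.sin(2 * math.pi * freq_hz * t)
        q = round(x * levels) / levels       # quantisation: amplitude snaps to a finite grid
        exact.append(x)
        quantised.append(q)
    return exact, quantised

exact, quantised = sample_and_quantise(100.0, 16000, 0.01, 8)
max_error_8bit = max(abs(a - b) for a, b in zip(exact, quantised))

# Fewer bits means a coarser amplitude grid and a larger worst-case error.
exact4, quantised4 = sample_and_quantise(100.0, 16000, 0.01, 4)
max_error_4bit = max(abs(a - b) for a, b in zip(exact4, quantised4))
```

The quantisation error is bounded by half a step of the amplitude grid, so adding bits shrinks it; the sampling rate separately limits which frequencies can be represented at all.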
Taylor - Chapter 12 - Analysis of speech signals
Includes spectral envelope extraction (cepstrum or LPC), source representation (the residual), pitch tracking and pitch marking.
Taylor - Section 12.3 - The cepstrum
By using the logarithm to convert a multiplication into a sum, the cepstrum separates the source and filter components of…
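The "multiplication becomes a sum" idea can be checked numerically. The sketch below is a toy written for this list, not the book's method: it uses a naive DFT, circular convolution (so that the spectra multiply exactly) and two made-up 8-sample sequences standing in for a source and a filter. All names are assumptions.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for very short sequences)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[m] * cmath.exp(2j * math.pi * m * k / n)
                for m in range(n)).real / n
            for k in range(n)]

def circular_convolve(a, b):
    """Circular convolution: its DFT is the product of the two DFTs."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

# Toy stand-ins, chosen so that neither spectrum contains a zero.
source = [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
filt = [0.6 ** k for k in range(8)]

signal = circular_convolve(source, filt)
cep_signal = real_cepstrum(signal)
cep_sum = [s + f for s, f in zip(real_cepstrum(source), real_cepstrum(filt))]
max_diff = max(abs(a - b) for a, b in zip(cep_signal, cep_sum))
```

Here `max_diff` is zero up to floating-point error: the cepstrum of the combined signal is the sum of the cepstra of its two parts, which is what makes it possible to pull them apart afterwards.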
Taylor - Section 12.4 - Linear-Prediction Analysis
An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
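To make the idea of "predicting each sample from the previous few" concrete, here is a minimal sketch of one common approach, the autocorrelation method solved with the Levinson-Durbin recursion. It is an illustration under stated assumptions rather than the book's own code: the function names are invented, and the test signal is the impulse response of a made-up second-order all-pole filter with coefficients 1.3 and -0.6.

```python
def all_pole_impulse_response(coeffs, length):
    """Impulse response of x[n] = sum_k coeffs[k-1] * x[n-k] + impulse[n]."""
    x = []
    for n in range(length):
        sample = 1.0 if n == 0 else 0.0
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                sample += a * x[n - k]
        x.append(sample)
    return x

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order], with x[n] ~= sum_k a[k] * x[n-k],
    and the final prediction error energy."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                      # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

true_coeffs = [1.3, -0.6]                    # hypothetical filter, poles inside the unit circle
x = all_pole_impulse_response(true_coeffs, 200)
estimated, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
```

Because the toy signal really was produced by an all-pole filter, `estimated` comes out very close to `true_coeffs`. Real speech is only approximately all-pole, so the recovered filter is an approximation to the vocal tract and the prediction error is what remains as the residual.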
Taylor - Section 12.7 - Pitch and epoch detection
Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are…
Taylor - Chapter 15 - Hidden-Markov-model synthesis
Written from a traditional "starting from automatic speech recognition" viewpoint, so you will need to make the connections for yourself to the more general concept of text-to-speech as a regression problem.
Taylor - Chapter 16 - Unit-selection synthesis
A substantial chapter covering target cost, join cost and search.
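The way target cost, join cost and search fit together can be sketched as a small dynamic-programming (Viterbi-style) search. This is a toy under loud assumptions, not the chapter's formulation: each unit is reduced to a single made-up pitch value, and both cost functions and all names are hypothetical.

```python
def unit_selection_search(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target position, minimising the sum of
    target costs plus the join costs between consecutive choices."""
    # best[i][j] = (cheapest total cost of a path ending at candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1]))
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Hypothetical costs: units are described only by a pitch value in Hz.
def target_cost(target_pitch, unit_pitch):
    return abs(target_pitch - unit_pitch)

def join_cost(left_pitch, right_pitch):
    return 0.5 * abs(left_pitch - right_pitch)

targets = [100, 110, 120]
candidates = [[100, 90], [110, 95], [80, 118]]
path, total = unit_selection_search(targets, candidates, target_cost, join_cost)
```

The search keeps, for every candidate, only the cheapest way of reaching it, so the amount of work grows with the number of candidate pairs at each join rather than with the number of complete sequences.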
Taylor - Chapter 17 - Further issues
Databases, evaluation, audio-visual synthesis and expressive speech.
Taylor - Section 17.2 - Evaluation
Testing of the system by the developers, as well as via listening tests.