Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.

Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by a different author.

Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction

Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).

King et al. Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.

Jurafsky & Martin (3rd Ed) – Hidden Markov models

An overview of Hidden Markov Models, the Viterbi algorithm, and the Baum-Welch algorithm.
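
As a rough orientation before reading, the Viterbi algorithm can be sketched in a few lines of Python. This is a minimal illustration and not code from the chapter: the function name, the dictionary-based model layout, and the toy states "A"/"B" and observations "x"/"y"/"z" are all assumptions made for the example.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs, and its probability.

    The model layout (nested dicts of probabilities) is an assumption
    made for this sketch.
    """
    # trellis[t][s] = (probability of the best path ending in s at time t,
    #                  previous state on that path)
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = trellis[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            best = max(states, key=lambda r: prev[r][0] * trans_p[r][s])
            column[s] = (prev[best][0] * trans_p[best][s] * emit_p[s][o], best)
        trellis.append(column)

    # Backtrace from the most probable final state.
    last = max(states, key=lambda s: trellis[-1][s][0])
    best_prob = trellis[-1][last][0]
    path = [last]
    for column in reversed(trellis[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best_prob


# Toy two-state model (hypothetical numbers, for illustration only).
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
}
path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Real implementations usually work with log probabilities to avoid numerical underflow on long sequences; the sketch multiplies raw probabilities only to stay close to the textbook recursion.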

Wayland (Phonetics) – Chapter 9 – Hearing

Introduces basic concepts in human hearing – it may be useful to read the bits on decibels/loudness and the Mel and Bark scales.
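
For readers who like to see the scales as formulas, the sketch below shows one common way to compute them. The specific formulas are assumptions chosen for illustration (a widely used mel formula, Traunmüller's approximation for Bark, and sound pressure level relative to 20 µPa); the chapter may present different variants, and the function names are hypothetical.

```python
import math

P_REF = 20e-6  # reference pressure in pascals (20 µPa), the usual SPL reference


def pressure_to_db_spl(pressure_pa):
    """Sound pressure level in dB relative to 20 µPa."""
    return 20.0 * math.log10(pressure_pa / P_REF)


def hz_to_mel(f_hz):
    """One common mel-scale formula; other variants exist."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Traunmüller's approximation to the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


# Doubling the pressure adds about 6 dB; 1000 Hz maps to roughly 1000 mel.
print(pressure_to_db_spl(2 * P_REF))
print(hz_to_mel(1000.0))
print(hz_to_bark(1000.0))
```

Both frequency scales are close to linear at low frequencies and compress higher frequencies, which is the property the reading discusses.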

Wayland (Phonetics) – Chapter 5 – Phonemic and Morphophonemic Analysis

An introduction to the concept of phonemes, allophones and some common phonological alternations.

Plag (2003) – Word formation in English: Chapter 1 Basic Concepts

An introductory text on word structure/morphology in English. Useful to read if you do not have a background in linguistics.

Johnson (Phonetics) – Chapter 6.1 – Tube models of vowel production

Derives the resonances and formant structures of vowels using two- and three-tube models of the vocal tract.

Wayland (Phonetics) – Chapter 8 – Acoustic Properties of Vowels and Consonants

An overview of the acoustic properties of vowels and consonants.

Johnson (Phonetics) – Chapter 2 – The Acoustic Theory of Speech Production: Deriving Schwa

Derives the acoustic features of the vocal tract in terms of the source-filter model.
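
The standard result for schwa can be previewed numerically. The sketch below assumes the usual idealisation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / 4L. The tube length (17.5 cm) and speed of sound (35000 cm/s) are typical textbook values assumed here, and the function name is hypothetical.

```python
def uniform_tube_formants(length_cm=17.5, speed_of_sound_cm_s=35000.0, count=3):
    """Resonances (Hz) of a tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L), for n = 1, 2, 3, ...
    """
    return [
        (2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


formants = uniform_tube_formants()
print(formants)  # [500.0, 1500.0, 2500.0]
```

A shorter tube raises every resonance proportionally, which is one reason formant frequencies differ between speakers with different vocal tract lengths.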