Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes are generated conditionally, given a preceding sequence of text (and a speech prompt).

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec

There are various other similar neural codecs, including Encodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.

Taylor – Section 12.4 – Linear-Prediction Analysis

An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.

Schaedler – Seeing Circles, Sines and Signals

A very nice concise primer on the basic components of digital signal processing with great visual demonstrations.

Wayland (Phonetics) – Chapter 1 – Speech Articulation: Manner and Place

A concise introduction to articulatory phonetics

Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.

Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch 2, which was released around the same by different authors.

Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction

Very similar to FastSpeech2, FastPitch has the advantage of an official Open Source implementation by the author (at NVIDIA).

King et al: Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.

Jurafsky & Martin (3rd Ed) – Hidden Markov models

An overview of Hidden Markov Models, the Viterbi algorithm, and the Baum-Welch algorithm

Wayland (Phonetics) – Chapter 9 – Hearing

Introduces basic concepts in human hearing – it may be useful to read the bits on decibels/loudness and the Mel and Bark scales.