Zen et al.: Statistical parametric speech synthesis using deep neural networks

The first paper to re-introduce (deep) neural networks to speech synthesis.

Zeghidour et al.: SoundStream: An End-to-End Neural Audio Codec

There are various similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first, and this journal paper gives the most complete description.

Wang et al.: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task: a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).

Taylor – Section 17.2 – Evaluation

Testing of the system by the developers, as well as via listening tests.

Taylor – Section 17.1 – Databases

Including the important issue of labelling the data.

Taylor – Chapter 3 – The text-to-speech problem

Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.

Taylor – Chapter 16 – Unit-selection synthesis

A substantial chapter covering target cost, join cost and search.
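
As a rough illustration of how target cost, join cost and search fit together, the sketch below runs a Viterbi-style dynamic programming search over candidate units. It is not taken from Taylor's chapter: the function name `select_units`, the use of plain numbers as "units", and the absolute-difference costs are all hypothetical stand-ins chosen to keep the example small.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost) for a list of target specifications.

    targets[i]    : specification for position i (hypothetical: a number)
    candidates[i] : list of database units that could fill position i
    target_cost   : function(target, unit) -> mismatch between spec and unit
    join_cost     : function(prev_unit, unit) -> cost of concatenating the two
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the predecessor on that path)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]

    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach c from any candidate at the previous position.
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], c), prev))
        best.append(row)

    # Start from the cheapest final unit and follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example (assumed costs): units and targets are just numbers.
targets = [100, 110, 120]
candidates = [[90, 105], [100, 130], [118, 125]]
path, total = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: abs(a - b),
)
print(path, total)
```

In a real system the units would be speech segments and both costs would be weighted sums of several sub-costs; only the shape of the search carries over from this toy version.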

Talkin: A Robust Algorithm for Pitch Tracking (RAPT)

The classic algorithm for estimating F0 from speech signals.
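
For readers who want a feel for the problem before reading the paper, here is a minimal sketch of F0 estimation for a single frame using normalised cross-correlation. It is not RAPT itself, which adds dynamic programming across frames and voicing decisions on top of this kind of correlation measure. The function name `estimate_f0`, the search range of 60 to 400 Hz, and the 0.9 peak threshold are assumptions made for illustration.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0, rel_threshold=0.9):
    """Estimate F0 in Hz for one frame; return 0.0 if no usable peak is found."""
    lag_lo = int(sample_rate / f0_max)
    lag_hi = min(int(sample_rate / f0_min), len(frame) - 2)
    if lag_hi - lag_lo < 2:
        return 0.0

    # Normalised cross-correlation between the frame and a delayed copy of itself.
    scores = []
    for lag in range(lag_lo, lag_hi + 1):
        n = len(frame) - lag
        num = sum(frame[i] * frame[i + lag] for i in range(n))
        e1 = sum(frame[i] ** 2 for i in range(n))
        e2 = sum(frame[i + lag] ** 2 for i in range(n))
        denom = math.sqrt(e1 * e2)
        scores.append(num / denom if denom > 0.0 else 0.0)

    peak = max(scores)
    if peak <= 0.0:
        return 0.0

    # Take the shortest lag that is a local maximum close to the global peak,
    # which reduces the chance of picking a multiple of the true period.
    for k in range(1, len(scores) - 1):
        is_local_max = scores[k] > scores[k - 1] and scores[k] >= scores[k + 1]
        if is_local_max and scores[k] >= rel_threshold * peak:
            return sample_rate / (lag_lo + k)
    return 0.0


# Synthetic test signal: 50 ms of a 200 Hz sine sampled at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * i / sr) for i in range(400)]
f0 = estimate_f0(frame, sr)
print(f0)
```

Real speech is far less clean than this sine wave, which is why a robust tracker needs the additional machinery described in the paper.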

Shen et al.: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Łańcucki: FastPitch: Parallel Text-to-speech with Pitch Prediction

Very similar to FastSpeech 2, but FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).

Ling et al.: Deep Learning for Acoustic Modeling in Parametric Speech Generation

A key review article.

Kominek & Black: CMU ARCTIC databases for speech synthesis

Widely used, copyright-free speech databases for speech synthesis.