Wang et al: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task: a sequence of audio codec codes is generated, conditioned on a preceding sequence of text (and a speech prompt).

Zeghidour et al: SoundStream: An End-to-End Neural Audio Codec

There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.

Łańcucki: FastPitch: Parallel Text-to-Speech with Pitch Prediction

Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).

Shen et al: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Zen et al: Statistical parametric speech synthesis using deep neural networks

The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.

Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation

A key review article.

King: A beginners’ guide to statistical parametric speech synthesis

A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully-chosen bibliography is worth following up.

Talkin: A Robust Algorithm for Pitch Tracking (RAPT)

The classic algorithm for estimating F0 from speech signals.

Bennett: Large Scale Evaluation of Corpus-based Synthesisers

An analysis of the first Blizzard Challenge, which is an evaluation of speech synthesisers using a common database.

Kominek & Black: CMU ARCTIC databases for speech synthesis

Widely used, copyright-free speech databases for speech synthesis.

Taylor – Section 17.2 – Evaluation

Testing of the system by the developers, as well as via listening tests.

Taylor – Section 17.1 – Databases

Including the important issue of labelling the data.