The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are several similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description, in this journal paper.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language-modelling task: a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
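The codec-language-modelling idea can be sketched as a simple autoregressive decoding loop. This is a toy illustration only: the real model is a large Transformer, whereas `next_token` below is a hypothetical stand-in that picks a random code, purely to show how text (and prompt) tokens form the conditioning prefix.

```python
import random

CODEC_VOCAB_SIZE = 1024  # e.g. one RVQ codebook with 1024 entries

def next_token(context, rng):
    """Stand-in for the language model's next-token prediction (hypothetical)."""
    return rng.randrange(CODEC_VOCAB_SIZE)

def synthesise(text_tokens, prompt_codes, n_frames, seed=0):
    rng = random.Random(seed)
    sequence = list(text_tokens) + list(prompt_codes)  # conditioning prefix
    audio_codes = []
    for _ in range(n_frames):
        code = next_token(sequence, rng)
        sequence.append(code)   # generated codes become context for later steps
        audio_codes.append(code)
    return audio_codes          # in a real system, decoded to audio by the codec
```

The key structural point is that text tokens and audio codec codes live in one sequence, so generation is ordinary next-token prediction.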
Taylor – Section 17.2 – Evaluation
Covers testing of the system by its developers, as well as evaluation via listening tests.
Taylor – Section 17.1 – Databases
Including the important issue of labelling the data.
Taylor – Chapter 3 – The text-to-speech problem
Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.
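The typical TTS pipeline structure that the chapter describes can be sketched as a chain of stages. Every function below is a hypothetical placeholder, chosen only to show the shape of the pipeline, not a real implementation.

```python
# Sketch of the classic TTS pipeline: written text is normalised into its
# spoken form, converted to a linguistic specification, and finally rendered
# as a waveform. Each stage is a trivial placeholder.

def normalise(text):
    """Front end: expand non-standard words (here, just the digit 2)."""
    return text.replace("2", "two")

def to_linguistic_spec(spoken_text):
    """Convert the spoken form into a sequence of linguistic units (here: words)."""
    return spoken_text.lower().split()

def synthesise_waveform(spec):
    """Back end: generate audio from the specification (placeholder output)."""
    return [0.0] * (1000 * len(spec))  # pretend 1000 samples per unit

def tts(text):
    return synthesise_waveform(to_linguistic_spec(normalise(text)))
```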
Taylor – Chapter 16 – Unit-selection synthesis
A substantial chapter covering target cost, join cost and search.
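The search over candidate units can be sketched as a Viterbi-style dynamic programme: a target cost scores how well a candidate matches the specification, a join cost scores each concatenation, and the search finds the cheapest path. The cost functions here are illustrative stand-ins, not the book's actual formulations.

```python
# Minimal unit-selection search sketch. candidates[i] is the list of database
# units available for target position i; the dynamic programme keeps, for each
# candidate at the current position, the cheapest path ending in that unit.

def select_units(targets, candidates, target_cost, join_cost):
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, path = min(
                (c + join_cost(p[-1], u), p) for c, p in best
            )
            new_best.append((cost + tc, path + [u]))
        best = new_best
    return min(best)  # (total cost, chosen unit sequence)
```

With numeric stand-in units, `target_cost` and `join_cost` can be simple distances; in a real system they combine many linguistic and acoustic sub-costs.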
Talkin. A Robust Algorithm for Pitch Tracking (RAPT)
The classic algorithm for estimating F0 from speech signals.
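The core lag-search idea can be sketched with a much-simplified estimator: pick the lag that maximises the autocorrelation of a voiced frame. RAPT itself uses the normalised cross-correlation function plus dynamic programming across frames; this toy version shows only the first-stage principle.

```python
# Simplified F0 estimation by autocorrelation lag search (not RAPT itself).
# For a periodic frame, the autocorrelation peaks at lags equal to multiples
# of the pitch period; we search lags corresponding to a plausible F0 range.

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    lag_min = int(sample_rate / f0_max)   # shortest period to consider
    lag_max = int(sample_rate / f0_min)   # longest period to consider
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag
```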
Shen et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Łańcucki. FastPitch: Parallel Text-to-Speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).
Ling et al. Deep Learning for Acoustic Modeling in Parametric Speech Generation
A key review article.
Kominek & Black. CMU ARCTIC databases for speech synthesis
Widely used, copyright-free speech databases for use in speech synthesis.