This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
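To make that framing concrete, here is a toy sketch of the decoding loop it implies: text tokens plus the codec codes of a speech prompt condition an autoregressive model that emits new codec codes one at a time. This is not VALL-E itself; the stub model, vocabulary size, and function names are all hypothetical stand-ins for a real Transformer.

```python
# Toy sketch of conditional codec-code generation (NOT the actual VALL-E code).
# The "model" is a deterministic stub; a real system would use a Transformer.
import random

CODEC_VOCAB = 1024  # hypothetical codebook size


def toy_next_token_logits(text_ids, codec_ids):
    """Stand-in for an autoregressive model: pseudo-logits over codec codes."""
    rng = random.Random(sum(text_ids) + sum(codec_ids) + len(codec_ids))
    return [rng.random() for _ in range(CODEC_VOCAB)]


def generate_codec_codes(text_ids, prompt_codec_ids, n_steps=10):
    """Greedy generation: text + speech-prompt codes condition each new code."""
    codes = list(prompt_codec_ids)
    for _ in range(n_steps):
        logits = toy_next_token_logits(text_ids, codes)
        codes.append(max(range(CODEC_VOCAB), key=logits.__getitem__))
    return codes[len(prompt_codec_ids):]  # only the newly generated codes
```

A neural codec decoder (such as SoundStream, below) would then turn the generated code sequence back into a waveform.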
Zeghidour et al: SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description, in this journal paper.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).
Shen et al: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Zen et al: Statistical parametric speech synthesis using deep neural networks
The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.
Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation
A key review article.
King: A beginners’ guide to statistical parametric speech synthesis
A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully-chosen bibliography is worth following up.
Talkin: A Robust Algorithm for Pitch Tracking (RAPT)
The classic algorithm for estimating F0 from speech signals.
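The core idea underlying RAPT can be illustrated with a much simpler autocorrelation estimator: a periodic signal correlates strongly with itself shifted by one pitch period, so the best lag within a plausible range gives F0. This is only a sketch of that principle; RAPT itself uses normalised cross-correlation plus dynamic-programming post-processing, and the function below is a hypothetical simplification.

```python
# Simplified autocorrelation F0 estimation for a single voiced frame.
# Illustrates the principle behind trackers like RAPT, not RAPT itself.
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return a crude F0 estimate in Hz for one frame, or None if unvoiced."""
    lag_min = int(sample_rate / f0_max)           # shortest candidate period
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    if sum(x * x for x in frame) == 0:            # silent frame
        return None
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # correlation between the frame and itself shifted by `lag` samples
        corr = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else None


# Example: a 200 Hz sine sampled at 16 kHz has a period of 80 samples,
# so the estimate should come out close to 200 Hz.
sr = 16000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
```

Real trackers must additionally handle octave errors, voicing decisions, and noise, which is where the robustness in RAPT's title comes from.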
Bennett: Large Scale Evaluation of Corpus-based Synthesisers
An analysis of the first Blizzard Challenge, an evaluation in which competing speech synthesisers are all built from a common database and compared in listening tests.
Kominek & Black: CMU ARCTIC databases for speech synthesis
Widely used, copyright-free speech databases for use in speech synthesis.
Taylor – Section 17.2 – Evaluation
Covers both testing of the system by its developers and formal listening tests.
Taylor – Section 17.1 – Databases
Including the important issue of labelling the data.