A review article that makes some useful connections between HMM-based speech synthesis and unit selection.
Zen et al. Statistical parametric speech synthesis using deep neural networks
The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including Encodec and the Descript Audio Codec, but SoundStream was one of the first, and this journal paper gives the most complete description.
Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.
Wu et al. Deep neural networks employing Multi-Task Learning…
Some straightforward but effective techniques for improving the performance of speech synthesis using simple feedforward networks.
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Watts et al. From HMMs to DNNs: where do the improvements come from?
Measures the relative contributions of the key differences in the regression model, state vs. frame predictions, and separate vs. combined stream predictions.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
Taylor – Section 17.2 – Evaluation
Covers testing of the system by its developers, as well as evaluation via listening tests.
Taylor – Section 17.1 – Databases
Includes the important issue of labelling the data.
Taylor – Section 12.7 – Pitch and epoch detection
Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.
Taylor – Section 12.4 – Linear-Prediction Analysis
An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
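As a companion to the linear-prediction reading, here is a minimal sketch of one common way to estimate the filter coefficients: the autocorrelation method solved with the Levinson-Durbin recursion. It is not taken from Taylor's text. The function name lpc_autocorrelation, the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, and the omission of windowing and pre-emphasis are all assumptions made for illustration.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Estimate linear-prediction coefficients for one frame of speech.

    Hypothetical helper for illustration. Returns (a, err), where a[0] == 1
    and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order,
    and err is the final prediction-error energy.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)

    # Autocorrelation of the frame at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # error energy of the order-0 predictor

    for i in range(1, order + 1):
        # Reflection coefficient for step i of the Levinson-Durbin recursion.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err

        # Update the lower-order coefficients, then append the new one.
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k

        # The error energy shrinks by (1 - k^2) at each step.
        err *= 1.0 - k * k

    return a, err


# Usage: a decaying exponential is the impulse response of a one-pole filter,
# so a low-order analysis should recover a[1] close to -0.9.
x = 0.9 ** np.arange(200)
a, err = lpc_autocorrelation(x, order=2)
```

In practice the frame would normally be windowed (for example with a Hamming window) before analysis, and the order chosen according to the sampling rate. Both choices are left out here to keep the sketch short.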