reading

Zen, Black & Tokuda: Statistical parametric speech synthesis

A review article that makes some useful connections between HMM-based speech synthesis and unit selection.

Zen et al: Statistical parametric speech synthesis using deep neural networks

The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec

There are several similar neural codecs, including EnCodec and the Descript Audio Codec, but SoundStream was one of the first, and this journal paper gives the most complete description.

Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System

Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.
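
To illustrate what "frame-by-frame" means here, the following is a minimal sketch and not Merlin's actual API: a tiny feedforward network maps each frame's linguistic feature vector to an acoustic feature vector, independently of every other frame. The function names, layer sizes and weights are all hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def feedforward(x: Vector, w_hidden: Matrix, w_out: Matrix) -> Vector:
    """One hidden tanh layer followed by a linear output layer (no biases)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def synthesise_frames(linguistic_frames: List[Vector],
                      w_hidden: Matrix, w_out: Matrix) -> List[Vector]:
    """Predict acoustic features one frame at a time.

    Each output depends only on the input for the same frame: there is no
    recurrence or attention linking frames together.
    """
    return [feedforward(frame, w_hidden, w_out) for frame in linguistic_frames]


# Hypothetical toy weights: 3 linguistic features -> 2 hidden units -> 2 acoustic features.
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_OUT = [[1.0, -1.0], [0.5, 0.5]]

frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
acoustic = synthesise_frames(frames, W_HIDDEN, W_OUT)
```

Because the first and third input frames are identical, their predicted acoustic frames are identical too. A sequence-to-sequence model does not have this property, since its predictions depend on the whole input sequence.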

Wu et al: Deep neural networks employing Multi-Task Learning…

Some straightforward but effective techniques for improving the performance of speech synthesis with simple feedforward networks.

Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?

A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences in the regression model, state vs. frame predictions, and separate vs. combined stream predictions.

Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
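
The language-modelling framing can be sketched as a loop that extends a context of text tokens and prompt codes one codec code at a time. This is only an illustrative sketch under strong assumptions, not the VALL-E implementation: `toy_next_code` is a hypothetical stand-in for a trained model, `CODEBOOK_SIZE` and `EOS` are made-up values, text tokens and codec codes share one integer id space for simplicity, and the sketch ignores that neural codecs typically produce several code streams per frame.

```python
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024   # assumed codec vocabulary size (illustrative only)
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence id, outside the codebook


def toy_next_code(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained language model.

    A real model would output a probability distribution over codec codes;
    here a deterministic function of the context keeps the sketch runnable.
    """
    return (sum(context) * 31 + len(context)) % CODEBOOK_SIZE


def generate_codes(text_tokens: Sequence[int],
                   prompt_codes: Sequence[int],
                   next_code: Callable[[Sequence[int]], int],
                   max_new_codes: int) -> List[int]:
    """Generate codec codes conditioned on text and a speech prompt."""
    # The conditioning is simply the preceding sequence: text first, then
    # the codec codes of the speech prompt.
    context = list(text_tokens) + list(prompt_codes)
    generated: List[int] = []
    for _ in range(max_new_codes):
        code = next_code(context)
        if code == EOS:
            break
        generated.append(code)
        context.append(code)  # each new code conditions the next prediction
    return generated


codes = generate_codes(text_tokens=[1, 2, 3], prompt_codes=[10, 20],
                       next_code=toy_next_code, max_new_codes=5)
```

In a real system the generated codes would then be passed to the codec's decoder to produce a waveform; that step is omitted here.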