This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are several similar neural codecs, including Encodec and the Descript Audio Codec, but SoundStream was one of the first, and this journal paper gives the most complete description.
Taylor – Section 12.4 – Linear-Prediction Analysis
An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
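As a rough illustration of the idea (not taken from Taylor's text), linear prediction models each sample as a weighted sum of the previous p samples, s[n] ≈ a_1 s[n-1] + ... + a_p s[n-p], and the weights define an all-pole filter. The sketch below uses the autocorrelation method with NumPy; the function name `lpc_autocorrelation` and the synthetic second-order test signal are assumptions made for this example only.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Estimate predictor coefficients a_1..a_p for one frame (autocorrelation method)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    # Build the Toeplitz matrix R[i, j] = r[|i - j|] of the normal equations
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags]
    # Solve R a = [r[1], ..., r[order]] for the predictor coefficients
    return np.linalg.solve(R, r[1:order + 1])


# Hypothetical test signal: white noise passed through a known, stable all-pole filter
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
n = 20000
e = rng.standard_normal(n)
s = np.zeros(n)
for t in range(2, n):
    s[t] = true_a[0] * s[t - 1] + true_a[1] * s[t - 2] + e[t]

est_a = lpc_autocorrelation(s, order=2)
print(est_a)  # should be close to true_a
```

In practice a speech frame would be windowed first, and the Toeplitz system is usually solved with the Levinson-Durbin recursion rather than a general linear solver; the direct solve is used here only to keep the sketch short.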
Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.
Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.
Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official Open Source implementation by the author (at NVIDIA).
King et al. Speech synthesis using non-uniform units in the Verbmobil project
Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.
What you should already know
Before continuing, you should check that you have the right background by watching this video.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 6-7)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 6-7.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including ‘Fourier Analysis’).
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.