An overview of the background and maths behind linear-prediction methods for modelling the vocal tract as a filter.
Tachibana et al: Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.
Ren et al: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves on FastSpeech by doing away with the complicated teacher-student training regime and instead training directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
King et al: Speech synthesis using non-uniform units in the Verbmobil project
Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 6-7)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 6-7.
Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)
Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including ‘Fourier Analysis’).
Watts et al: From HMMs to DNNs: where do the improvements come from?
Measures the relative contributions of the key differences in the regression model, state vs. frame predictions, and separate vs. combined stream predictions.
Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments
Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.
Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering
The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.
Wu et al: Deep neural networks employing Multi-Task Learning…
Some straightforward but effective techniques to improve the performance of speech synthesis using simple feedforward networks.
Taylor – Section 12.7 – Pitch and epoch detection
Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.
King: Measuring a decade of progress in Text-to-Speech
A distillation of the key findings of the first 10 years of the Blizzard Challenge.