Sharon Goldwater: Vectors and their uses
A nice, self-contained introduction to vectors and why they are a useful mathematical concept. You should consider this reading ESSENTIAL if you haven't studied vectors before (or it's been a while).
Schaedler - Seeing Circles, Sines and Signals
A very nice concise primer on the basic components of digital signal processing with great visual demonstrations.
Introduction to the IPA from the Handbook of the International Phonetic Association
Describes the aims of the International Phonetic Alphabet and its various uses.
Carr - English Phonetics and Phonology: An Introduction - Ch 5 - The Phonemic Principle
Takes you from phonetics (which is about sound) to phonology (which is about mental representation and organisation into categories).
Cho & Ladefoged - Variation and universals in VOT: evidence from 18 languages
Voice onset time (VOT) is known to vary with place of articulation; this paper examines how far that variation is universal and how far it differs across the 18 languages studied.
Vaux & Samuels - Explaining vowel systems: dispersion theory vs natural selection
Examines the cross-linguistic distribution of vowel systems and two competing explanations for it: dispersion theory and natural selection.
Peterson & Barney - Control Methods Used in a Study of the Vowels
Examines the performance of both speakers and listeners. A classic paper!
Sharon Goldwater: Basic probability theory
An essential primer on this topic. You should consider this reading ESSENTIAL if you haven't studied probability before (or it's been a while). We're adding this to the readings in Module 7 to give you some time to look at it before we really need it in Module 9 - mostly we need the concepts of conditional probability and conditional independence.
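Not part of the reading itself, but the two concepts named above can be illustrated in a few lines of Python. The joint distribution below is invented for illustration: a class variable C influences two observations A and B, which are conditionally independent given C by construction, yet dependent when C is marginalised out.

```python
# Sketch of conditional probability and conditional independence,
# using an invented joint distribution over three binary variables.
# C influences A and B; A and B are independent *given* C by construction.

from itertools import product

p_c = {True: 0.3, False: 0.7}              # P(C)
p_a_given_c = {True: 0.9, False: 0.2}      # P(A=true | C)
p_b_given_c = {True: 0.8, False: 0.1}      # P(B=true | C)

def bern(p, value):
    return p if value else 1 - p

# Joint P(A, B, C) built so that A is independent of B given C.
joint = {}
for a, b, c in product([True, False], repeat=3):
    joint[(a, b, c)] = p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)

def prob(pred):
    return sum(p for abc, p in joint.items() if pred(*abc))

# Conditional probability: P(A | B) = P(A, B) / P(B)
p_ab = prob(lambda a, b, c: a and b)
p_b = prob(lambda a, b, c: b)
print("P(A | B) =", p_ab / p_b)            # 0.23 / 0.31, not equal to P(A) = 0.41

# Conditional independence: P(A, B | C) equals P(A | C) * P(B | C)
p_ab_given_c = prob(lambda a, b, c: a and b and c) / p_c[True]
print(p_ab_given_c, p_a_given_c[True] * p_b_given_c[True])
```

So A and B are marginally dependent (knowing B changes our belief about A) but become independent once C is known - exactly the pattern exploited by the models in Module 9.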
Young et al: Token Passing
My favourite way of understanding how the Viterbi algorithm is applied to HMMs. Can also be helpful in understanding search for unit selection speech synthesis.
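The paper is the place to go for the full picture, but the core idea can be sketched briefly. In the sketch below (a toy two-state HMM with made-up probabilities, not from the paper), each state holds a token recording the best log-probability of reaching it; at every frame, tokens are propagated along transitions and each state keeps only the best incoming token.

```python
# Minimal token-passing sketch of Viterbi decoding for a toy HMM.
# A token is a pair (log_prob, state_history); states pass copies of
# their tokens to successors and keep only the best arrival.

import math

states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
             ("s2", "s1"): math.log(0.4), ("s2", "s2"): math.log(0.6)}
# Emission log-probs for discrete observation symbols "a" and "b".
log_emit = {("s1", "a"): math.log(0.8), ("s1", "b"): math.log(0.2),
            ("s2", "a"): math.log(0.3), ("s2", "b"): math.log(0.7)}

def viterbi_token_passing(obs):
    # Initialise one token per state from the first observation.
    tokens = {s: (log_init[s] + log_emit[(s, obs[0])], [s]) for s in states}
    for symbol in obs[1:]:
        new_tokens = {}
        for dst in states:
            # Every state passes its token to dst; dst keeps the best.
            best = max(
                (lp + log_trans[(src, dst)], hist + [dst])
                for src, (lp, hist) in tokens.items()
            )
            new_tokens[dst] = (best[0] + log_emit[(dst, symbol)], best[1])
        tokens = new_tokens
    return max(tokens.values())  # best final token: (log_prob, best path)

log_p, path = viterbi_token_passing(["a", "a", "b"])
print(path, log_p)  # ['s1', 's1', 's2'] and log(0.056448)
```

The appeal of the token-passing view is that nothing changes when the network gets more complicated: unit selection search looks the same, just with different states and costs.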
Furui et al: Fundamental Technologies in Modern Speech Recognition
A complete issue of IEEE Signal Processing Magazine. Although a few years old, this is still a very useful survey of the field's main techniques.
Clark et al: Festival 2 - build your own general purpose unit selection speech synthesiser
Discusses some of the design choices made when writing Festival's unit selection engine (Multisyn) and the tools for building new voices.
Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system
A description of the implementation and evaluation of Festival's unit selection engine, called Multisyn.
King et al: Speech synthesis using non-uniform units in the Verbmobil project
Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.
Hunt & Black: Unit selection in a concatenative speech synthesis system using a large speech database
The classic description of unit selection, described as a search through a network.
Zen, Black & Tokuda: Statistical parametric speech synthesis
A review article that makes some useful connections between HMM-based speech synthesis and unit selection.
Kominek & Black: CMU ARCTIC databases for speech synthesis
Widely used, copyright-free speech databases for use in speech synthesis.
Fitt & Isard: Synthesis of regional English using a keyword lexicon
An extension and practical application of Wells' keyvowels idea, which enables efficient generation of a pronunciation dictionary tailored to a specific accent or speaker.
Bennett: Large Scale Evaluation of Corpus-based Synthesisers
An analysis of the first Blizzard Challenge, which is an evaluation of speech synthesisers using a common database.
Benoît et al: The SUS test
A method for evaluating the intelligibility of synthetic speech, which avoids the ceiling effect.
Clark et al: Statistical analysis of the Blizzard Challenge 2007 listening test results
Explains the types of statistical tests that are employed in the Blizzard Challenge. These are deliberately quite conservative. For example, MOS data is correctly treated as ordinal. Also includes a section on Multi-Dimensional Scaling (MDS), a type of analysis that is not as widely used as the others.
Mayo et al: Multidimensional scaling of listener responses to synthetic speech
Multi-dimensional scaling is a way to uncover the different perceptual dimensions that listeners use when rating synthetic speech.
Norrenbrock et al: Quality prediction of synthesised speech...
Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.
Talkin: A Robust Algorithm for Pitch Tracking (RAPT)
The classic algorithm for estimating F0 from speech signals.
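RAPT itself uses normalised cross-correlation, candidate generation, voicing decisions and dynamic programming; the bare underlying idea, though, is that a periodic signal correlates strongly with a copy of itself shifted by one period. A toy version (very much not RAPT, and with an invented synthetic test tone rather than real speech):

```python
# Toy F0 estimation by autocorrelation: pick the lag, within a plausible
# pitch range, at which the frame best correlates with a shifted copy of
# itself. F0 is then sample_rate / best_lag.

import math

sample_rate = 16000
f0_true = 200.0  # Hz, synthetic test tone standing in for voiced speech

# One 50 ms frame of a pure sine wave.
n = int(0.05 * sample_rate)
frame = [math.sin(2 * math.pi * f0_true * t / sample_rate) for t in range(n)]

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    lag_min = int(sample_rate / f_max)   # shortest period considered
    lag_max = int(sample_rate / f_min)   # longest period considered
    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return sample_rate / best_lag

print(estimate_f0(frame, sample_rate))  # 200.0
```

Real speech makes this much harder (octave errors, unvoiced regions, noise), which is exactly what RAPT's extra machinery is there to handle.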
Kawahara et al: Restructuring speech representations...
The key paper about the STRAIGHT vocoder, which was originally intended for manipulating recorded natural speech.
King: A beginners’ guide to statistical parametric speech synthesis
A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully-chosen bibliography is worth following up.
King: Measuring a decade of progress in Text-to-Speech
A distillation of the key findings of the first 10 years of the Blizzard Challenge.
Nielsen: Neural Networks and Deep Learning
A great introduction. Relatively light on maths, and with some interactive explanations.
Gurney: An introduction to neural networks
Somewhat old, but might be helpful in getting some of the basic concepts clear, if you find Nielsen's "Neural Networks and Deep Learning" too difficult to start with.
Ling et al: Deep Learning for Acoustic Modeling in Parametric Speech Generation
A key review article.
Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, instead being trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
Zen et al: Statistical parametric speech synthesis using deep neural networks
The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.
Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering
The term "trajectory tiling" means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are "covered over" or "tiled" with waveform fragments.
Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec
There are various other similar neural codecs, including Encodec and the Descript Audio Codec, but SoundStream was one of the first and has the most complete description in this journal paper.
Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction
Very similar to FastSpeech 2, FastPitch has the advantage of an official Open Source implementation by the author (at NVIDIA).
Shen et al: Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions
Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.
Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.
Wu et al. Merlin: An Open Source Neural Network Speech Synthesis System
Merlin is a toolkit for building Deep Neural Network models for statistical parametric speech synthesis. It is a typical frame-by-frame approach, pre-dating sequence-to-sequence models.
Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?
A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.
Wu et al: Deep neural networks employing Multi-Task Learning...
Some straightforward, but effective techniques to improve the performance of speech synthesis using simple feedforward networks.
Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments
Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.
Watts et al: From HMMs to DNNs: where do the improvements come from?
Measures the relative contributions of the key differences in the regression model, state vs. frame predictions, and separate vs. combined stream predictions.
Other readings
Individual readings, papers, etc.