Cho & Ladefoged – Variation and universals in VOT: evidence from 18 languages

Voice onset time (VOT) is known to vary with place of articulation.

Carr – English Phonetics and Phonology: An Introduction – Ch 5 – The Phonemic Principle

Takes you from phonetics (which is about sound) to phonology (which is about mental representation and organisation into categories).

Ladefoged & Johnson – A course in phonetics – Chapter 8 – Acoustic phonetics

Links the source-filter model to spectrograms and acoustic analysis of speech.

Introduction to the IPA from the Handbook of the International Phonetic Association

Describes the aims of the International Phonetic Alphabet and its various uses.

Practical Phonetics

Videos for the course Practical Phonetics

Normal Speech Articulation

X-ray movies of speech

Seeing Speech

Interactive IPA chart

Watts et al: Where do the improvements come from in sequence-to-sequence neural TTS?

A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.

Shen et al: Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences between HMM-based and DNN-based systems: the regression model, state vs. frame predictions, and separate vs. combined stream predictions.

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and statistical parametric speech synthesis (SPSS) is to alternate between waveform fragments and vocoder-generated waveforms.

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.