Welcome
Welcome to the course! The first lecture will provide a detailed overview of this course, including:
- who the course is designed for
- the textbook
- scope and structure
- a brief history of speech synthesis
- teaching mode and how to get the most out of this course, including using this website
Lectures
Lectures (actually a varied mixture of lecture material and in-class activities) will be held on campus. Simon King will give the lectures this year, with Korin Richmond leading the lab sessions.
Practical assignment
The assignment for this course involves recording speech data and building a unit selection voice. Labs will be held on campus in the PPLS Computing Lab (Appleton Tower 4.02) on Linux desktop computers with all necessary software already installed. Remote access is possible, but attendance at the scheduled lab sessions is expected and is vital for success on this course.
Assumed background from the Speech Processing course
The Speech Synthesis course assumes you have previously taken Speech Processing. If you have not, first talk to the lecturer to obtain permission, then revise the following material from Speech Processing (items in bold are the most important):
Module 1: Introduction to the International Phonetic Alphabet
Module 2: Waveform; Spectrum; Spectrogram
Module 3: Time Domain; Sound Source; Periodic Signal; Pitch; Digital Signal; Short-term analysis; Series Expansion; Fourier Analysis; Frequency domain
Module 4: Harmonics; Impulse train; Spectral envelope; Filter; Impulse response; Source-filter model
Module 5: Tokenisation & normalisation; Handwritten rules; Phonemes and allophones; Pronunciation; Prosody
Module 6: Diphone; Waveform concatenation; Overlap-add; Pitch period; TD-PSOLA
Module 7: Feature vectors, sequences, and sequences of feature vectors; Pattern Matching, Alignment, Dynamic Time Warping
Modules 8, 9, 10: ideally, try to get some understanding of what a Hidden Markov Model is, but don’t worry if you don’t fully understand this material. You do not need to understand how Automatic Speech Recognition works.
1.1.1 Festival unit selection, fully automatic (uncorrected) voice build, professional speaker
1.1.2 Nuance online demo (the Vocalizer product)
1.1.3 Acapela online demo
1.2.1 Festival + HTS statistical parametric (HMM) waveform generation with the STRAIGHT vocoder (same front end, speaker and database as 1.1.1)
1.2.2 Festival + HTS + STRAIGHT, with acoustic model adaptation from an average voice model to around 30 minutes of speech from non-professional target speaker
1.3.1 Microsoft’s “trajectory tiling” hybrid waveform generation and the corresponding paper
1.3.2 Nuance’s “multiform” method, which alternates between unit selection and statistical parametric waveform generation (example currently unavailable)
1.3.3 USTC / iFlyTek hybrid waveform generation and the corresponding paper
1.3.4 SHRC hybrid waveform generation and the corresponding paper
1.6.1 highly intelligible HMM-generated speech (HTS benchmark from Blizzard Challenge 2013)
1.6.2 unit selection speech from ILSP, judged as highly natural by listeners in the Blizzard Challenges of 2011 and 2013
1.4.1 von Kempelen’s speaking machine
1.4.2 Dudley’s “Voder”
1.4.3 the UK speaking clock – unit selection from physical recordings
1.5.2 Diphone synthesis
1.5.3 First-generation neural network synthesis (circa 2014)
Other material
Reading
Taylor – Chapter 3 – The text-to-speech problem
Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.
Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser
Discusses some of the design choices made when writing Festival's unit selection engine (Multisyn) and the tools for building new voices.
Download the slides for the class on 2025-01-14
Right – you’re all ready to get started on the main course content. The course calendar tells you which modules are covered in which lecture. You need to complete all the videos and the essential readings for the specified modules in advance of each lecture, then come to the lecture armed with questions and ready for an interactive discussion.
Let’s finish off this module with an example application of text-to-speech synthesis: