Speech Synthesis

Following on from the introductory material in Speech Processing, we move on to more sophisticated ways to generate the waveform, from unit selection to statistical parametric models. We also cover some more advanced speech signal processing.

This course is taught at the University of Edinburgh as the Speech Synthesis course, at advanced undergraduate and Masters levels. Students should normally have completed the Speech Processing course first, which includes material on the Text-to-Speech front end. In this Speech Synthesis course, the focus is mostly on waveform generation.

Weekly schedule
The calendar shows which module you need to complete before each week's lecture. It also lists lab times and specifies the coursework deadline.
Readings
You will find reading lists within each module. Here, you will find the same readings arranged into alphabetically-sorted lists, broken down by module or importance.
Module 1 - introduction
This module contains some introductory material and speech samples, to accompany the first lecture, which is an introduction to the course.
Module 2 - unit selection
Concatenating fragments of recorded natural speech can produce extremely natural-sounding synthetic speech. The core problem is how to select the most appropriate waveform fragments to concatenate.
Module 3 - unit selection target cost functions
The target cost is critical to choosing an appropriate unit sequence. Several different forms are possible, using linguistic features, or acoustic properties, or a combination of both.
Module 4 - the database
The quality of unit selection depends on good-quality recorded speech with accurate labels.
Module 5 - evaluation
How do we know how good our synthesiser is? Can we use formal evaluation to decide how to improve it?
Module 6 - speech signal analysis & modelling
We cover epoch detection, F0 estimation and the spectral envelope, and how to represent each of them for modelling. We also consider aperiodic energy. With these in place, we can analyse and then reconstruct speech: this is called vocoding.
Module 7 - Statistical Parametric Speech Synthesis
After establishing the key concepts and motivating this way of doing speech synthesis, we cover the Hidden Markov Model approach.
Module 8 - Deep Neural Networks
The use of neural networks is motivated by replacing the regression trees, which were used in the HMM approach, with a more powerful regression model.
Module 9 - sequence-to-sequence models
True sequence-to-sequence models improve on frame-by-frame models by encoding the entire input sequence and then generating the entire output sequence.
The state-of-the-art
The content of this part of the course is updated each year. We will cover the latest developments.
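
As a rough illustration of the selection problem described under Modules 2 and 3, the sketch below treats unit selection as a search for the candidate sequence with the lowest total cost. It is a minimal sketch under stated assumptions, not the method taught in the course: each unit is reduced to a single number (a toy stand-in for something like an F0 value), and the function name `select_units`, the join cost (a penalty on mismatch between consecutive units, which the module summaries above do not mention) and both toy cost functions are hypothetical. The search itself is a standard Viterbi-style dynamic programme.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[float, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Return (index of the chosen candidate at each position, total cost).

    Assumes every position has at least one candidate (hypothetical setup).
    """
    n = len(targets)
    if n == 0:
        return [], 0.0

    # best[t][j]: lowest cost of any path ending in candidate j at position t
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    # back[t][j]: index of the best predecessor of candidate j at position t
    back = [[-1] * len(candidates[0])]

    for t in range(1, n):
        row_cost, row_back = [], []
        for c in candidates[t]:
            # Cost of arriving at c from each candidate at the previous position
            arrive = [
                best[t - 1][i] + join_cost(p, c)
                for i, p in enumerate(candidates[t - 1])
            ]
            i_best = min(range(len(arrive)), key=arrive.__getitem__)
            row_cost.append(arrive[i_best] + target_cost(targets[t], c))
            row_back.append(i_best)
        best.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path, total


# Toy example: three target positions, two candidate units at each
targets = [100.0, 120.0, 110.0]
candidates = [[90.0, 105.0], [118.0, 140.0], [100.0, 111.0]]
path, total = select_units(
    targets,
    candidates,
    target_cost=lambda t, c: abs(t - c),        # hypothetical target cost
    join_cost=lambda a, b: 0.5 * abs(a - b),    # hypothetical join cost
)
print(path, total)
```

In a real system the costs would be computed from linguistic features, acoustic properties, or both, as outlined under Module 3; only the search structure is shown here.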
