Module 1 – introduction

This module contains introductory material and speech samples to accompany the first lecture, which introduces the course.

Module status: ready

Welcome

Welcome to the course! The first lecture will provide a detailed overview of this course, including:

  • who the course is designed for
  • the textbook
  • scope and structure
  • a brief history of speech synthesis
  • teaching mode and how to get the most out of this course, including using this website

Lectures

Lectures (actually a varied mixture of lecture material and in-class activities) will be held on campus. Simon King will give the lectures this year, with Korin Richmond leading the lab sessions.

Practical assignment

The assignment for this course involves recording speech data and building a unit selection voice. Labs will be held on campus in the PPLS Computing Lab (Appleton Tower 4.02) on Linux desktop computers with all necessary software already installed. Remote access is possible, but attendance at the scheduled lab sessions is expected and is vital for success on this course.

Assumed background from the Speech Processing course

The Speech Synthesis course assumes you have previously taken Speech Processing. If you have not, first talk to the lecturer to obtain permission, then revise the following material from Speech Processing (items in bold are the most important):

Module 1: Introduction to the International Phonetic Alphabet

Module 2: Waveform; Spectrum; Spectrogram

Module 3: Time Domain; Sound Source; Periodic Signal; Pitch; Digital Signal; Short-term analysis; Series Expansion; Fourier Analysis; Frequency domain

Module 4: Harmonics; Impulse train; Spectral envelope; Filter; Impulse response; Source-filter model

Module 5: Tokenisation & normalisation; Handwritten rules; Phonemes and allophones; Pronunciation; Prosody

Module 6: Diphone; Waveform concatenation; Overlap-add; Pitch period; TD-PSOLA

Module 7: Feature vectors, sequences, and sequences of feature vectors; Pattern Matching; Alignment; Dynamic Time Warping

Modules 8, 9, 10: ideally, try to get some understanding of what a Hidden Markov Model is, but don’t worry if you don’t fully understand this material. You do not need to understand how Automatic Speech Recognition works.

1.1.1 Festival unit selection, fully automatic (uncorrected) voice build, professional speaker

1.1.2 Nuance online demo (the Vocalizer product)

1.1.3 Acapela online demo

1.2.1 Festival + HTS statistical parametric (HMM) waveform generation with the STRAIGHT vocoder (same front end, speaker and database as 1.1.1)

1.2.2 Festival + HTS + STRAIGHT, with acoustic model adaptation from an average voice model to around 30 minutes of speech from non-professional target speaker

1.3.1 Microsoft’s “trajectory tiling” hybrid waveform generation and the corresponding paper

1.3.2 Nuance’s “multiform” method, which alternates between unit selection and statistical parametric waveform generation
(example currently unavailable)

1.3.3 USTC / iFlyTek hybrid waveform generation and the corresponding paper

1.3.4 SHRC hybrid waveform generation and the corresponding paper

1.6.1 highly intelligible HMM-generated speech (HTS benchmark from Blizzard Challenge 2013)

1.6.2 unit selection speech from ILSP judged as highly natural by listeners in the Blizzard Challenges of 2011 and 2013, respectively

1.8 Google Tacotron samples

1.4.1 von Kempelen’s speaking machine

1.4.2 Dudley’s “Voder”

1.4.3 the UK speaking clock – unit selection from physical recordings

1.5.2 Diphone synthesis

1.5.3 First-generation neural network synthesis (circa 2014)

Other material

Reading

Download the slides for the class on 2025-01-14

Right – you’re all ready to get started on the main course content. The course calendar tells you which modules are covered in which lecture. You need to complete all the videos and the essential readings for the specified modules in advance of each lecture, then come to the lecture armed with questions and ready for an interactive discussion.

Let’s finish off this module with an example application of text-to-speech synthesis: