Module 5 – speech synthesis – waveform generation

Manipulating recorded speech signals to create new utterances.
Log in

First, remind yourself about the architecture of a text-to-speech system: a front-end linguistic processor, followed by a waveform generator.

The first method we’ll use for waveform generation is Time-Domain Pitch Synchronous Overlap and Add (TD-PSOLA), so let’s see first how that works in practice.

If we wish to modify the spectral envelope, then we need a more powerful technique than TD-PSOLA, and so we’ll then look at linear prediction, which can manipulate source and filter separately.

Download the slides for the module 3, 4 and 5 videos

Download the additional slides for the class on 2019-10-24 : Module 5

Total video to watch in this module: 63 minutes

Using our knowledge of speech production, we predict that making joins at mid-phone positions will sound better than at phone boundaries.

In this video, I occasionally refer to a second screen displaying a waveform; that screen was not recorded. I think most of the content of this video still makes sense.

Log in if you want to mark this as completed

Time-domain pitch-synchronous overlap-and-add is a remarkably simple but effective way to independently modify the duration and F0 of speech.
Log in if you want to mark this as completed
To understand the power of linear predictive waveform coding, we'll consider the problem of smoothing the joins in concatenative synthesis.
Log in if you want to mark this as completed
It's just a simple equation, but for this course we don't need to get too deep into the maths.

The notation used in the linear prediction equation is slightly different from that used in class. It’s only a change of notation: the equation is otherwise exactly the same.

Log in if you want to mark this as completed

Now we understand how linear predictive coding works, we can use it to smooth the spectral envelope across joins.
Log in if you want to mark this as completed
Since the filter is time-varying, we need to decide how frequently (and at what moments in time) to update its co-efficients.
Log in if you want to mark this as completed
Exciting the filter with a simple pulse train doesn't produce good quality. Fortunately, there is an almost-perfect excitation signal: the residual.
Log in if you want to mark this as completed

This concludes the speech synthesis material for this course. You should have been doing all of the essential readings as you went along. Now is a good time to go back and complete as many of the recommended readings as you can, and perhaps some of the extra readings on topics that are of particular interest to you.