Total video to watch in this module: 63 minutes
Using our knowledge of speech production, we predict that making joins at mid-phone positions will sound better than at phone boundaries.
In this video, I occasionally refer to a second screen displaying a waveform; that screen was not recorded. I think most of the content of this video still makes sense.
Time-domain pitch-synchronous overlap-and-add is a remarkably simple but effective way to independently modify the duration and F0 of speech.
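To make the idea concrete, here is a minimal sketch of the overlap-and-add step, not the implementation used in the video. It assumes a perfectly periodic signal with a constant, known pitch period and evenly spaced pitch marks; real speech needs estimated pitch marks and a window that follows the local period. The function name `td_psola` and its parameters are hypothetical.

```python
import numpy as np

def td_psola(signal, marks, period, duration_scale=1.0, f0_scale=1.0):
    """Sketch of TD-PSOLA for a signal with a constant pitch period (an assumption).

    signal: 1-D array of samples
    marks: pitch mark positions (sample indices), one per pitch period
    period: pitch period in samples
    duration_scale: > 1 makes the output longer
    f0_scale: > 1 raises F0 by placing the units closer together
    """
    # Analysis: cut out one Hann-windowed unit, two periods long, around each pitch mark.
    window = np.hanning(2 * period + 1)
    grains = [signal[m - period:m + period + 1] * window for m in marks]

    # Synthesis: F0 is set by the spacing of the units, duration by how many we place.
    new_period = int(round(period / f0_scale))
    n_synth = int(round(len(marks) * duration_scale * f0_scale))
    out = np.zeros(2 * period + 1 + (n_synth - 1) * new_period)
    for j in range(n_synth):
        # Pick the analysis unit closest in relative time, repeating or skipping units as needed.
        k = min(int(j / (duration_scale * f0_scale)), len(grains) - 1)
        start = j * new_period
        out[start:start + 2 * period + 1] += grains[k]
    return out

# Synthetic "voiced" signal: 20 periods of 80 samples each.
period = 80
n = np.arange(20 * period)
signal = np.sin(2 * np.pi * n / period) + 0.3 * np.sin(4 * np.pi * n / period)
marks = np.arange(period, len(signal) - period, period)

same = td_psola(signal, marks, period)                        # no modification
longer = td_psola(signal, marks, period, duration_scale=2.0)  # twice as long, same F0
higher = td_psola(signal, marks, period, f0_scale=2.0)        # same duration, F0 doubled
```

With no modification, the overlapping Hann windows sum to one between the first and last pitch marks, so the output matches the input there. Duration and F0 are controlled separately: one by the number of units placed, the other by their spacing.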
To understand the power of linear predictive waveform coding, we'll consider the problem of smoothing the joins in concatenative synthesis.
It's just a simple equation, but for this course we don't need to get too deep into the maths.
The notation used in the linear prediction equation is slightly different from that used in class. It’s only a change of notation: the equation is otherwise exactly the same.
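For reference, one common way of writing the linear prediction equation is sketched below. The symbols are assumptions and may match neither the video nor the class notation: s[n] is the speech sample at time n, a_k are the predictor coefficients, p is the predictor order and e[n] is the prediction error.

```latex
\hat{s}[n] = \sum_{k=1}^{p} a_k \, s[n-k],
\qquad
e[n] = s[n] - \hat{s}[n]
```

Each sample is predicted as a weighted sum of the previous p samples, and whatever the predictor gets wrong is left in e[n].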
Now that we understand how linear predictive coding works, we can use it to smooth the spectral envelope across joins.
Since the filter is time-varying, we need to decide how frequently (and at what moments in time) to update its coefficients.
Exciting the filter with a simple pulse train doesn't produce good-quality speech. Fortunately, there is an almost-perfect excitation signal: the residual.
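The following is a minimal sketch of why the residual is such a good excitation signal, not the method used in the video. It assumes the autocorrelation method for estimating the filter coefficients from a single frame of a synthetic signal, with a fixed filter rather than a time-varying one. The function names and the test signal are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate predictor coefficients a_1..a_p with the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    # Toeplitz matrix of autocorrelation values at lags 0..order-1.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def residual(frame, a):
    """Inverse filter: subtract the prediction made from the previous p samples."""
    p = len(a)
    padded = np.concatenate([np.zeros(p), frame])  # samples before the frame taken as zero
    e = np.empty(len(frame))
    for n in range(len(frame)):
        past = padded[n:n + p][::-1]  # s[n-1], ..., s[n-p]
        e[n] = frame[n] - np.dot(a, past)
    return e

def synthesise(excitation, a):
    """All-pole filter: add the prediction from past outputs to the excitation."""
    p = len(a)
    out = np.zeros(len(excitation) + p)
    for n in range(len(excitation)):
        past = out[n:n + p][::-1]
        out[n + p] = excitation[n] + np.dot(a, past)
    return out[p:]

# Synthetic frame: two sinusoids plus a little noise, 400 samples at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(400) / 16000
frame = (np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
         + 0.05 * rng.standard_normal(len(t)))

a = lpc_coefficients(frame, order=8)
e = residual(frame, a)
rebuilt = synthesise(e, a)  # exciting the filter with its own residual
```

Because the residual is exactly what the predictor failed to capture, passing it back through the same filter recovers the original waveform up to rounding error. A pulse train only approximates that excitation.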