Pitch period

This fundamental building block of speech waveforms offers a route to source-filter separation in the time domain.


We're going to talk a bit more about the concept of the pitch period.
We're going to remind ourselves how that relates to the source-filter model.
We take an impulse train and we pass it through a filter with resonances.
We can generate speech with the source-filter model.
One way to understand the filter is through its impulse response.
That is, we put a single impulse into the filter and observe the waveform that we get out.
That waveform is a pitch period.
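As a minimal sketch of that idea, the code below stands in for the vocal tract with a single two-pole resonator. The resonance frequency (500 Hz), bandwidth (100 Hz), sample rate and the helper names are all assumptions chosen for illustration, not values from the video.

```python
import numpy as np

def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for y[n] = x[n] - a1*y[n-1] - a2*y[n-2] (one resonance)."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2 * np.pi * freq_hz / sample_rate        # pole angle
    return -2 * r * np.cos(theta), r * r

def filter_signal(x, a1, a2):
    """Run the difference equation sample by sample, from a zero state."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

sample_rate = 16000
a1, a2 = resonator_coefficients(500.0, 100.0, sample_rate)  # assumed values

# A single impulse in: what comes out is the impulse response.
impulse = np.zeros(400)
impulse[0] = 1.0
impulse_response = filter_signal(impulse, a1, a2)

# An impulse train in: one impulse every 128 samples (T0 = 8 ms, F0 = 125 Hz).
period = 128
train = np.zeros(4 * period)
train[::period] = 1.0
speech_like = filter_signal(train, a1, a2)
```

Up to the second impulse, `speech_like` is identical to `impulse_response`: the first pitch period of the output is the filter's impulse response.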
Using this idea, we'll devise a way to break a natural speech signal down into a sequence of pitch periods, which later we can use to manipulate that speech signal.
Here's a short fragment of voiced natural speech.
The source-filter model tells us that this was generated using a simple excitation signal from the vocal folds (idealised as an impulse train in the model) passed through the vocal tract filter.
So this signal is a sequence of vocal tract impulse responses.
We need to separate the source and the filter so that we can manipulate them separately.
For example, to modify the fundamental frequency (F0) without changing the identity of the phone, we just need to manipulate the source and leave the filter alone.
We need to identify the filter from this signal so that we can preserve it.
One way to do that, of course, would be to fit a source-filter model and solve for the filter coefficients.
But remember that there are several ways to represent the filter, all containing the same information but in different domains.
Those coefficients (of the difference equation) are just one representation.
We could also talk about the frequency response of the filter, or about its impulse response.
The impulse response exists in the time domain.
So there is a way to get hold of the filter right here in the time domain.
We're going to use the impulse responses as our representation of the vocal tract filter.
That will require us to find the fundamental periods of the speech, so we can find those impulse responses from this waveform.
To find the fundamental periods we need to place pitch marks on the speech.
Pitch marks are estimates of epochs, which are the moments of vocal fold closure.
Pitch marking can be done automatically using methods which are beyond the scope of this video.
Here's a short utterance and its pitch marks.
It sounds like this: 'Nothing's impossible.'
Let's take a closer look.
If we look at this region, we see that the speech is transitioning from voiced to unvoiced.
So what do we do there about the pitch marks?
We still need to break this speech down into pitch periods, even when there is no fundamental frequency.
In other words, we need to find analysis frames.
So we'll just revert to a fixed frame rate - a fixed value of the fundamental period - and that's equivalent to just placing evenly-spaced pitch marks through the unvoiced regions.
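One way to sketch that fallback in code is shown below. It assumes the voiced pitch marks have already been found by some pitch marker, and the function name, the default period and the rule "a gap of at least two default periods counts as unvoiced" are illustrative assumptions.

```python
def fill_pitch_marks(voiced_marks, num_samples, default_period):
    """Return pitch marks covering the whole signal, as sample indices.

    voiced_marks: sorted marks found in voiced speech (assumed given).
    default_period: fixed spacing, in samples, used where there is no F0.
    Any gap of at least 2 * default_period is filled with even spacing.
    """
    marks = []
    previous = 0
    # The extra entry num_samples lets the loop also fill the final gap.
    for mark in list(voiced_marks) + [num_samples]:
        position = previous + default_period
        while mark - position >= default_period:
            marks.append(position)
            position += default_period
        if mark < num_samples:
            marks.append(mark)
        previous = mark
    return marks

# Voiced marks 64 samples apart (4 ms at 16 kHz), with unvoiced speech or
# silence on either side, filled at a fixed 80-sample (5 ms) spacing.
all_marks = fill_pitch_marks([400, 464, 528, 592], 1000, 80)
print(all_marks)
# [80, 160, 240, 320, 400, 464, 528, 592, 672, 752, 832, 912]
```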
Zoom back out and look at a different part of the utterance: the end, where the speech is finishing and we end up in silence.
We'll see we also need to place pitch marks here, so that we can break this signal down into short parts.
So we also need to place pitch marks in silence.
In signal processing, silence is typically treated just like the rest of the signal.
Now we've placed pitch marks on our signal.
Aligning them precisely with the true epochs - which we don't have access to - is a little bit hard, although you can see that we can at least place one pitch mark in each fundamental period, and that's good enough.
Now the vocal tract filter has a potentially infinite impulse response, due to vocal tract resonance.
That means that one impulse response will generally not have decayed away to zero before the next one starts: they overlap.
That's particularly obvious in this signal, which is speech from a female speaker.
The fundamental period is about 4 ms, corresponding to an F0 of about 250 Hz.
Quite clearly, the impulse response has not decayed to zero before the next impulse starts.
The impulse responses overlap.
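A quick calculation makes the overlap plausible. It assumes a single resonance with a bandwidth of 100 Hz (an illustrative figure, not one from the video) and uses the standard result that the envelope of such a resonance decays as exp(-pi * B * t).

```python
import math

fundamental_period = 0.004        # seconds, the 4 ms quoted above
f0 = 1.0 / fundamental_period     # F0 in Hz

bandwidth_hz = 100.0              # assumed bandwidth of one resonance
# Fraction of the starting amplitude left when the next impulse arrives.
remaining = math.exp(-math.pi * bandwidth_hz * fundamental_period)

print(f"F0 = {f0:.0f} Hz, envelope after one period = {remaining:.2f}")
# F0 = 250 Hz, envelope after one period = 0.28
```

Under that assumption, more than a quarter of the amplitude is still there when the next impulse response begins.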
Because the pitch marks might not be precisely at the true epoch locations, and because the impulse responses overlap, we can't naively cut this waveform into individual impulse responses.
There's no place where we can do that.
Instead, we'll remember something we learned earlier about short-term analysis.
We will place an analysis frame - centred on each pitch mark - and apply a tapered window and use that to extract the pitch periods.
We'll find each epoch; we'll place an analysis frame around it; we'll extract a pitch period like that.
We'll do that for every pitch mark in the utterance to get a sequence of pitch periods.
Typically, we make the duration of these twice the fundamental period: 2 x T0.
Sometimes we just say 'two pitch periods'.
These look very much like the overlapping frames of a typical short-term analysis technique.
In fact, it's exactly the same.
The only difference here is that we're placing the frames pitch-synchronously and we're varying their duration in proportion to the fundamental period: we're making it 2 x T0.
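Pitch-synchronous framing can be sketched as follows. The pitch marks are taken as given, each frame runs from the previous mark to the next one (so it is about 2 x T0 long), and the window is a triangle peaking at the frame's own mark. The function name and the example marks are hypothetical.

```python
import numpy as np

def extract_pitch_periods(signal, marks):
    """Cut one windowed frame per interior pitch mark.

    Returns a list of (start_index, frame) pairs. The first and last marks
    only serve as frame edges, so they get no frame of their own.
    """
    frames = []
    for prev, centre, nxt in zip(marks[:-2], marks[1:-1], marks[2:]):
        # Ramp up from 0 at the previous mark to 1 at the centre mark...
        rise = np.linspace(0.0, 1.0, centre - prev, endpoint=False)
        # ...then back down, reaching 0 at the next mark.
        fall = np.linspace(1.0, 0.0, nxt - centre, endpoint=False)
        window = np.concatenate([rise, fall])
        frames.append((prev, signal[prev:nxt] * window))
    return frames

sample_rate = 16000
signal = np.sin(2 * np.pi * 250.0 * np.arange(400) / sample_rate)
marks = [0, 64, 128, 200, 272]    # hypothetical pitch marks, in samples
frames = extract_pitch_periods(signal, marks)
print([len(frame) for _, frame in frames])
# [128, 136, 144]
```

The frame durations follow the local spacing of the pitch marks, which is the only thing that distinguishes this from fixed-rate short-term analysis.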
These little pitch period building blocks capture only vocal tract filter information.
We are now going to use these building blocks to synthesise speech signals, by concatenating them using overlap-add.
Let's confirm, then, that overlap-add of these pitch periods does indeed reconstruct the original signal.
These pitch periods were extracted using a very simple triangular window.
So, if we overlap-add, everything adds back together correctly and we get almost the original signal back.
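Here is a small sketch of why that works, under the assumption that each triangular window spans from the previous pitch mark to the next one. The falling half of one window and the rising half of the next then sum to one at every sample between two marks. The function names, the noise test signal and the marks are illustrative.

```python
import numpy as np

def triangular_frames(signal, marks):
    """Windowed frames, previous mark to next mark, as (start, frame) pairs."""
    frames = []
    for prev, centre, nxt in zip(marks[:-2], marks[1:-1], marks[2:]):
        rise = np.linspace(0.0, 1.0, centre - prev, endpoint=False)
        fall = np.linspace(1.0, 0.0, nxt - centre, endpoint=False)
        frames.append((prev, signal[prev:nxt] * np.concatenate([rise, fall])))
    return frames

def overlap_add(frames, num_samples):
    """Add every frame back into place at its original start index."""
    output = np.zeros(num_samples)
    for start, frame in frames:
        output[start:start + len(frame)] += frame
    return output

rng = np.random.default_rng(0)
signal = rng.standard_normal(400)        # stand-in for a speech waveform
marks = [0, 64, 128, 200, 272, 336]      # hypothetical pitch marks

reconstructed = overlap_add(triangular_frames(signal, marks), len(signal))

# Between the second and second-to-last marks, two windows always overlap.
interior = slice(marks[1], marks[-2])
print(np.allclose(reconstructed[interior], signal[interior]))
```

In this sketch the interior samples come back exactly, and only the stretches before the second mark and after the second-to-last mark are tapered or missing. A real pitch-marking and windowing pipeline may not be this exact.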
Let's listen to the original whole utterance from which this fragment was taken: 'Nothing's impossible.'
And here is the reconstructed waveform, from which the bottom fragment has come: 'Nothing's impossible.'
If you listen carefully on headphones, you'll hear some small artefacts.
But the waveform was reconstructed pretty well.
I've reconstructed this waveform without making any modifications to it, just to prove that I can decompose a speech signal into pitch periods and put it back together again.
This process is often called 'copy synthesis'.
We've just seen a new form of signal processing, which is pitch-synchronous.
It's essentially just the same as short-term analysis that we've seen before, except that we align the analysis frames to the fundamental periods of the signal and vary the analysis frame duration according to the fundamental period.
This video is called 'Pitch period', but we had to actually generalise that concept a little bit because the impulse responses of the vocal tract typically overlap in natural speech signals.
That makes it impossible to extract a single impulse response.
The generalisation we made was to extract overlapping frames and apply a tapered window in a way that makes reconstruction of the signal possible by simply using overlap-add.
This representation of the speech signal - as a sequence of pitch periods - is at the heart of the TD-PSOLA method, which can modify F0 and duration of speech signals in the time domain, using waveforms directly.
We're also going to be able to understand the interaction of the source and filter in the time domain, which is a process known as convolution.
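As a small preview of what convolution means here, the sketch below convolves a toy impulse train with a toy four-sample impulse response (both made up for illustration) and checks that the result is the same as overlap-adding one copy of the impulse response at each impulse.

```python
import numpy as np

impulse_response = np.array([1.0, 0.6, 0.3, 0.1])   # toy filter, not real speech
period = 3                                          # shorter than the response
source = np.zeros(9)
source[::period] = 1.0                              # impulses at samples 0, 3, 6

# Source and filter interact through convolution.
speech = np.convolve(source, impulse_response)

# The same thing by hand: one shifted impulse response per impulse, summed.
manual = np.zeros(len(source) + len(impulse_response) - 1)
for n in np.flatnonzero(source):
    manual[n:n + len(impulse_response)] += source[n] * impulse_response

print(np.allclose(speech, manual))
```

Because the period is shorter than the impulse response, the tail of each response (0.1) lands on the start of the next one (1.0), giving 1.1 at samples 3 and 6: the overlap described earlier.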