F0 estimation (part 1)

Whilst epochs are moments in time, the fundamental frequency (F0) is the rate of vibration of the vocal folds expressed in Hertz. It is generally estimated over a frame spanning several pitch periods, containing multiple epochs.

So, moving on then to estimating F0.
Let's just remind ourselves of the difference between epoch detection and F0 estimation.
Epoch detection is finding something in the time domain - instants in time - perhaps they relate to when the vocal folds snap shut (the glottal closure instants).
That has uses in signal processing, specifically in pitch-synchronous signal processing.
We're now going to move on to F0 estimation, which is a related problem.
F0 is defined as the local rate of vibration of the vocal folds, expressed as a frequency in Hertz (cycles per second).
That's a parameter of speech.
It might be useful for the join cost in a unit selection synthesiser and, as we're going to see eventually, it's an important parameter in a vocoder for analysing speech into a parametric representation, which you might either manipulate or model, and then reconstructing the waveform from that representation.
So both epoch detection and F0 estimation have many different names in the literature.
I'm attempting to be consistent in my use of:
epoch detection for "pitch marking" or "GCI detection", and
F0 estimation for "pitch tracking", "FO tracking", and so on.
Now, since these are obviously very related (because they're from the same physical property of the signal, which comes from the same physical part of speech production: the vocal folds), surely it would make sense to do epoch detection and then just look at the duration between epochs.
That's the fundamental period.
Do "1 over that", then get the fundamental frequency.
So it's a perfectly reasonable question to ask: Can we estimate F0 after doing epoch detection?
Let's understand why that might not be the best thing to do.
Here are the epochs I detected with my very simple algorithm.
It doesn't matter that they're not lined up with the main peak in the signal.
Let's try and estimate F0 from these epochs.
We could do it very locally.
We could look at a pair of epochs.
We could measure the duration (time) that elapsed between them.
This is called the fundamental period, or T0, and that's just equal to 1/F0.
T0 is in seconds (s) and F0 is in Hertz (Hz).
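If you wanted to sketch that naive approach in code, it might look something like this (purely illustrative; it assumes we already have a list of detected epoch times in seconds, and the names are mine):

```python
def f0_from_epochs(epoch_times):
    # One very local F0 estimate per pair of consecutive epochs.
    f0_values = []
    for t_prev, t_next in zip(epoch_times[:-1], epoch_times[1:]):
        t0 = t_next - t_prev          # fundamental period T0, in seconds
        f0_values.append(1.0 / t0)    # F0 in Hertz
    return f0_values
```

For example, epochs at 0.100 s and 0.105 s give T0 = 5 ms and therefore F0 = 200 Hz.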
That would be correct; however, it would be very prone to lots of local errors.
For example, imagine our epoch detector missed out this epoch.
It made a mistake.
We would then get an octave error here in F0; the fundamental period has doubled and we get an F0 halving error.
So we would get lots of local errors: lots of noise in our estimate of F0.
A smarter way to estimate F0, given that we know it changes slowly, would be to estimate its value over rather longer time periods than single epochs.
Then we could be more robust to errors in epoch detection.
The method we're about to develop does essentially that, but it doesn't need to do epoch detection.
It works directly with the speech signal.
It avoids the step of epoch detection and the errors that that might make.
It works directly with a speech signal and looks, over some windows of time, for repeating patterns in the speech waveform.
So the key, then, is that when we estimate F0, we can do so by looking at rather longer time windows than one pitch period: multiple pitch periods.
Then we should be more robust, because the more signal we see, the more robust our estimate of F0 should be.
Our choice of window size will be dictated by how fast we think F0 can change and how fast it can turn on and off at boundaries between voiced and unvoiced speech.
So I just said we're going to attempt to find the periods - the fundamental periods - in this waveform without having to explicitly mark them as epochs.
The way we're going to do that is just look for repeating patterns in the waveform.
Visually, there's a very obvious repeating pattern in this waveform.
This pattern repeats (approximately) and it slowly changes.
We're going to try and look for that.
Specifically, we're going to try and look for it repeating from one pitch period to the next.
We're going to try and measure the similarity between two consecutive pitch periods.
The method isn't going to just do that for one pair of pitch periods - for example, this one and this one - it's going to do it for this one and the next one, and that one and the next one, over some window that we can choose.
So we're just going to look for general self-similarity between the waveform and itself, shifted in time.
The shift will be one period.
Now, we don't know the period, so we can't just shift the waveform by exactly a period and measure the similarity.
We're going to have to try shifting the waveform by many different amounts and find the place where it's most self-similar.
Let's do that.
I'll need a copy of the waveform and I'm going to put the waveforms on top of each other like that.
All I'm going to do is take one of the copies, and I'm going to slowly shift it to one side: I'm going to time-shift it.
The shift has got a special name in this form of signal processing: it's called the lag.
So let's just let one of them drift slowly to one side and see what happens.
From time to time we see little glimpses of self-similarity between the waveforms.
That was one there - it's fairly self-similar.
A little bit of a one there.
There's one coming up: that's a bit self-similar.
But right at this moment they were very similar to each other.
Let's just look at it in slow motion.
That way we can be sure to spot the exact moment where there's a lot of self-similarity between the waveforms.
So at that moment, the shift between one waveform and the next is exactly one pitch period.
They're really very similar to each other; they're not identical, but very similar.
So how are we going to measure that similarity?
There's actually a very simple way of measuring the similarity between two signals. They could be two different signals or the signal and a shifted version of itself.
That's just to sample-by-sample multiply them together.
That's technically known as the inner product (that doesn't matter).
Sample-by-sample, we'll multiply them together.
If you think about it, that will give the biggest possible value when the signals are maximally aligned, because all the positive numbers (all the positive samples) will be multiplied by positive numbers and the negative ones will be multiplied by negative numbers.
A negative number times a negative number is always positive, and all of that will give us the biggest possible value.
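As a one-line illustration of that idea (a sketch only, assuming NumPy and two equal-length sequences of samples):

```python
import numpy as np

def similarity(x, y):
    # Inner product: multiply sample-by-sample, then sum.
    return float(np.dot(x, y))
```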
And so we can write that in this very simple equation.
Don't be put off by the number of different letters in this equation!
We're going to deconstruct it in a moment.
This is known as cross-correlation.
Sometimes it's called autocorrelation or "modified autocorrelation".
The differences between those are not really important to us; they're just to do with how big a window we use as we're calculating the self-similarity.
We're going to stick with the cross-correlation, which is defined like this.
This r is the cross-correlation value, or the function.
It has two parameters.
It should be fairly obvious that it varies depending on where in the signal we calculate it: that's t, the position in the utterance, so t may vary.
And it's also going to vary - this is the key parameter - as we shift one signal with respect to the other; that's the lag, τ.
What we'll do is vary τ and try to find the value that gives the highest value of the cross-correlation (or what I've called here autocorrelation) function.
And that value of τ will be the fundamental period: the value that gives the biggest self-similarity.
We'll see that in a moment.
And how is this defined?
We're going to sum speech samples, so x is the value of a sample in the waveform.
It's just a number stored in that raw waveform file and j indexes into time.
So that's the j'th sample in this utterance.
Let's say there might be 16 000 samples per second.
And we're going to multiply this sample by a shifted sample: a sample at time j plus τ (plus the shift).
So it's going to be one speech sample (from one of the waveforms) multiplied by another speech sample from the other waveform.
So what are all these other letters on the right hand side?
j indexes time and in this summation it's going to start at one particular time and count up to another particular time.
It's going to do that over a window.
The size of the window is W.
So W is window size in samples.
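To make that concrete, here's a direct translation of the equation into code, as a sketch (x is assumed to be a list or array of waveform samples; t, τ, and W are all measured in samples, and the function name is mine):

```python
def cross_correlation(x, t, tau, W):
    # r(t, tau): sum, over a window of W samples starting at t+1,
    # of each sample multiplied by the sample tau samples later.
    total = 0.0
    for j in range(t + 1, t + W + 1):   # j runs from t+1 to t+W, as in the equation
        total += x[j] * x[j + tau]
    return total
```

(Note that the waveform must be long enough to contain sample t + W + τ.)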
Let's look at how that really works for one particular time and one particular lag, and calculate this autocorrelation (or rather, modified autocorrelation or cross-correlation) function - don't worry about the differences between those names!
Here's a speech waveform.
I'll need a copy of it.
This is the one that I'm going to shift.
So the top one will give me those x_j samples and the bottom one will get shifted.
It'll have a lag.
That'll be my x_{j+τ}, where τ is the lag.
I'll do this over a fixed window, and the size of the window in samples is W.
We'll need to choose that value.
We're going to shift one of the waveforms by some lag.
Let's just do that.
I'll just pick some arbitrary lag for now, and just do it for one value of τ.
So that waveform has just been shifted by the lag, and that's denoted by τ.
So we're now looking a bit further into the future of this waveform: it slid to the left, so we're looking at samples that were originally to the right.
Let's just write all the various parameters on that equation onto this diagram.
We've already seen that the window size is W.
The window is going to be where we multiply corresponding samples of the two waveforms and sum them up.
So the start time of the window was t+1 and the end time was t+W.
It looks a bit silly to use t+1, and not t, but the reason for that is so that there's exactly W samples inside this window.
We could have done t to t+W-1; it doesn't really matter.
What are the two samples that we're multiplying and adding together?
Well, we'll take one sample here, that will be x_j, and we'll take a sample down here, that will be x_{j+τ}, because we shifted this one.
Just to be completely clear, it's just the value of the sample: how high above the zero axis it is.
We'll take those two things and we'll multiply them together and we'll do that for every corresponding pair of samples.
So we'll take this sample times the sample, plus the next sample times the next sample, plus the next sample times the next sample,... and we'll do that W times and then we'll sum them up.
Let's look again at the equation while we're talking about that.
In the top waveform, we were pulling out the x_j samples.
In the bottom waveform, we were pulling out the shifted versions: looking forwards in time by τ, so that it's x_{j+τ}.
Then we're just simply multiplying them together.
So x_j x_{j+τ} means multiply those two things together.
And then we were doing that for all pairs of samples across some window W and then summing the result together.
Let's make sure we fully understand the left hand side.
The most important parameter - and it's in parentheses to show it's the argument of the function - is τ.
What we're going to do in a moment, is we're going to calculate this function for several different values of τ starting at 0, and plot it.
We're going to look for peaks in that function.
There's obviously going to be a really big peak at τ=0 because that's just the signal multiplied sample by sample with itself with no lag: it's maximally self similar.
We'll get the maximum value there.
But, at a shift of one pitch period, we're hoping we'll also get another maximum of this function.
So the key parameter there is τ.
But of course, the function also depends on where exactly in the waveform we placed the window: where's the start time of the window?
So if we wanted to track F0 across a whole utterance, we would have to calculate this function and do this peak picking operation (which we're going to do in a moment) at several different values of t.
Maybe we would like the value of F0 every 10 milliseconds.
So we have to vary t from 0 up to the end of the utterance in steps of 10 ms and calculate F0 at each of those through this procedure that we're going to look at in a moment.
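As a small sketch of that outer loop (the 16 kHz figure and the names are just assumptions for illustration):

```python
def frame_start_times(num_samples, fs=16000, frame_shift_sec=0.010):
    # One analysis frame every 10 ms; each frame will yield one F0 estimate.
    frame_shift = int(frame_shift_sec * fs)   # 10 ms = 160 samples at 16 kHz
    return range(0, num_samples, frame_shift)
```

Each of those start times is one value of t at which we place the window and run the per-frame procedure sketched after the peak-picking discussion below.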
So here's a plot of the function, with the lag in samples.
This is for a waveform that has actually been downsampled to a rather low sample rate.
If you want to investigate this for yourself, there's a blog post.
Go and find that: there's a spreadsheet and there are waveforms, and you can calculate this exact function for yourself, step by step.
So I said the key parameter of the autocorrelation (or cross-correlation) function is τ (or lag).
At a lag of 0, that means the waveforms at the top and the bottom are not shifted with respect to each other, and so we're going to get the maximum possible value for this function.
Then we're going to set τ=1 and we're going to calculate the whole thing again.
So move the waveform one sample to the left, multiply and sum all the samples together and get the value of the function.
They'll be a little bit less similar, so the value will go down a bit.
And then we just plot that function for various different values of τ.
Now, over what range should we calculate it?
Well, we have an idea about what sort of values the pitch period might be: the fundamental period.
We vary τ from, let's say, zero up to well past the maximum value of the pitch period that we expect for this particular speaker.
I've done that here.
I've varied τ from zero up to a shift of 100 samples in this downsampled waveform (the sampling rate is much lower than 16 kHz actually).
Look what happens.
At another lag - this lag here - we get a really big peak: the waveforms are very self-similar, even though one is shifted with respect to the other: that's the pitch period.
If we keep on shifting the waveform, eventually when there's a shift of 2 pitch periods, we'll get another peak in the self-similarity.
Hopefully, that will be smaller than the first peak.
So this autocorrelation function is a function of the lag, τ.
Let's just look again at the animation to reinforce that.
We took our waveform.
We made a copy of the waveform, not shifted at all; this is with a lag of 0.
We placed a window over which we're going to calculate this function.
Note that the window is spanning several epochs and it has many speech samples in it.
So we're estimating F0 from a lot of information here: a lot more than a few error-prone individual epoch marks.
Then we calculate the autocorrelation (strictly speaking: the cross-correlation) function between these two waveforms for the current value of τ, which is zero.
So we take this sample, multiply by this sample and add it to this times this plus this times this,... all the way through the waveform.
Because every sample has essentially been multiplied by itself, we get the sum of squared numbers, which are all positive, and so we get a nice big value.
All we do then is move the waveform one sample to the left and calculate that whole function again, and then another sample to the left and calculate again.
So, as this waveform slides to the left, at every shift of one sample we recalculate the autocorrelation function across this whole window.
So here τ is increasing.
τ is going up and up and up as we slide to the left.
So it's clear that this cross-correlation function is going to take a little bit of computation, because every individual point on it - which is for a particular value of τ; this one's for a value of τ=40 samples of lag - just to calculate that one value involves a sum over W samples within the window.
So we have to do W multiplies and add them all together to calculate that one value.
And then we have to repeat that for as many values of τ as we think we need, so that we'll find this big peak.
So now you understand why estimating F0 from a speech signal might be a bit slower than you were expecting: to calculate each value of this plot takes W multiplies and then a summation.
We've got to do that for every value of τ, and we've still got to actually identify F0, which means finding this big peak.
That gives us just one number, the value of F0 at this frame.
We're then going to move the window (the frame) along in time, maybe forwards 10 ms and do it all again.
We're going to repeat that every 10 ms for the whole utterance.
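To put rough numbers on that (these figures are illustrative assumptions, not values from the video): with a window of W = 400 samples and 100 candidate lags, each frame needs about 400 × 100 = 40 000 multiply-adds, and at one frame every 10 ms that is roughly 4 million multiply-adds per second of speech, before we even do the peak picking.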
So the general form of this algorithm for F0 estimation is looking a little bit like the algorithm for epoch detection.
There's a core to the algorithm.
That's what we're currently talking about.
The core of this algorithm is to calculate this correlation function and to make a plot of it and then to find the biggest peak in that plot.
So we're looking for the biggest peak at a non-zero lag, because zero lag is trivial: that's where the waveform is just equal to itself.
So some peak after that.
The way to locate that - we'll set some search range of expected lags (in other words, we know in advance what the range of F0 of our speaker is) and from that range, we can set the range of pitch periods from a minimum to a maximum.
We'll search over that range in the autocorrelation function and find the biggest peak in there; the lag of the biggest peak equals the pitch period T0 (sometimes called the fundamental period), and 1 over that gives F0.
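Putting the pieces together, a minimal sketch of that core step might look like this (it reuses the cross_correlation sketch from earlier, assumes NumPy, assumes the waveform extends at least t + W + tau_max samples, and the F0 limits here are just illustrative):

```python
import numpy as np

def estimate_f0_for_frame(x, t, fs, W, f0_min=50.0, f0_max=400.0):
    # Search range of lags, set from the expected F0 range of the speaker.
    tau_min = int(fs / f0_max)   # shortest expected fundamental period, in samples
    tau_max = int(fs / f0_min)   # longest expected fundamental period, in samples
    # Cross-correlation value at each candidate lag (using the earlier sketch).
    r = [cross_correlation(x, t, tau, W) for tau in range(tau_min, tau_max + 1)]
    best_tau = tau_min + int(np.argmax(r))   # lag of the biggest peak = T0 in samples
    return fs / best_tau                     # F0 in Hertz, since T0 = best_tau / fs
```

As the next lines discuss, this naive peak picking will still make errors, which is why pre- and post-processing are needed.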
Now, we already said in epoch detection that peak picking is hard.
That's true.
So we expect to make errors when we're peak picking in the autocorrelation function.
Another thing that makes it non-trivial is that real signals are not perfectly periodic.
So as we shift one waveform with respect to the other, the pitch periods won't all line up on top of each other perfectly.
F0 might be changing slowly.
Or there might be local effects such as some jitter in F0: the periods are getting longer and shorter locally.
As we saw, as we shifted the waveforms past each other there are some moments of self-similarity which happened at times other than exact multiples of the pitch period, and that's because of the formant structure of the waveforms.
So some of the peaks in the autocorrelation function - these ones here - are actually due to the structure within a pitch period, giving us some spurious self-similarity that's not the pitch period.
That's because of the formant structure of speech.
So just like epoch detection, we'll take the core idea, which is a good one.
Here, the core idea is cross-correlation (or modified autocorrelation) and peak picking.
These are straightforward things, but we're going to have to dress them up with some pre- and post-processing to deal with their limitations.
So, like most signal processing algorithms, we might consider doing some pre-processing or post-processing.
The pre-processing will be to try and make the problem simpler before we hand over to the main part of the algorithm, which here is going to be cross-correlation and peak picking, and the post-processing will be to fix up any errors that that core algorithm makes.
