Cepstral Analysis, mel-scale Filterbanks, MFCCs

We now start thinking about what a good representation of the acoustic signal should be, motivating the use of Mel-Frequency Cepstral Coefficients (MFCCs).

So, on with the remainder of the class, which is all about feature engineering.
We'll start with the magnitude spectrum and the filterbank features that we've seen before in the last module.
Quick recap.
We extract short-term analysis frames from the speech waveform, in the time domain, and use a tapered window to avoid discontinuities at the edges.
We apply the discrete Fourier transform and get the magnitude spectrum.
This is written on a logarithmic scale.
So this is the log magnitude spectrum, and from that we extract filterbank features.
filterbank features are found from a single frame of speech in the magnitude spectrum domain.
We divide it into bands spaced on the mel scale.
Sum up the energy in each band and write that into an element of the feature vector.
Typically we use triangular-shaped filters and space their centres and widths on the mel scale, so they get further apart and wider as we go up in frequency (in Hz).
The energy in each of those bands is summed; this one is written into the first element of the feature vector, and so on, building up a multi-dimensional feature vector.
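(As a concrete illustration, here is a minimal Python sketch of that filterbank feature extraction; the sample rate, frame length, hop size, and number of filters are illustrative choices, not values prescribed here.)

```python
# A minimal sketch of log mel filterbank feature extraction, assuming a
# 16 kHz mono numpy array `waveform`; all parameter values are illustrative.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_filterbank_features(waveform, sr=16000, frame_len=400,
                                hop=160, n_fft=512, n_filters=23):
    # Triangular filters whose centres are equally spaced on the mel scale,
    # so they get further apart (in Hz) and wider as frequency increases.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    centres = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fbank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, mid, hi = centres[i], centres[i + 1], centres[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))

    window = np.hamming(frame_len)                   # tapered window
    feats = []
    for start in range(0, len(waveform) - frame_len, hop):
        frame = waveform[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))      # magnitude spectrum
        feats.append(np.log(fbank @ (mag ** 2) + 1e-10))   # log energy per band
    return np.array(feats)                           # shape (num_frames, n_filters)
```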
Now let's make it really clear that the features in this feature vector will exhibit a lot of covariance: they're highly correlated with each other
That will become very obvious when we look at this animation.
So remember that each band of frequencies is going into the elements of this feature vector extracted with these triangular filters.
So I'm going to play the animation.
Just look at energy in adjacent frequency bands.
Clearly the energy in this band and the energy in this band are going up and down together
see that when this one goes up, this one tends to go up.
When this one goes down, this one tends to go down.
They're highly correlated.
They covary because they're adjacent and the spectral envelope is a smooth thing and so these features are highly correlated.
If we wanted to model this feature vector with a multivariate Gaussian, it would be important to have a full covariance matrix to capture that.
So the filterbank energies themselves are perfectly good features, unless we want to model them with a diagonal-covariance Gaussian, which is what we've decided to do.
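(A quick numerical check of that claim, assuming `feats` is the (num_frames, n_filters) array from the sketch above:)

```python
import numpy as np

corr = np.corrcoef(feats.T)      # (n_filters, n_filters) correlation matrix
adjacent = np.diag(corr, k=1)    # each band's correlation with its neighbour
print("mean correlation of adjacent bands:", adjacent.mean())
# A full-covariance Gaussian could capture this; a diagonal one cannot.
```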
So we're going to do some feature engineering to get around that problem
we're going to build up to some features called Mel Frequency Cepstral Coefficients - or MFCCs, as they are usually called
they take inspiration from a number of different directions
One strong way of motivating them is to actually go back and think about speech production again and to remember what we learned a little while ago about convolution.
Knowing what we know about convolution, we're going to derive a form of analysis called Cepstral analysis.
Cepstral is an anagram of spectral: it's a transform of spectral.
Here's a recap: convolution in the time domain (convolution of waveforms) is equivalent to addition in the log magnitude spectrum domain.
So just put aside the idea of the filterbank for a moment.
Let's go right back to the time domain and start again.
This is our idealised source.
This is the impulse response of the vocal tract filter
if we convolve those (that's what star means), we'll get our speech signal in the time domain
that's equivalent to transforming each of those into the spectral domain and plotting on a log scale
And then we see that log magnitude spectra add together.
So this becomes addition in the log magnitude spectrum domain
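(A small illustrative sketch of that identity, using invented stand-ins: an impulse train for the idealised source and a decaying exponential for the vocal tract filter's impulse response.)

```python
import numpy as np
import matplotlib.pyplot as plt

n_fft = 1024
source = np.zeros(960)
source[::100] = 1.0                        # impulse train, period T0 = 100 samples
h = np.exp(-np.arange(64) / 10.0)          # toy "vocal tract" impulse response
speech = np.convolve(source, h)            # convolution in the time domain

# Convolution of the waveforms becomes addition of their log magnitude spectra:
# log|DFT(speech)| = log|DFT(source)| + log|DFT(h)|
log_mag = lambda x: 20.0 * np.log10(np.abs(np.fft.rfft(x, n_fft)) + 1e-10)
for name, x in [("source", source), ("filter", h), ("speech", speech)]:
    plt.plot(log_mag(x), label=name)
plt.xlabel("DFT bin"); plt.ylabel("log magnitude (dB)"); plt.legend(); plt.show()
```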
Convolution is a complicated operation.
We might imagine (perfectly reasonably) that undoing a convolution is very hard:
in other words, given only the speech signal, going backwards to decompose it into source and filter in the time domain is hard.
But we could imagine that undoing an addition is rather easier.
And that's exactly what we're about to do.
How do we get from the time domain to the frequency domain?
We use the Fourier transform.
The Fourier transform is just a series expansion.
So to get from this time domain signal to this frequency domain signal we did a transform that's a series expansion
that series expansion had the effect of turning this axis from time in units of seconds to this axis in frequency, which has units of what? Hz.
But Hz = 1 / s.
So this series expansion has the effect of changing the axis to be 1 over the original axis.
So we start with something in the time domain and get something in the "1 / time" domain = frequency domain.
So that's a picture of speech production.
But we don't get to see that when we're doing speech recognition!
All we get is a speech signal which we can compute the log magnitude spectrum of.
What would I like to get for doing automatic speech recognition?
Well, I've said that fundamental frequency is not of interest.
I would like the vocal tract frequency response.
That's the most useful feature for doing speech recognition.
But what I start with is this.
So can I undo the summation?
That's how speech is produced.
But we don't have access to that.
We would like to do this.
I would like to start from this thing on the left, which we can easily compute with the Fourier transform from an analysis frame of speech.
We would like to decompose that into the sum of two parts: filter plus source
then for speech recognition, we'll just discard the source.
How might you solve this equation, given only the thing on the left?
One obvious option is to use a source-filter model.
We could use a filter that's defined by its difference equation.
We could solve for the coefficients of that difference equation and that will give us this part.
And then we could just subtract that from the original signal; whatever is left must be this part here, which we might call the 'remainder' or, more formally, the 'residual', and we'll assume that was the source.
That would be an explicit way of decomposing into source and filter: we would get both the source and the filter.
Actually, we don't care about the source for speech recognition.
We just want the filter, so we could do something a bit simpler.
Fitting an explicit source-filter model involves making very strong assumptions about the form of the filter.
For example, a resonant (all-pole) filter, whose difference equation has a very particular form, with particular coefficients.
We might not want to make such a strong assumption
solving for that difference equation (solving for the coefficients, given a frame of speech waveform) can be error prone
and it's actually not something we're covering in this course.
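(For completeness, here is one hedged sketch of that explicit decomposition using linear prediction; `librosa.lpc` and `scipy.signal` are just one possible toolkit, `frame` is assumed to be a windowed analysis frame, and an order of 12 is an illustrative choice.)

```python
import numpy as np
import librosa
import scipy.signal

order = 12                                       # all-pole filter order (a strong assumption)
a = librosa.lpc(frame, order=order)              # coefficients of the difference equation
residual = scipy.signal.lfilter(a, [1.0], frame) # inverse filtering: what's left is the "source"

# The filter's frequency response is the part we would keep for recognition.
w, H = scipy.signal.freqz([1.0], a, worN=512)
log_filter_response = np.log(np.abs(H) + 1e-10)
```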
we want to solve this apparently difficult-to-solve equation where we know the thing on the left
we want to turn it into a sum of the two things on the right.
These two things have quite different looking properties with respect to this axis here, which is frequency.
This one's quite slowly varying, smooth and slowly varying.
It's a slow function of frequency with respect to the frequency axis.
This one here is quite rapidly moving.
It changes rapidly with respect to frequency.
So we would like to decompose this into this slowly varying part and that rapidly varying part
I mean slowly and rapidly varying with respect to this axis, the frequency axis.
I can't directly do that into these two parts, but I can write something more general down like this.
I can say that a log magnitude spectrum of an analysis frame of speech equals something plus something plus something plus something and so on...
where we start off with very slowly-varying parts and then slightly quicker-varying all the way up to eventually very rapidly-varying parts.
Does that look familiar? I hope so!
It's a series expansion, not unlike Fourier analysis.
So it's a transform.
We're going to transform the log magnitude spectrum into a summation of basis functions weighted by coefficients.
Well, we could use the same basis functions as Fourier analysis.
In other words, sinusoids with magnitude and phase.
Or any other suitable series of basis functions.
The only important thing is that the basis functions have to be orthogonal
so they have to be a series of orthogonal functions that don't correlate with each other.
Go and revise the series expansion video if you need to remember what that is.
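(A quick numerical check of that orthogonality, using a cosine series as the example; N and the particular indices are arbitrary illustrative choices.)

```python
import numpy as np

N = 513                                    # number of points along the frequency axis
n = np.arange(N)
basis = lambda k: np.cos(np.pi * k * (2 * n + 1) / (2 * N))   # k-th cosine basis function

print(np.dot(basis(3), basis(7)))          # different basis functions: ~0 (orthogonal)
print(np.dot(basis(3), basis(3)))          # a basis function with itself: clearly non-zero
```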
In this particular case, for a series expansion of the log magnitude spectrum, the most popular choice of basis functions is actually a series of cosines.
where we just need the magnitude of each.
There's no phase there, just exactly cosines
that suits the particular properties of the log magnitude spectrum.
So we're going to write down this.
This part here equals a sum of some constant function times some coefficient (a weight).
Plus some amount of this function: that's the lowest frequency cosine we can fit in there, times some weight
Plus some amount of the next one, plus some amount of the next one, and so on... for as long as we like
here's a suitable [series of] orthogonal basis functions.
That's a cosine series.
It starts with this one, which is just the offset (the 'zero frequency' component, if you like)
and then works its way up through the series
And so we'll be characterising the log magnitude spectrum by these coefficients.
This is a cosine transform
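(Concretely, a minimal sketch of that cosine transform, assuming `frame` is a windowed analysis frame; the DCT-II from scipy is one standard concrete form of it.)

```python
import numpy as np
from scipy.fft import dct

n_fft = 1024
log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)   # log magnitude spectrum
cepstral_coeffs = dct(log_mag, type=2, norm='ortho')          # weights of the cosine series
```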
Lots and lots of textbooks give you a really useless and unhelpful analogy to try and understand what's happening here.
They'll say something like the following:
Let's pretend this is time and then do the Fourier transform
Well, we don't need to do that!
A series expansion is a series expansion.
There's nothing here that requires this label on this axis to be frequency, time, or anything else.
You don't need to pretend that's the time axis.
You don't need to pretend this is a Fourier transform.
This is just a series expansion.
You've got some complicated function and you're expressing it as the sum of simple functions weighted by coefficients.
It's those coefficients that characterise this particular function that we're expanding now.
How does this help us solve what we wanted to solve?
We just write this thing (this magnitude spectrum) as the sum of two parts:
A slowly-varying part with respect to frequency, that we'll say is the filter.
And a rapidly-varying part we'll say is the source
because here we've got some long series of parts.
Well, we'll just work our way up through the series and at some point say that's as rapidly-varying as we ever see the filter (with respect to frequency) and we'll stop and draw a line.
And everything after that we'll say is not the filter.
So we'll just count up through this series and at some point we'll stop and we'll say: the slow ones are the filter and the rapid ones are the source.
So all we've got to do in fact now, is decide where to draw the line
and there's a very common choice of value there, and that's to keep the first 12 basis functions, counting this one as number one.
This is a special one.
It's just the energy and we count it as zero.
So that's the first basis function, the second, the third, and we go up to the 12th.
Again, in other descriptions of cepstral analysis, particularly of the form used to extract MFCCs, you might see choices other than the cosine basis functions
you could use a Fourier series, for example
that's a detail that's not important.
The important thing is - conceptually - this is a series expansion into a series of orthogonal basis functions
exactly what functions you expand into is less important.
We won't get into an argument about that!
We'll just say cosine series.
What is the output of that series expansion going to look like?
Well, just like any other series expansion, such as the Fourier transform, we'll plot those coefficients on an axis.
This is frequency in Hz.
Frequency is the same as 1 / time.
Remember that a series expansion gives you a new axis that is "1 / the previous axis".
So 1 / (1 / time) is ... time !
So actually, we're going to have something that's got a time axis, a time scale.
This is going to be the size of the coefficient of each of the basis functions.
So we're going to count 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 basis functions.
But at that point we'll say: these ones belong to the filter, and everything else to the right we're actually going to discard because it belongs to the source.
But let's see what it might look like.
Well, we just have some values.
This thing is going to be hard to interpret.
It's going to be the coefficients of the cosine series.
But what would we find if we kept going well past 12?
At some point there would be a coefficient at some high time where we get some spike again.
Let's think about what that means on the magnitude spectrum.
So this is the cosine series expansion.
These lower coefficients here...
Now, this axis is time.
But because this is a transform from frequency 'onwards', we don't typically label it with time.
We label it with an anagram of frequency and people use the word quefrency.
Don't ask me why!
Its units are seconds.
So this low quefrency coefficient is the slowly moving part.
One of these, perhaps
Here is some faster-moving part - up here some faster-moving part.
Perhaps this one
And eventually this one will be that one that moves at this rate
this is a cosine function that happens to just 'snap onto' the harmonics.
It would just match the harmonics.
So this one here is going to be the fundamental period (T0).
Because this matches the harmonics, which are spaced at F0.
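(A hedged sketch of reading that off, reusing `cepstral_coeffs` from the sketch above and assuming the frame is voiced and sampled at 16 kHz; the 50-400 Hz search range is illustrative.)

```python
import numpy as np

sr = 16000
# For this transform, the coefficient index corresponds (approximately) to a
# period measured in samples, so search only over plausible fundamental periods.
lo, hi = sr // 400, sr // 50                      # periods for 400 Hz down to 50 Hz
t0 = lo + np.argmax(np.abs(cepstral_coeffs[lo:hi]))
print("estimated F0 is approximately", sr / t0, "Hz")
```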
For our purposes, we're going to stop here.
I'm going to throw away all of these as being the fine detail and retain the first 12: 1 to 12.
So this truncation is what separates source and filter.
specifically it just retains the filter and discards the source.
Truncation of a series expansion is actually a very well-principled way to smooth a function.
So essentially we just smooth the function on the left to remove the fine detail (the harmonics) and retain the detail up to some certain scale
and our scale is up to 12 coefficients.
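(A sketch of that truncation-as-smoothing idea, again reusing `log_mag` and `cepstral_coeffs` from the earlier sketch: zero everything above coefficient 12 and invert the cosine transform.)

```python
import numpy as np
from scipy.fft import idct

truncated = np.zeros_like(cepstral_coeffs)
truncated[:13] = cepstral_coeffs[:13]       # keep 0 (overall level) and coefficients 1-12
envelope = idct(truncated, type=2, norm='ortho')   # smoothed log magnitude spectrum
# `envelope` follows the overall shape of `log_mag` but without the harmonic detail.
```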
