From MFCCs, towards a generative model using HMMs

Overview of steps required to derive MFCCs, moving towards modelling MFCCs with Gaussians and Hidden Markov Models

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
Now what we just saw there was the true cepstrum, which we got from the original log magnitude spectrum.
That's going to inspire one of the processing steps that's coming now in the feature extraction pipeline for Mel Frequency Cepstral Coefficients.
So let's get back on track with that.
We've got the frequency domain: the magnitude spectrum.
We've applied a filterbank to that, primarily to warp the scale to the mel scale.
But we can also conveniently choose filter bandwidths that smooth away much of the evidence of F0
We will take the log of the output of those filters.
So when we implement this, normally what we have here is the linear magnitude spectrum, apply the triangular filters and then take the log of their outputs
We could plot those against frequency and draw something that's essentially the spectral envelope of speech
We then take inspiration from the cepstral transform.
There is a series expansion in that:
we'll use cosine basis functions, and that's called cepstral analysis.
That will give us the cepstrum.
Then we decide how many coefficients we would like to retain, so we'll truncate that series, typically at coefficient number 12,
and that will give us these wonderful coefficients called Mel Frequency Cepstral Coefficients.
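To make that pipeline concrete, here is a minimal Python sketch of those last few steps for a single frame. The function name, the pre-computed mel filterbank matrix and the magnitude spectrum input are all assumptions made just for illustration; real toolkits (HTK, for example) add further details such as pre-emphasis and liftering.

```python
import numpy as np

def mfcc_from_magnitude_spectrum(mag_spectrum, mel_filterbank, num_coeffs=12):
    """Sketch of the final stages of MFCC extraction for one analysis frame.

    mag_spectrum   : linear magnitude spectrum of the frame, shape (num_bins,)
    mel_filterbank : triangular filters on the mel scale, shape (num_filters, num_bins)
    num_coeffs     : how many cepstral coefficients to keep (12 is the common choice)
    """
    # Apply the triangular mel filters to the linear magnitude spectrum.
    filter_outputs = mel_filterbank @ mag_spectrum          # shape (num_filters,)

    # Compressive non-linearity: take the log of the filter outputs.
    log_energies = np.log(filter_outputs + 1e-10)           # small floor avoids log(0)

    # Series expansion on cosine basis functions (a DCT): this is the cepstral analysis.
    num_filters = len(log_energies)
    n = np.arange(num_filters)
    cepstrum = np.array([
        np.sum(log_energies * np.cos(np.pi * k * (n + 0.5) / num_filters))
        for k in range(num_filters)
    ])

    # Truncate the series: keep coefficients 1..num_coeffs (C0 is often replaced by energy).
    return cepstrum[1:num_coeffs + 1]
```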
In other descriptions of MFCCs, you might see things other than the log here.
Some other compressive function, such as the cube root.
That's a detail that's not conceptually important.
What's important is that it's compressive.
The log is the most well-motivated because it comes from the true cepstrum, which we got from the log magnitude spectrum.
It's the thing that turns multiplication into addition.
This truncation here serves several purposes.
It could be that our filterbank didn't entirely smooth away all the evidence of F0, and so truncation will discard any remaining fine detail in the filter outputs.
So we'll get a very smooth spectral envelope by truncating the series to remove any remaining evidence of the source, just in case the filterbank didn't do it completely.
That's number one.
The second thing it does - which we'll explain in a moment - is something we already alluded to: by expanding into a series of orthogonal basis functions, we find a set of coefficients that don't exhibit much covariance with each other.
We removed covariance through the series expansion.
Third, and just as interesting, we've got a place where we can control how many features we get.
12 is the most common choice, but you could vary that.
We get the 12 most important features: the ones that capture the detail in the spectral envelope up to a certain fineness.
It is a well-motivated way of controlling the dimensionality of our feature vector.
So did we remove covariance?
We could answer that question theoretically, which we'll try and do now.
We could also answer that question empirically (= by experiment): we could build one system from filterbank features and another from MFCCs, model both with multivariate Gaussians that have diagonal covariance, and compare how good the two systems are.
In general we'll find the MFCCs are better
The theoretical argument is that we expand it into a series of orthogonal basis functions, which have no correlation with each other.
They're independent; they're uncorrelated with each other.
And that's the theoretical reason why a series expansion gives you a set of coefficients which don't have covariance - or at least have a lot less covariance than the original filterbank coefficients.
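If you wanted to check that for yourself empirically, a rough sketch would be to measure how large the off-diagonal correlations are for filterbank features versus MFCCs extracted from the same frames. Everything here (the function name and the hypothetical log_fbank and mfccs arrays) is just for illustration:

```python
import numpy as np

def mean_abs_offdiag_correlation(features):
    """Average absolute off-diagonal correlation across feature dimensions.

    features : array of shape (num_frames, num_dims), one feature vector per frame
    """
    corr = np.corrcoef(features, rowvar=False)            # (num_dims, num_dims)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]   # drop the diagonal
    return np.mean(np.abs(off_diag))

# Hypothetical usage: log_fbank and mfccs are (num_frames, num_dims) arrays
# computed from the same utterances.
# print(mean_abs_offdiag_correlation(log_fbank))   # typically high
# print(mean_abs_offdiag_correlation(mfccs))       # typically much lower
```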
So finally, we've done cepstral analysis and we've got MFCCs.
MFCCs are what you're using in the assignment for the digit recogniser
You're using 12 MFCCs plus one other feature.
The other feature we use is the energy of the frame; it's very similar to the zeroth cepstral coefficient (C0), but is computed directly from the raw frame energy.
So you've actually got 13 features: energy plus 12 MFCCs.
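As a rough illustration (the frame here is just random samples standing in for real waveform data), frame energy is simply the sum of squared samples in the analysis frame; in practice a log is often taken to keep it on a scale comparable to the cepstral coefficients:

```python
import numpy as np

# Stand-in for one windowed frame of waveform samples.
frame = np.random.default_rng(2).normal(size=400)

energy = np.sum(frame ** 2)                 # raw frame energy
log_energy = np.log(energy + 1e-10)         # log is commonly used in practice

# feature_vector = np.concatenate(([log_energy], mfccs))   # 1 + 12 = 13 features
```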
But when you go and look in the configurations for the assignment, or you go look at your trained models, you'll find that you don't have 13.
You've actually got 39!
It's part of the assignment to figure out how we got from 13 to 39-dimensional feature vectors.
And now it's obvious why we don't want to model covariance in 39-dimensional feature space.
We'd have 39-by-39 covariance matrices.
We'd have to estimate all of those covariances from data.
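For concreteness: a full (symmetric) covariance matrix in 39 dimensions has 39 × 40 / 2 = 780 free parameters per Gaussian, compared with just 39 variances for a diagonal covariance, so going diagonal saves a great deal of estimation.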
So coming next:
we've got the Gaussian - that's going to do all of the work
we've seen that the Gaussian can be seen as a generative model
but a single Gaussian can just generate a single data point at a time.
But speech is not that.
Speech is a sequence of observations: a sequence of feature vectors.
A multivariate Gaussian can generate one observation: one feature vector for one frame of speech.
We need to put those Gaussians into a model that generates a sequence of observations.
And that's going to be a Hidden Markov model.
And that's coming up next.
That's the model for speech recognition in this course.
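As a very small preview of that generative view, here is a toy sketch (every number in it is made up purely for illustration) of an HMM whose states emit observations from diagonal-covariance Gaussians, generating a sequence one frame at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state left-to-right HMM with 2-dimensional Gaussian emissions.
# All parameter values are invented, just to show the generative process.
means     = np.array([[0.0, 0.0], [2.0, 1.0], [4.0, -1.0]])   # one mean per state
variances = np.array([[1.0, 1.0], [0.5, 0.5], [1.0, 2.0]])    # diagonal covariances
self_loop = 0.7                                               # probability of staying in a state

state, observations = 0, []
while state < len(means):
    # Each state's Gaussian generates exactly one observation (one feature vector) per frame.
    observations.append(rng.normal(means[state], np.sqrt(variances[state])))
    # The Markov chain decides whether to stay in this state or move on to the next.
    if rng.random() > self_loop:
        state += 1

observations = np.stack(observations)   # shape (num_frames, 2): a sequence of feature vectors
```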
We've then got to think: Well, what are we going to model with each hidden Markov model?
We're going to generate a sequence of feature vectors for a whole word.
That's what we're going to do in the assignment, because we've only got 10 words, so we can have just 10 models and we can have lots of examples of each of them to train those models on.
We'll see eventually, that doesn't generalise well to very large vocabularies.
So we're going to model sub-word units, and you might be able to already have a pretty good guess what a suitable sub-word unit might be.
We also need to think about how we're going to make a generative model that could generate a sequence of observations (sequence of feature vectors) that's not just a word, but a sequence of words: a sentence.
We've got to do connected speech.
Now it sounds like all of this stuff could be really, really hard.
But it'll turn out that, because the hidden Markov model is simply a finite state model, this is easy: everything will be a finite state model.
So the core ideas coming next:
in module 8, that's the hidden Markov model, and we'll develop that model
In just one remaining module we can cover everything we need to make a full connected speech recognition system by modelling sequences of words where the words are made of sub-word units
Then we finish with actually the hardest part, which is how to estimate the parameters of the HMM.
We've already thought about how we might do that for an individual Gaussian.
I give you 74 data points, and I tell you they were all emitted from this particular class (= they were labelled with that class).
You could easily fit a Gaussian to those 74 data points.
For example, the mean will be the sum of the data points divided by 74.
you could compute the variance as well.
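Here's that idea as a minimal numpy sketch (the data is random, standing in for 74 labelled feature vectors):

```python
import numpy as np

# Stand-in for 74 feature vectors, all labelled as coming from one class.
data = np.random.default_rng(1).normal(size=(74, 13))

# Mean: the sum of the data points divided by 74.
mean = data.sum(axis=0) / len(data)            # same as data.mean(axis=0)

# (Diagonal) variance: average squared deviation from that mean, per dimension.
variance = ((data - mean) ** 2).mean(axis=0)
```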
The problem with estimating the parameters of the HMM is that we won't know, frame by frame, which Gaussian each frame is assigned to,
so we can't directly do that.
We'll need to align the model and the data
so we need an algorithm for alignment.
So estimating the parameters of the HMM is the hardest part of the course, so we leave it for last!
However, we will see a way of aligning the model and the data much sooner than that.
When we want to do recognition with an HMM we will also have to decide how to allocate frames in the observation sequence to states in the HMM, so we can compute the probability of each observation coming from a particular state.
That's an alignment problem between a model and the sequence of observations, and it's exactly the same problem as Dynamic Time Warping: alignment between two sequences.
We can solve it with exactly the same algorithm: dynamic programming
we'll call that the Viterbi algorithm: that's dynamic programming as used with a hidden Markov model.
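To give a flavour of what that looks like (a sketch only: the per-frame log probabilities and the log transition probabilities are hypothetical inputs we haven't defined yet), the dynamic programming grid is filled in frame by frame and then traced back to recover the alignment:

```python
import numpy as np

def viterbi_alignment(log_obs_probs, log_trans):
    """Minimal Viterbi sketch: align frames to states by dynamic programming.

    log_obs_probs : (num_frames, num_states) log probability of each frame under each state
    log_trans     : (num_states, num_states) log transition probabilities
    Returns the best state index for each frame (the alignment).
    """
    T, S = log_obs_probs.shape
    delta = np.full((T, S), -np.inf)        # best partial-path score ending in each state
    back  = np.zeros((T, S), dtype=int)     # which previous state that best path came from

    delta[0, 0] = log_obs_probs[0, 0]       # assume we must start in the first state
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s]  = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_obs_probs[t, s]

    # Trace back from the best final state to recover the frame-to-state alignment.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```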
So that's where we're going.
