Now, what we just saw there was the true cepstrum, which we got from the original log magnitude spectrum. That's going to inspire one of the processing steps coming up now in the feature extraction pipeline for Mel Frequency Cepstral Coefficients, so let's get back on track with that.

We have the frequency domain representation: the magnitude spectrum. We've applied a filter bank to that, primarily to warp the scale to the mel scale, but we can also conveniently choose filter bandwidths that smooth away much of the evidence of F0. We'll then take the log of the output of those filters. So when we implement this, normally we have the linear magnitude spectrogram, apply the triangular filters, and then take the log of their outputs. You can plot those against frequency and draw something that's essentially the spectral envelope of speech.

We then take inspiration from the cepstral transform and do a series expansion of that, using cosine basis functions. That's called cepstral analysis, and it gives us the cepstrum. Then we decide how many coefficients we'd like to retain, so we truncate that series, typically at coefficient number 12, and that gives us these wonderful coefficients called Mel Frequency Cepstral Coefficients. In other descriptions of MFCCs you might see something other than the log here, some other compressive function such as the cube root. That's a detail that's not conceptually important.
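The pipeline just described (magnitude spectrum, triangular mel filters, log, truncated cosine series) can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the toolkit used in the course; the filter count, frame length, and mel formula are common conventions assumed here, not taken from the video.

```python
import numpy as np

def hz_to_mel(f):
    # A standard mel-scale formula (one of several in common use)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale, covering 0..sr/2
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, centre, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, centre):
            fbank[i - 1, k] = (k - lo) / max(centre - lo, 1)
        for k in range(centre, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - centre, 1)
    return fbank

def mfcc_from_frame(frame, sr, n_filters=26, n_ceps=12):
    # 1. magnitude spectrum of one windowed frame
    spectrum = np.abs(np.fft.rfft(frame))
    # 2. triangular mel filter bank, then 3. log of the filter outputs
    fbank = mel_filterbank(n_filters, len(frame), sr)
    log_energies = np.log(fbank @ spectrum + 1e-10)
    # 4. series expansion on cosine basis functions (a DCT),
    #    truncated: keep coefficients 1..n_ceps
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filters)
    return basis @ log_energies
```

Applied to one 512-sample frame of 16 kHz audio, this returns 12 coefficients, one frame of MFCCs.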
What's important is that it's compressive. The log is the best motivated, because it comes from the true cepstrum, which we get from the log magnitude spectrum: it's the thing that turns multiplication into addition.

This truncation serves several purposes. First, it could be that our filter bank didn't entirely smooth away all the evidence of F0, so truncation will discard any remaining fine detail in the filter outputs, giving us a very smooth spectral envelope. It removes any remaining evidence of the source, just in case the filter bank didn't do it completely. That's number one. The second thing it does, which we've already alluded to, is that by expanding into a series of orthogonal basis functions we find a set of coefficients that don't exhibit much covariance with each other: we've removed covariance through this series expansion. Third, and just as interesting, we've got a place where we can control how many features we get. 12 is the most common choice, but you could vary that, and we get the 12 most important features: the ones that capture the detail in the spectral envelope up to a certain fineness. It's a well-motivated way of controlling the dimensionality of our feature vector.

So, did we remove covariance? We could answer that question theoretically, which we'll try to do now. We could also answer it empirically, by experiment.
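To see the first purpose of truncation in action, we can simulate a log envelope with some fast residual F0 ripple on top, keep only the cepstral coefficients up to number 12, and reconstruct. The data here is entirely synthetic and illustrative; the point is just that truncating the cosine series keeps the slowly varying envelope and drops the fast ripple.

```python
import numpy as np
from scipy.fftpack import dct, idct

# A simulated "log filter bank output": one broad formant-like bump,
# plus fast ripple standing in for residual harmonic (F0) detail
n = 26
x = np.linspace(0.0, 1.0, n)
envelope = np.exp(-((x - 0.3) ** 2) / 0.02)   # smooth spectral envelope
ripple = 0.3 * np.cos(2 * np.pi * 10 * x)     # fast residual detail
log_spec = envelope + ripple

# Cepstral analysis: cosine-series expansion (DCT), then truncate the
# series, keeping coefficients 0..12 and zeroing the rest
c = dct(log_spec, type=2, norm='ortho')
c_trunc = np.concatenate([c[:13], np.zeros(n - 13)])

# Reconstructing from the truncated series gives a smoothed envelope:
# the fast ripple lived in the high-order coefficients we discarded
smoothed = idct(c_trunc, type=2, norm='ortho')
```

Comparing `smoothed` with `log_spec` shows the reconstruction is much closer to the true `envelope` than the rippled input was: truncation has removed the remaining evidence of the source.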
We could do experiments where we use multivariate Gaussians with diagonal covariance, build one system with filter bank features and one with MFCCs, and compare how good they are; in general, we'll find that MFCCs are better. The theoretical argument is that we expanded into a series of orthogonal basis functions. Because the basis functions are orthogonal, the resulting coefficients are largely uncorrelated with each other, and that's the theoretical reason why a series expansion gives you a set of coefficients which don't have covariance, or at least have a lot less covariance than the original filter bank coefficients.

So finally, we've done cepstral analysis and we've got MFCCs. MFCCs are what you're using in the assignment for the digit recogniser: 12 MFCCs plus one other feature. The other feature we use is just the energy of the frame, which is very similar to the zeroth cepstral coefficient; it's computed from the raw frame energy. So you've actually got 13 features there: energy plus 12 MFCCs. But when you go and look in the configurations for the assignment, and look at your trained models, you'll find that you don't have 13. You've actually got 39, and it's part of the assignment to figure out how we got from 13- to 39-dimensional feature vectors. And now it's obvious why we don't want to model covariance in the 39-dimensional feature space: we'd have 39 by 39 covariance matrices, and we'd have to estimate all of those covariances from data.

So, coming next, we've got the Gaussian that's going to do all of the work. We've seen that the Gaussian can be seen as a generative model, but a single Gaussian can only generate a single data point at a time, and speech is not that.
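The decorrelation claim can also be checked empirically, in the spirit of the experiment just described: generate vectors that are correlated across the filter axis (as neighbouring log filter bank outputs are), take the cosine transform, and compare the average off-diagonal correlation before and after. The simulated data below is a stand-in for real log filter bank outputs, not measurements from speech.

```python
import numpy as np
from scipy.fftpack import dct

rng = np.random.default_rng(0)
n_frames, n_filters = 2000, 26

# Simulated log filter-bank outputs: smoothing white noise across the
# filter axis makes neighbouring filters strongly correlated
white = rng.standard_normal((n_frames, n_filters + 4))
kernel = np.ones(5) / 5.0
fbank = np.array([np.convolve(row, kernel, mode='valid') for row in white])

# Cepstral coefficients: cosine-series expansion across the filter axis
ceps = dct(fbank, type=2, norm='ortho', axis=1)

def mean_offdiag_corr(features):
    # average absolute correlation between different dimensions
    c = np.corrcoef(features, rowvar=False)
    mask = ~np.eye(c.shape[0], dtype=bool)
    return np.abs(c[mask]).mean()

corr_fbank = mean_offdiag_corr(fbank)   # high: neighbours co-vary
corr_ceps = mean_offdiag_corr(ceps)     # much lower after the DCT
```

The cepstral coefficients show far less correlation between dimensions than the filter bank features did, which is exactly why a diagonal-covariance Gaussian is a better fit for MFCCs.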
Speech is a sequence of observations, a sequence of feature vectors. So we need a multivariate Gaussian that can generate one observation, one feature vector, for one frame of speech, and we need to put those Gaussians into a model that generates a sequence of observations. That's going to be the Hidden Markov Model, and that's coming up next: the model for speech recognition in this course. We've then got to think about what we're going to model with each Hidden Markov Model.
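The idea of a diagonal-covariance Gaussian as a generative model for a single frame can be sketched directly. The class below is purely illustrative (13 dimensions, matching energy plus 12 MFCCs from the assignment); note that it can only emit one feature vector at a time, so producing a realistic sequence still needs an outer model to govern which Gaussian generates each frame, and that's the HMM's job.

```python
import numpy as np

rng = np.random.default_rng(1)

class DiagonalGaussian:
    """A generative model for a single feature vector (one frame).

    Diagonal covariance: one variance per dimension and no covariances,
    which is reasonable because MFCC dimensions are largely uncorrelated.
    """
    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)

    def generate(self):
        # Emit one observation: a single 13-dimensional feature vector
        return self.mean + np.sqrt(self.var) * rng.standard_normal(self.mean.shape)

    def log_likelihood(self, x):
        # log N(x; mean, diag(var)): with a diagonal covariance this is
        # just a sum of per-dimension terms, cheap to compute
        return -0.5 * np.sum(np.log(2.0 * np.pi * self.var)
                             + (x - self.mean) ** 2 / self.var)

g = DiagonalGaussian(mean=np.zeros(13), var=np.ones(13))
obs = g.generate()                                   # one frame's feature vector
seq = np.array([g.generate() for _ in range(50)])    # naive "sequence": no temporal
                                                     # structure; that needs the HMM
```

Stacking independent draws like `seq` ignores how speech evolves over time, which is precisely the gap the Hidden Markov Model fills.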
From MFCCs, towards a generative model using HMMs
Overview of steps required to derive MFCCs, moving towards modelling MFCCs with Gaussians and Hidden Markov Models