Cepstral Analysis, Mel-Scale Filterbanks, MFCCs

We now start thinking about what a good representation of the acoustic signal should be, motivating the use of Mel-Frequency Cepstral Coefficients (MFCCs).

00:0000:12 So, on with the remainder of the class, which is all about feature engineering, and we'll start with the magnitude spectrum and the filter bank features that we've seen in the last module.
00:1200:13 Quick recap.
00:1300:16 We extract frames from a speech waveform.
00:1600:19 This is in the time domain.
00:1900:23 We extract short-term analysis frames to avoid discontinuities at the edges.
00:2300:27 We apply the discrete Fourier transform and we get the magnitude spectrum.
00:2700:32 This is written on a logarithmic scale, so this is the log magnitude spectrum.
00:3200:37 And from that, we extract filter bank features.
00:3700:43 Filter bank features are found from a single frame of speech in the magnitude spectrum domain.
00:4300:50 We divide it into bands, spaced on a mel scale, and sum the energy in each band and write that into an element of the feature vector.
00:5000:58 We typically use triangular shape filters by spacing their centers and their widths on the mel scale.
00:5801:04 They get further and further apart and wider and wider as we go up frequency in hertz.
01:0401:16 And the energy in each of those, for example, the energy summed in this one, is written into the first element of our feature vector, of a multidimensional feature vector.
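The pipeline just recapped (frame → magnitude spectrum → mel-spaced triangular filters → summed band energies written into a feature vector) can be sketched in NumPy. This is a minimal illustration, not the course's reference implementation; the particular mel formula, the filter count of 24, and the random-noise stand-in for a speech frame are all assumptions made here for the sake of a runnable example.

```python
import numpy as np

def hz_to_mel(f):
    # One common convention for the mel scale (an assumption; variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centres are equally spaced on the mel scale,
    # so in Hz they get further apart and wider as frequency goes up.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bins = np.floor((n_fft + 1) * hz_edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):          # rising edge of the triangle
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge of the triangle
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

# One tapered analysis frame -> magnitude spectrum -> filterbank energies
sr, n_fft = 16000, 512
frame = np.random.randn(n_fft) * np.hamming(n_fft)   # stand-in for real speech
mag = np.abs(np.fft.rfft(frame))                     # magnitude spectrum
fbank = mel_filterbank(24, n_fft, sr)
log_energies = np.log(fbank @ mag + 1e-10)           # 24-dimensional feature vector
```

Each element of `log_energies` is the (log) energy summed under one triangular filter, exactly as described for the feature vector above.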
01:1601:21 Now let's make it really clear that the features in this feature vector will exhibit a lot of covariance.
01:2101:23 They are highly correlated with each other.
01:2301:26 And that will become very obvious when we look at this animation.
01:2601:35 So remember that each band of frequencies is going into the elements of this feature vector, extracted with these triangular filters.
01:3501:40 So when I play the animation, just look at energy in adjacent frequency bands.
01:4701:53 Clearly, the energy in this band and the energy in this band are going up and down together.
01:5301:54 See that?
01:5401:56 When this one goes up, this one tends to go up.
01:5602:00 And when this one goes down, this one tends to go down.
02:0002:02 They're highly correlated.
02:0202:06 They co-vary because they're adjacent.
02:0602:10 And the spectral envelope is a smooth thing.
02:1002:13 And so these features are highly correlated.
02:1302:21 And if we wanted to model this feature vector with a multivariate Gaussian, it would be important to have a full covariance matrix to capture that.
02:2102:31 So the filter bank energies themselves are perfectly good features, unless we want to model them with a diagonal covariance Gaussian, which is what we've decided to do.
02:3102:34 So we're going to do some feature engineering to get around that problem.
02:3402:40 So we're going to build up to some features called Mel Frequency Cepstral Coefficients.
02:4002:46 And MFCCs, as they're usually called, take inspiration from a number of different directions.
02:4602:55 And one strong way of motivating them is to actually go back and think about speech production again and to remember what we learned a little while ago about convolution.
02:5503:03 Knowing what we know about convolution, we're going to derive a form of analysis called Cepstral analysis.
03:0303:05 Cepstral is an anagram of spectral.
03:0503:07 It's a transform of spectral.
03:0803:18 So here's a recap that convolution in the time domain, convolution of waveforms, is equivalent to addition in the log magnitude spectrum domain.
03:1803:22 So just put aside the idea of the filter bank for a moment and let's go right back to the time domain and start again.
03:2203:25 This is our idealized source.
03:2503:29 This is the impulse response of the vocal tract filter.
03:2903:35 And if we convolve those, that's what the star means, we'll get our speech signal in the time domain.
03:3503:42 And that's equivalent to transforming each of those into the spectral domain and plotting on a log scale.
03:4203:47 And then we see that their log magnitude spectra add together.
03:4703:52 So this becomes addition in the log magnitude spectrum domain.
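This recap can be checked numerically: convolving a toy impulse-train source with a toy impulse response, the magnitude spectra multiply exactly, and therefore the log magnitude spectra add. Both signals are stand-ins invented for the example, not real speech.

```python
import numpy as np

rng = np.random.default_rng(1)
source = np.zeros(256)
source[::32] = 1.0                          # idealised impulse-train source
h = rng.standard_normal(16) * np.exp(-np.arange(16) / 4.0)  # toy impulse response

speech = np.convolve(source, h)             # convolution (the star) in the time domain
N = len(speech)

S = np.fft.rfft(speech)                     # spectrum of the "speech"
X = np.fft.rfft(source, N)                  # spectrum of the source, zero-padded to match
H = np.fft.rfft(h, N)                       # frequency response of the filter

# |S| = |X| . |H|, so log|S| = log|X| + log|H|: addition in the log magnitude domain
print(np.allclose(np.abs(S), np.abs(X) * np.abs(H)))   # True
```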
03:5203:59 So convolution is a complicated operation and we might imagine perfectly reasonably that deconvolution is very hard.
03:5904:09 In other words, going backwards and decomposing the speech into source and filter in the time domain is hard, but we could imagine that undoing an addition is rather easier.
04:0904:12 And that's exactly what we're about to do.
04:1204:16 How do we get from the time domain to the frequency domain?
04:1604:17 We'll use the Fourier transform.
04:1704:19 The Fourier transform is just a series expansion.
04:1904:27 So to get from this time domain signal to this frequency domain signal, we did a transform that's a series expansion.
04:2804:41 And that series expansion had the effect of turning this axis from time in units of seconds to this axis in frequency, which has units of what?
04:4104:42 Hertz.
04:4204:47 But Hertz are just one over seconds.
04:4704:52 So the series expansion has the effect of changing the axes to be one over the original axis.
04:5305:00 So we start with something in the time domain, we end up with something in the one over time domain or frequency domain.
05:0005:03 So that's a picture of speech production.
05:0305:06 But we don't get to see that when we're doing speech recognition.
05:0605:10 All we get is a speech signal from which we can compute the log magnitude spectrum.
05:1005:13 What would I like to get for doing automatic speech recognition?
05:1305:16 Well, I've said that fundamental frequency is not of interest.
05:1605:19 I would like the vocal tract frequency response.
05:1905:22 That's the most useful feature for doing speech recognition.
05:2205:24 But what I start with is this.
05:2405:26 So can I undo the summation?
05:2605:30 So that's how speech is produced, but we don't have access to that.
05:3005:31 We would like to do this.
05:3105:37 We would like to start from this, which we can easily compute with a Fourier transform from an analysis frame of speech.
05:3705:46 And we would like to decompose that into a sum of two parts, filter plus source.
05:4605:50 And then for speech recognition, we'll just discard the source.
05:5005:53 How might you solve this equation given only the thing on the left?
05:5305:58 Well, one obvious option is to use a source filter model.
05:5806:04 We could use a filter that's defined by its difference equation, and we could solve for the coefficients of that difference equation.
06:0406:06 And that will give us this part.
06:0606:17 And then we could just subtract that from the original signal and whatever's left must be this part here, which we might then call a remainder or more formally call it the residual.
06:1706:19 And we'll assume that was the source.
06:1906:26 So that will be an explicit way of decomposing into source and filter, and we'd get both the source and the filter.
06:2606:29 But actually we don't care about the source for speech recognition.
06:2906:33 We just want the filter so we can do something a little bit simpler.
06:3306:38 Fitting an explicit source filter model involves making very strong assumptions about the form of the filter.
06:3806:47 For example, if it's a resonant filter, it's all pole, and that the difference equation has a very particular form with a particular number of coefficients.
06:4706:49 We might not want to make such a strong assumption.
06:4906:56 And solving for that difference equation, solving for the coefficients given a frame of speech waveform can be error prone.
06:5606:59 And it's actually not something we cover in this course.
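For the curious, the explicit decomposition just described can be sketched with the autocorrelation method: solve the normal equations for the coefficients of an all-pole filter, predict each sample from the previous ones, and call what is left over the residual. This is a rough illustration run on noise, under assumptions of my own (order 12, a 400-sample frame); real linear prediction implementations are considerably more careful, and this is not the course's material.

```python
import numpy as np

def lpc(frame, order):
    # Autocorrelation method: solve the normal equations for an all-pole
    # (resonant) filter's predictor coefficients.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

rng = np.random.default_rng(2)
frame = rng.standard_normal(400)         # stand-in for one frame of speech
order = 12
a = lpc(frame, order)

# Predict each sample from the previous 12 and subtract: what remains is
# the residual, which we would treat as the source.
pred = np.zeros_like(frame)
for n in range(order, len(frame)):
    pred[n] = a @ frame[n - order:n][::-1]
residual = frame - pred
```

The filter is characterised by `a`, the residual stands in for the source; as the lecture notes, for recognition we would keep only the filter part.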
06:5907:07 So I want to solve this apparently difficult-to-solve equation, where we know the thing on the left and we want to turn it into a sum of two things on the right.
07:0707:12 These two things have quite different looking properties.
07:1207:20 With respect to this axis here, which is frequency, this one's quite slowly varying, smooth and slowly varying.
07:2007:23 It's a slow function of frequency.
07:2307:28 With respect to the frequency axis, this one here, it's quite rapidly moving.
07:2807:31 It changes rapidly with respect to frequency.
07:3107:40 So we would like to decompose this into the slowly varying part and the rapidly varying part.
07:4107:46 And I mean slowly and rapidly varying with respect to this axis, the frequency axis.
07:4607:53 So I can't directly do that into these two parts, but I can write something more general down like this.
07:5308:21 I can say that a log magnitude spectrum of an analysis frame of speech equals something plus something, plus something, plus something, and so on, where we start off with very slowly varying parts and then slightly quicker varying all the way up to eventually very rapidly varying parts.
08:2108:22 Does that look familiar?
08:2208:24 I hope so.
08:2408:29 That's a series expansion, not unlike Fourier analysis.
08:2908:30 So it's a transform.
08:3008:38 We're going to transform the log magnitude spectrum into a summation of basis functions weighted by coefficients.
08:3808:47 Well, we could use the same basis functions as Fourier analysis, in other words, sinusoids with magnitude and phase or any other suitable set of basis functions.
08:4708:50 The only important thing is that the basis functions have to be orthogonal.
08:5008:56 So they have to be a series of orthogonal functions that don't correlate with each other.
08:5609:01 So go and revise the series expansion video if you need to remember what that is.
09:0109:06 In this particular case, we're doing a series expansion of the log magnitude spectrum.
09:0609:12 The most popular choice of basis functions is actually a series of cosines where we just need the magnitude of each.
09:1209:14 There's no phase, they're just exactly cosines.
09:1409:18 That suits the particular properties of the log magnitude spectrum.
09:1809:20 So we're going to write down this.
09:2009:51 This part here equals a sum of some constant function times some coefficient, a weight, plus some amount of this function, that's the lowest frequency cosine we can fit in there, times some weight, plus some amount of the next one, plus some amount of the next one, and so on for as long as we like.
09:5109:56 There's a set of orthogonal basis functions that's a cosine series.
09:5610:04 It starts with this one, which is just the offset, the zero frequency component, if you like, and then works its way up through the series.
10:0410:13 And so we'll be characterizing the log magnitude spectrum by these coefficients.
10:1310:15 This is a cosine transform.
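The series expansion just described can be written out directly: build an orthonormal cosine basis (this construction is the DCT-II; the exact choice of basis is, as noted below, a detail), and each coefficient is the weight on one basis function. Because the basis functions are orthogonal, summing all of them back, weighted by their coefficients, recovers the spectrum exactly. The random "speech" frame is a stand-in for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
frame = rng.standard_normal(512) * np.hamming(512)   # stand-in for one frame of speech
log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
N = len(log_mag)

# Orthonormal cosine basis (DCT-II): row k is the k-th basis function,
# from the constant (k = 0) up to the most rapidly varying.
n = np.arange(N)
basis = np.cos(np.pi * np.outer(np.arange(N), 2 * n + 1) / (2 * N))
basis[0] *= np.sqrt(1.0 / N)
basis[1:] *= np.sqrt(2.0 / N)

coeffs = basis @ log_mag        # one weight per basis function

# Orthogonality makes the expansion exact: summing all the weighted
# basis functions reconstructs the log magnitude spectrum.
recon = basis.T @ coeffs
print(np.allclose(recon, log_mag))   # True
```

Nothing here requires the input axis to be frequency; the same code would expand any function, which is exactly the point made next.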
10:1510:22 Lots and lots of textbooks give you a really, really useless and unhelpful analogy to try and understand what's happening here.
10:2210:23 They'll say the following.
10:2310:30 They say, let's pretend this is time and then do the Fourier transform, but we don't need to do that.
10:3010:32 A series expansion is a series expansion.
10:3210:38 There's nothing here that requires this label on this axis to be frequency, time, or anything else.
10:3810:40 You don't need to pretend that's a time axis.
10:4010:42 You don't need to pretend this is a Fourier transform.
10:4210:44 This is just a series expansion.
10:4410:50 You've got some complicated function and you're expressing it as a sum of simple functions weighted by coefficients.
10:5010:55 It's those coefficients that characterize this particular function that we're expanding.
10:5511:15 Now, how does this help us solve what we wanted to solve, which was to write this magnitude spectrum as a sum of just two parts, a slowly varying part with respect to frequency that we'll say is the filter, and a rapidly varying part that we'll say is the source? Because here we've got some long series of parts.
11:1511:28 Well we'll just work our way up through the series and at some point say that's as rapidly varying as we ever see the filter with respect to frequency and we'll stop and draw a line and everything after that we'll say is not the filter.
11:2811:44 So we'll just count up through these series and at some point we'll stop and we'll say the slow ones are the filter and the rapid ones are the source.
11:4411:54 So all we've got to do in fact now is decide where to draw the line and there's a very common choice of value there and that's to keep the first 12 basis functions.
11:5411:55 Counting this one as number one.
11:5511:56 This is a special one.
11:5611:58 It's just the energy and we count that as zero.
11:5812:05 So that's the first basis function, the second, the third, and we go up to the twelfth.
12:0512:15 And in other descriptions of Cepstral analysis, particularly of the form used to extract MFCCs, you might see choices other than the cosine basis functions.
12:1512:18 You could use the Fourier series for example.
12:1812:20 That's a detail that's not important.
12:2012:28 The important thing is conceptually this is a series expansion into a series of orthogonal basis functions.
12:2812:31 Exactly what functions you expand it into is less important.
12:3112:33 We won't get into an argument about that.
12:3312:38 We'll just say cosine series.
12:3812:42 What is the output of that series expansion going to look like?
12:4212:51 Well, just like any other series expansion such as the Fourier transform, we'll plot those coefficients on an axis.
12:5112:56 This is frequency in Hertz.
12:5612:58 Frequency is the same as one over time.
12:5813:02 So that series expansion gives you a new axis, which is one over the previous axis.
13:0213:06 So one over one over time is time.
13:0613:10 So actually we're going to have something that's got a time axis, time scale.
13:1013:15 This is going to be the size of the coefficient of each of the basis functions.
13:1513:21 So we're going to go for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 basis functions.
13:2113:31 And at that point we'll say these guys belong to the filter and then everything else to the right we're actually going to discard because it belongs to the source.
13:3113:32 But let's see what it might look like.
13:3213:35 Well, we just have some values.
13:3513:40 This thing is going to be hard to interpret, going to be the coefficients of the cosine series.
13:4013:50 But what we'll find if we kept going well past 12, at some point there would be a coefficient at some high time where we get some spike again.
13:5113:54 Let's think about what that means on the magnitude spectrum.
13:5413:57 So this is the cosine series expansion.
13:5714:02 These lower coefficients here, now, what is this axis called? It is time.
14:0214:09 But because this is a transform from frequency onwards, we don't typically label it with time.
14:0914:14 We label it with an anagram of frequency, and people use the word quefrency.
14:1414:18 Don't ask me why.
14:1814:20 It's the units are seconds.
14:2114:26 So this low quefrency coefficient is the slowly moving part.
14:2614:35 One of these perhaps here is some faster moving part, up here some faster moving part, perhaps this one.
14:3514:43 And eventually this one will be the one that moves at this rate.
14:4314:51 This is a cosine function that happens to just snap onto the harmonics and it will just match the harmonics.
14:5114:57 So this one here is going to be the fundamental period.
14:5715:01 Because this matches the harmonics at F0.
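That spike can be demonstrated numerically: for a toy signal built from an impulse train of known period, the cepstrum (here computed as the inverse transform of the log magnitude spectrum) shows a peak at a quefrency equal to the fundamental period, measured in samples. The signal, FFT length, and search range are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
period = 100                                    # fundamental period in samples
source = np.zeros(800)
source[::period] = 1.0                          # idealised voiced source
h = rng.standard_normal(30) * np.exp(-np.arange(30) / 6.0)
x = np.convolve(source, h)                      # toy "speech" signal

N = 1024
log_mag = np.log(np.abs(np.fft.rfft(x, N)) + 1e-8)
cepstrum = np.fft.irfft(log_mag)                # quefrency axis: units of samples

# Low quefrencies describe the smooth envelope (the filter); the spike at
# higher quefrency sits at the fundamental period of the source.
peak = 50 + np.argmax(cepstrum[50:200])
print(peak)   # close to 100, the fundamental period
```

This is the basis of cepstral pitch estimation, though for recognition we discard that region rather than measure it.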
15:0115:12 For our purposes, we're going to stop here, going to throw away all of these as being the fine detail and retain the first 12, 1 to 12.
15:1215:19 So this truncation is what separates source and filter and specifically it just retains the filter and discards the source.
15:1915:24 Truncation of a series expansion is actually a very well principled way to smooth a function.
15:2415:36 So essentially we just smooth the function on the left to remove the fine detail, the harmonics, and retain the detail up to some certain scale and our scale is up to 12 coefficients.
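The truncation-as-smoothing point can be demonstrated: keep coefficient zero (the energy) plus the first 12, zero out the rest, and transform back. The reconstruction varies far less from bin to bin than the original log magnitude spectrum, i.e. the fine detail has been smoothed away. Again a sketch on a stand-in frame, not a full MFCC implementation (no mel filterbank here).

```python
import numpy as np

rng = np.random.default_rng(5)
frame = rng.standard_normal(512) * np.hamming(512)   # stand-in for one frame of speech
log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
N = len(log_mag)

# Orthonormal cosine basis (DCT-II), as in the series expansion above
n = np.arange(N)
basis = np.cos(np.pi * np.outer(np.arange(N), 2 * n + 1) / (2 * N))
basis[0] *= np.sqrt(1.0 / N)
basis[1:] *= np.sqrt(2.0 / N)
coeffs = basis @ log_mag

# Truncate: keep coefficient 0 (energy) plus coefficients 1..12, zero the rest.
truncated = np.zeros_like(coeffs)
truncated[:13] = coeffs[:13]
envelope = basis.T @ truncated       # smoothed spectrum: fine detail removed

# The truncated reconstruction varies far less from bin to bin.
print(np.abs(np.diff(envelope)).sum() < np.abs(np.diff(log_mag)).sum())   # True
```

The harmonics live in the discarded high-quefrency coefficients; what survives is the smooth envelope, which is what we keep as the filter.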
