Total video to watch in this section: 44 minutes (divided into 4 parts for easier viewing)
Here are the slides for these videos.
Assumptions & references made in these videos
The opening slide in the video “Intro to Generative Models” says “Module 7” – this is now Module 8.
The video “Intro to Generative Models” assumes familiarity with discrete and continuous (random) variables, and Bayes rule. The foundation material on probability will be enough for you to understand this video, so go over that first if you need to. We will then consolidate these topics in the lecture for this module.
The last slide in the video “From MFCCs, towards a generative model using HMMs” states that HMMs are in Module 8: that should now read Module 9. It states that estimating the parameters of an HMM is in Module 9: that should now read Module 10.
Whilst the video is playing, click on a line in the transcript to play the video from that point. This module has two parts. In the first part, we're going to start building the Hidden Markov Model, but only get as far as deciding to use Gaussian probability density functions as the heart of the model. And that will make us pause for thought and think about the features that we need to use with Gaussian probability density functions. So in this first part, I'm going to introduce a rather challenging concept, and that's the idea of a generative model. And we'll choose the Gaussian as our generative model for the features we have for speech recognition. The main conclusion from that part will be that we don't want to model covariance. That is, we're going to have a multivariate Gaussian. That's a Gaussian in a multidimensional feature space: the feature vectors extracted from each frame of speech. And we would like to assume that there's no covariance between any pair of elements in that feature vector. For the features that we've seen so far, filter bank features, that's not going to be true. So those features are not going to be directly suitable for modeling with a Gaussian that has no covariance parameters. We often call such a Gaussian diagonal covariance, because only the diagonal entries of its covariance matrix have non-zero values. So we'll conclude that we want diagonal covariance Gaussians. We'll realize that filter bank features are not suitable for that, and we need to do what's called feature engineering to those features: some further steps of processing to remove that covariance. And that will result in the final set of features that we're going to use in this course, which are called Mel Frequency Cepstral Coefficients. And to get to them, we'll need to understand what the cepstrum is. So let's orient ourselves in the course as a whole. We're on a journey towards HMMs. 
We've talked about pattern matching, and we've seen that already in the form of dynamic time warping. There was no probability there. We were comparing a template, an exemplar, with an unknown and measuring distance. And the distance measure was a Euclidean distance for one feature vector in the exemplar compared to one feature vector in the unknown. The Euclidean distance is really what's at the core of the Gaussian probability density function. Our conclusion from doing simple pattern matching was that it fails to capture the variability of speech, the natural variability around the mean. A single exemplar is like having just the mean, but we want also the variability around that. That's why we're moving away from exemplars to probabilistic models. So to bridge between pattern matching and probabilistic generative modelling, we're going to have to think again about features, because there's going to be an interaction between the choice of features and choice of model. In particular, we're going to choose a model that makes assumptions about the features, and then we'll need to do things to the features to make those assumptions as true as possible. We've already talked enough about probability and about the Gaussian probability density function in last week's tutorials. We've also gone over why we don't want to model covariance. We also looked at human hearing, and we decided to take a little bit of inspiration from how the cochlea operates. The core reason we don't want to model covariance is it massively increases the number of parameters in the Gaussian, and that would require a lot more training data to get good estimates of all those parameters. And we're not going to explicitly model human hearing. We're going to use it as a kind of inspiration for some of the things we do in feature extraction. So without repeating too much material from last week, let's have a very quick recap of some concepts that we're going to need. 
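The parameter-count argument can be made concrete with a little arithmetic. The sketch below is illustrative (the function name and choice of dimensionalities are mine, not from the course): a d-dimensional Gaussian needs d mean parameters, plus d variances if the covariance matrix is diagonal, but d(d+1)/2 parameters if it is full (it's symmetric, so only the upper triangle counts).

```python
def gaussian_param_count(d, diagonal):
    """Free parameters of a d-dimensional Gaussian.

    Mean: d parameters. Covariance: d variances if diagonal,
    d*(d+1)//2 entries if full (symmetric matrix).
    """
    mean = d
    cov = d if diagonal else d * (d + 1) // 2
    return mean + cov

for d in (2, 12, 39):
    print(f"d={d:2d}: full={gaussian_param_count(d, False):4d} "
          f"diagonal={gaussian_param_count(d, True):3d}")
```

For 39-dimensional features (the dimensionality used later in this module), a full-covariance Gaussian has 819 parameters against just 78 for a diagonal one, which is why the diagonal assumption needs so much less training data.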
We're modelling features such as the output of a filter in the filter bank, and that's a continuously valued thing. So we're going to need to deal with continuous variables, but also we're going to need to deal with discrete variables. At some point, we're going to describe speech as a sequence of words. Words are discrete. We can count the number of types, and maybe we're going to decompose the word into a sequence of phonemes, which is also a countable thing, a discrete thing. So we're going to need to deal with both discrete and continuous variables, and these are going to be random variables. They're going to take a distribution over a set of possible values or over a range of possible values. An example of a discrete variable then would be the word, and in the assignment, that will be one of 10 possible values. An example of a continuous variable will be the output from one of the filters in the filter bank, or indeed the vector of all the outputs stacked together. That's just a vector space, and that is also a continuous variable in a multidimensional space. And we need different types of probability distribution to represent them. But apart from that, everything else that we've learned about probability theory, and in particular, this idea of joint distributions, conditional distributions, and the all-important Bayes formula, the theorem that tells us how to connect all of these things, applies to both discrete and continuous variables. It's true for all of them. And in fact, eventually, we're going to write down Bayes formula with a mixture of discrete and continuous variables in it, words and acoustic features. But that won't be a problem, because there's nothing in all of what we learned about Bayes that tells us it only works for one or the other. It works for both discrete and continuous variables. So I'm not going to go over that material again. I'm going to assume that we know that. So let's proceed. 
We can use probability distributions to describe data. That's kind of a summary of the data. And when we do that, we're making an assumption about how those data are distributed. For example, if we describe data using a Gaussian distribution, we're assuming that they have a normal distribution. And we might even make some stronger assumptions: that they have no covariance. But before choosing the Gaussian, let's stay a little abstract for a while. We're not going to choose any particular model. We're going to introduce a concept, a difficult concept. This will be difficult to understand the first time you encounter it. It's a different way of thinking about what we do when we have a model. What is a model? What is a model for? What can it do? We're going to actually build the simplest possible form of model. So a generative model is actually as simple as it gets, because a generative model can only do one thing. It can generate examples from its own class. So there's a difficult concept to learn, but it's necessary because it allows us to build the simplest form of model, the generative model. So here's the conceptual leap. Here are three models of three different classes, A, B, and C. We're keeping this abstract. I'm not saying anything about what form of model we're using. So these are black boxes. And the only thing that a model can do (model A here, for example) is emit observations. Let's make it do that. The observations here are balls, and they just have one feature, and that's their colour. So that just emitted a blue ball. And that's all this model can do. We can randomly emit observations. We can generate from the model. So let's generate. There's another one. So we're generating observations. The observations are data points and so on. And we keep generating. So this is a random generator, and it generates from its distribution. So inside model A is a distribution over all the different colours of balls that are possible. 
And it randomly chooses from that according to that distribution and emits an observation. So, keeping things abstract and not making any statements about what form the model has (we haven't given it any particular probability distribution), what are we going to say about model A? Well, we could imagine that it essentially contains lots and lots of data from class A, like that. And we might have other models too, for other classes. And we could imagine for now that they just contain a lot of examples of their classes, ready to be generated, ready to be emitted. That's model B and model C. And you can see that it seems likely that model A is going to emit a different distribution of values, of colours, than model B. And model B is going to emit a different distribution to model C. So let's see what we could do with something like that. So we've got three generative models. All they can do is emit observations. Already, and keeping things very abstract without saying anything about what type of model we're using, let's see already that this simple set of three models, A, B and C, one for each of the classes, can do classification. Even though all the models can do is emit observations, they don't make decisions, but we can use them to make decisions, to classify. So let's see how a set of generative models can together be used to classify new samples, test samples. So here comes a data point. There it is. It's a ball and its feature is green. And I would like to answer the question, which of these models is most likely to have generated it? And if I can find the most likely model, I'll label this data point with the class of that model. So we're keeping the models abstract. We're pretending that they just contain big bags of balls. So that's what our models look like inside. So we try and generate from the model and we make the model generate this observation. 
One way of thinking about that, if you want to think in a frequentist way, a counting way, is to emit a batch of observations from each model and count how many of them were green. That's the frequentist view. The Bayesian view is just to say, what's the fraction, what's the probability of emissions of observations from a model that are green? And we can look at model A and it's pretty obvious that it's never going to emit a green ball. So the probability of this green ball having been emitted from model A is a probability of zero. I'll write it with one decimal place for precision. Look at model B. It's also clear just by looking that model B never emits green balls either. And the probability of that emitting a green is 0.0. Now let's look at model C. You can see that in model C, it emits blue balls and yellow balls and green balls, let's say with approximately equal proportions. So there's about a 0.3 probability of model C emitting a green ball. Clearly, this is the largest probability of the three. We'll classify this as being of class C. Notice that each model could compute only its own probability. It didn't need to know anything about the other models to do that. And each model on its own made no decisions. It merely placed a probability on this observed green ball. We then made the decision by comparing the probabilities. So the classifier is the comparison of the probabilities of these models emitting this observation. That's what makes these models simple, that they don't need to know about each other. They're all quite separated. And in particular, when we learn these models from data, we'll be able to learn them from just data labeled with their class. They won't need to see counter examples. They won't need to see examples from other classes. And that's the simplicity of the generative model. 
Other models are available, models that might be called discriminative models, that know about where the boundary is between, for example, class A and class B, and can draw that decision boundary and directly make the decision as to whether an observation falls on the A side of that line or the B side of that line, of that classification boundary. But these generative models do not do that. We have to form a classifier by comparing their probabilities. Let's classify another observation. So along comes an observation, and our job is to label it as being most likely class A or class B or class C. We just do the same operation again. We ask model A, model A, what is the probability that you generate a red ball? We can see just visually for model A, maybe it emits red balls about two thirds of the time. So model A says the probability of me emitting a red ball is about 0.6. Ask model B, model B, what's the probability that you emitted a red ball? Well, we should look, and model B emits red balls and yellow balls and blue balls. They're in about equal proportion. So it says I can emit red balls, and I can do it about a third of the time. Model C, what's the probability that you emitted a red ball? Well, it never emits a red ball. Zero. How do we make the classification decision? We compare these three probabilities, we pick the largest, and we label this test sample with the class of the highest probability model, which is an A. So this is labeled as A. So we've done classification. So you get the idea. Along comes a sample. You ask each of the models in turn, what's the probability that this is a sample from you, from your model, from your internal distribution, whatever that might be. We compute those three probabilities and pick the highest and label the test sample with that class. So as simple as generative models are, they can do classification simply by comparing probabilities.
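The whole procedure can be sketched in a few lines of Python. The class-conditional distributions below are hypothetical, chosen only to roughly match the proportions described above (A mostly red, B an even mix of red, blue and yellow, C an even mix of blue, yellow and green); the point is that each model reports only its own probability, and the decision is made by comparing across models.

```python
# Toy generative classifier in the spirit of the "bags of balls" example.
# Each model is just a class-conditional distribution over one discrete
# feature (colour). These numbers are made up for illustration.
models = {
    "A": {"red": 0.6, "blue": 0.3, "yellow": 0.1},
    "B": {"red": 0.3, "blue": 0.4, "yellow": 0.3},
    "C": {"blue": 0.4, "yellow": 0.3, "green": 0.3},
}

def classify(colour):
    # Ask each model: what is the probability that you emit this colour?
    # A model assigns probability 0.0 to colours it never emits.
    probs = {label: dist.get(colour, 0.0) for label, dist in models.items()}
    # The classifier is just the comparison of those probabilities.
    return max(probs, key=probs.get)

print(classify("green"))  # only model C can emit green -> "C"
print(classify("red"))    # model A emits red most often -> "A"
```

Notice that each model was trained (here, just written down) without ever seeing examples from the other classes, which is exactly the simplicity of generative models described above.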
Let's move on from discrete things and colored balls to continuous values, because that's what our features for speech recognition are. They're extracted from frames of speech, and so far where we've got with that is filter bank features: a vector of numbers, where each number is the energy in a filter in a band of frequencies for that frame of speech. So we need a model of continuous values, and the model we're going to choose is the Gaussian. I'm going to assume you know about Gaussians from last week's tutorial, but let's just have a very quick reminder. Let's do this in two dimensions. I'll just pick two of the filters in the filter bank and draw that two-dimensional space. Perhaps I'll pick the third filter and the fourth filter in the filter bank. Each of the points I'm going to draw is the pair of filter bank energies. It's a little feature vector. So each point is a little two-dimensional feature vector containing the energy in the third filter and the energy in the fourth filter. So: lots of data points. If I would like to describe the distribution of this data with a Gaussian, it's going to be a multivariate Gaussian. Its mean is going to be a vector of two dimensions, and its covariance matrix is going to be a two-by-two matrix. I'm going to have here a full covariance matrix, which means I could draw a Gaussian that is this shape on the data. We've made the assumption here that the data are distributed normally, and so that this parametric probability density function is a good representation of this data. So I can use the Gaussian to describe data. But how would we use the Gaussian as a generative model? 
Let's do that, but let's just do it in one dimension to make things a bit easier to draw. So here I've got my three models again, and by some means yet to be determined I've learned these models; they've come from somewhere. And these models are now Gaussians. So this is really what the models look like. Model A is this Gaussian: it has a particular mean and a particular standard deviation. Along comes an observation. These are univariate Gaussians; our feature vectors are one-dimensional feature vectors. So along comes a one-dimensional feature vector. It's just a number. And the question is: which of these models is most likely to have generated that number? Here's the number: 2.1. Now remember that a Gaussian can't compute a probability; that would involve integrating the area between two values. So for 2.1, all we can say is: what's the probability density at 2.1? So off we go. 2.1: this value. 2.1: this value. 2.1: this value. Compare those three. Clearly this one is the highest, and so we'll say this is an A. And that's how we'd use these three Gaussians as generative models. We'd ask each of them in turn: can you generate the value 2.1? Now, for a Gaussian the answer is always yes, because all values have non-zero probability density. So of course we can generate a 2.1. What's the probability density at 2.1? We just read that off the curve, because it's a parametric distribution, and compare those three probability densities. So we've done classification with a Gaussian. Let's just draw the three models on top of each other to make it even clearer. What's the probability of 2.1 being an A or a B or a C?
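The same comparison can be sketched in code. The means and standard deviations below are made up (the video doesn't give the actual model parameters); each model reports its probability density at 2.1, and we pick the largest.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Illustrative (mean, standard deviation) for each class; not the
# values drawn in the video.
models = {"A": (2.0, 0.5), "B": (4.0, 1.0), "C": (7.0, 0.8)}

x = 2.1
densities = {label: gaussian_pdf(x, m, s) for label, (m, s) in models.items()}
best = max(densities, key=densities.get)
print(best)  # model A's density at 2.1 is the highest here
```

Note that every density is non-zero (a Gaussian assigns some density everywhere), so the decision always comes down to comparing magnitudes, never to a hard yes/no.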
So now for the remainder of the class, which is all about feature engineering, and we'll start with the magnitude spectrum and the filter bank features that we've seen before in the last module. Quick recap. We extract frames from a speech waveform. This is in the time domain. We extract short-term analysis frames to avoid discontinuities at the edges. We apply the discrete Fourier transform and we get the magnitude spectrum. This is written on a logarithmic scale, so this is the log magnitude spectrum. And from that, we extract filter bank features. Filter bank features are found from a single frame of speech in the magnitude spectrum domain. We divide it into bands, spaced on a mel scale, and sum the energy in each band and write that into an element of the feature vector. We typically use triangular-shaped filters, spacing their centers and their widths on the mel scale. They get further and further apart and wider and wider as we go up in frequency in hertz. And the energy in each of those (for example, the energy summed in this one) is written into the first element of our feature vector, a multidimensional feature vector. Now let's make it really clear that the features in this feature vector will exhibit a lot of covariance. They are highly correlated with each other. And that will become very obvious when we look at this animation. So remember that each band of frequencies is going into the elements of this feature vector, extracted with these triangular filters. So when I play the animation, just look at the energy in adjacent frequency bands. Clearly, the energy in this band and the energy in this band are going up and down together. See that? When this one goes up, this one tends to go up. And when this one goes down, this one tends to go down. They're highly correlated. They co-vary because they're adjacent, and the spectral envelope is a smooth thing. 
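We can check that claim numerically. In the sketch below (pure Python, with made-up "envelopes" built from a few low-order cosines standing in for real speech spectra, since smoothness is the only property that matters here), we sum the energy in two adjacent bands across many random frames and measure their correlation:

```python
import random, math

random.seed(0)

def smooth_envelope(n_bins, n_terms=4):
    """A random smooth 'spectral envelope': a few low-order cosines with
    random weights. A stand-in for real speech spectra, which are also
    smooth functions of frequency."""
    weights = [random.gauss(0, 1) for _ in range(n_terms)]
    return [sum(w * math.cos(math.pi * k * i / n_bins)
                for k, w in enumerate(weights))
            for i in range(n_bins)]

def band_energy(env, lo, hi):
    return sum(env[lo:hi])

# Energies of two adjacent bands, collected over many random frames.
xs, ys = [], []
for _ in range(500):
    env = smooth_envelope(64)
    xs.append(band_energy(env, 8, 16))
    ys.append(band_energy(env, 16, 24))

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / math.sqrt(va * vb)

print(corr(xs, ys))  # clearly positive: adjacent bands co-vary
```

Because the envelopes are smooth, adjacent bands rise and fall together, and the correlation comes out well above zero, which is exactly the covariance that a diagonal-covariance Gaussian cannot capture.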
And so these features are highly correlated. And if we wanted to model this feature vector with a multivariate Gaussian, it would be important to have a full covariance matrix to capture that. So the filter bank energies themselves are perfectly good features, unless we want to model them with a diagonal covariance Gaussian, which is what we've decided to do. So we're going to do some feature engineering to get around that problem. So we're going to build up to some features called Mel Frequency Cepstral Coefficients. And MFCCs, as they're usually called, take inspiration from a number of different directions. And one strong way of motivating them is to actually go back and think about speech production again and to remember what we learned a little while ago about convolution. Knowing what we know about convolution, we're going to derive a form of analysis called cepstral analysis. Cepstral is an anagram of spectral: it's a transform of the spectrum. So here's a recap that convolution in the time domain, convolution of waveforms, is equivalent to addition in the log magnitude spectrum domain. So just put aside the idea of the filter bank for a moment and let's go right back to the time domain and start again. This is our idealized source. This is the impulse response of the vocal tract filter. And if we convolve those, that's what the star means, we'll get our speech signal in the time domain. And that's equivalent to transforming each of those into the spectral domain and plotting on a log scale. And then we see that their log magnitude spectra add together. So this becomes addition in the log magnitude spectrum domain. So convolution is a complicated operation and we might imagine perfectly reasonably that deconvolution is very hard. In other words, given that to go backwards and decompose it into source and filter in the time domain is hard, but we could imagine that undoing an addition is rather easier. And that's exactly what we're about to do. 
How do we get from the time domain to the frequency domain? We'll use the Fourier transform. The Fourier transform is just a series expansion. So to get from this time domain signal to this frequency domain signal, we did a transform that's a series expansion. And that series expansion had the effect of turning this axis from time in units of seconds to this axis in frequency, which has units of what? Hertz. But Hertz are just one over seconds. So the series expansion has the effect of changing the axes to be one over the original axis. So we start with something in the time domain, we end up with something in the one over time domain, or frequency domain. So that's a picture of speech production. But we don't get to see that when we're doing speech recognition. All we get is a speech signal from which we can compute the log magnitude spectrum. What would I like to get for doing automatic speech recognition? Well, I've said that fundamental frequency is not of interest. I would like the vocal tract frequency response. That's the most useful feature for doing speech recognition. But what I start with is this. So can I undo the summation? So that's how speech is produced, but we don't have access to that. We would like to do this. We would like to start from the thing we can easily compute with a Fourier transform from an analysis frame of speech, and decompose that into a sum of two parts: filter plus source. And then for speech recognition, we'll just discard the source. How might you solve this equation given only the thing on the left? Well, one obvious option is to use a source filter model. We could use a filter that's defined by its difference equation, and we could solve for the coefficients of that difference equation. And that will give us this part. And then we could just subtract that from the original signal and whatever's left must be this part here, which we might then call a remainder, or more formally call it the residual. 
And we'll assume that was the source. So that will be an explicit way of decomposing into source and filter, and we'd get both the source and the filter. But actually we don't care about the source for speech recognition. We just want the filter, so we can do something a little bit simpler. Fitting an explicit source filter model involves making very strong assumptions about the form of the filter. For example, if it's a resonant filter, it's all-pole, and the difference equation has a very particular form with a particular number of coefficients. We might not want to make such a strong assumption. And solving for that difference equation, solving for the coefficients given a frame of speech waveform, can be error prone. And it's actually not something we cover in this course. So I want to solve this apparently difficult to solve equation where we know the thing on the left, and we want to turn it into a sum of two things on the right. These two things have quite different looking properties. With respect to this axis here, which is frequency, this one's quite slowly varying, smooth and slowly varying. It's a slow function of frequency. With respect to the frequency axis, this one here, it's quite rapidly moving. It changes rapidly with respect to frequency. So we would like to decompose this into the slowly varying part and the rapidly varying part. And I mean slowly and rapidly varying with respect to this axis, the frequency axis. So I can't directly do that into these two parts, but I can write something more general down like this. I can say that a log magnitude spectrum of an analysis frame of speech equals something plus something, plus something, plus something, and so on, where we start off with very slowly varying parts and then slightly quicker varying all the way up to eventually very rapidly varying parts. Does that look familiar? I hope so. That's a series expansion, not unlike Fourier analysis. So it's a transform. 
We're going to transform the log magnitude spectrum into a summation of basis functions weighted by coefficients. Well, we could use the same basis functions as Fourier analysis, in other words, sinusoids with magnitude and phase or any other suitable set of basis functions. The only important thing is that the basis functions have to be orthogonal. So they have to be a series of orthogonal functions that don't correlate with each other. So go and revise the series expansion video if you need to remember what that is. In this particular case, we're doing a series expansion of the log magnitude spectrum. The most popular choice of basis functions is actually a series of cosines where we just need the magnitude of each. There's no phase, they're just exactly cosines. That suits the particular properties of the log magnitude spectrum. So we're going to write down this. This part here equals a sum of some constant function times some coefficient, a weight, plus some amount of this function, that's the lowest frequency cosine we can fit in there, times some weight, plus some amount of the next one, plus some amount of the next one, and so on for as long as we like. There's a set of orthogonal basis functions that's a cosine series. It starts with this one, which is just the offset, the zero frequency component, if you like, and then works its way up through the series. And so we'll be characterizing the log magnitude spectrum by these coefficients. This is a cosine transform. Lots and lots of textbooks give you a really, really useless and unhelpful analogy to try and understand what's happening here. They'll say the following. They say, let's pretend this is time and then do the Fourier transform, but we don't need to do that. A series expansion is a series expansion. There's nothing here that requires this label on this axis to be frequency, time, or anything else. You don't need to pretend that's a time axis. You don't need to pretend this is a Fourier transform. 
This is just a series expansion. You've got some complicated function and you're expressing it as a sum of simple functions weighted by coefficients. It's those coefficients that characterize this particular function that we're expanding. Now, how does this help us solve what we wanted to solve, which is just to write this thing, this magnitude spectrum, as a sum of two parts: a slowly varying part with respect to frequency that we'll say is the filter, and a rapidly varying part that we'll say is the source? Because here we've got some long series of parts. Well, we'll just work our way up through the series, and at some point say: that's as rapidly varying as we ever see the filter with respect to frequency. And we'll stop and draw a line, and everything after that we'll say is not the filter. So we'll just count up through the series, and at some point we'll stop, and we'll say the slow ones are the filter and the rapid ones are the source. So all we've got to do, in fact, is decide where to draw the line, and there's a very common choice of value there, and that's to keep the first 12 basis functions, counting this one as number one. (The constant one is special: it's just the energy, and we count that as zero.) So that's the first basis function, the second, the third, and we go up to the twelfth. And in other descriptions of cepstral analysis, particularly of the form used to extract MFCCs, you might see choices other than the cosine basis functions. You could use the Fourier series, for example. That's a detail that's not important. The important thing is conceptually this is a series expansion into a series of orthogonal basis functions. Exactly what functions you expand it into is less important. We won't get into an argument about that. We'll just say: cosine series. What is the output of that series expansion going to look like? Well, just like any other series expansion such as the Fourier transform, we'll plot those coefficients on an axis. This is frequency in Hertz. 
Frequency is the same as one over time. So that series expansion gives you a new axis, which is one over the previous axis. So one over one-over-time is time. So actually we're going to have something that's got a time axis, a time scale. This is going to be the size of the coefficient of each of the basis functions. So we're going to go for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 basis functions. And at that point we'll say these guys belong to the filter, and then everything else to the right we're actually going to discard, because it belongs to the source. But let's see what it might look like. Well, we just have some values. This thing is going to be hard to interpret: it's going to be the coefficients of the cosine series. But what we'll find, if we kept going well past 12, is that at some point there would be a coefficient at some high time where we get some spike again. Let's think about what that means on the magnitude spectrum. So this is the cosine series expansion. Now, this axis is time, but because this is a transform from frequency, we don't typically label it with time. We label it with an anagram of frequency, and people use the word quefrency. Don't ask me why. Its units are seconds. So this low quefrency coefficient is the slowly moving part. One of these, perhaps here, is some faster moving part; up here some faster moving part, perhaps this one. And eventually this one will be the one that moves at this rate. This is a cosine function that happens to just snap onto the harmonics, and it will just match the harmonics. So this one here is going to be the fundamental period, because this matches the harmonics at F0. For our purposes, we're going to stop here: we're going to throw away all of these as being the fine detail and retain the first 12, 1 to 12. So this truncation is what separates source and filter, and specifically it just retains the filter and discards the source. 
Truncation of a series expansion is actually a very well principled way to smooth a function. So essentially we just smooth the function on the left to remove the fine detail, the harmonics, and retain the detail up to some certain scale and our scale is up to 12 coefficients.
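Here's a numerical sketch of that idea, on toy data: the "envelope" is built from two slow cosines and the "harmonic ripple" from one fast cosine (chosen to lie exactly on the cosine basis, so the effect is exact). We take the cosine series expansion of a log-magnitude-spectrum-like function, keep only coefficients 0 to 12, and reconstruct: the ripple disappears and the smooth envelope survives.

```python
import math

N = 256  # number of spectrum bins

# Synthetic "log magnitude spectrum": smooth envelope (slow cosines)
# plus rapid harmonic ripple (one fast cosine). Toy data, not speech.
envelope = [1.5 * math.cos(math.pi * 1 * (i + 0.5) / N)
            + 0.8 * math.cos(math.pi * 3 * (i + 0.5) / N) for i in range(N)]
ripple = [0.5 * math.cos(math.pi * 60 * (i + 0.5) / N) for i in range(N)]
spectrum = [e + r for e, r in zip(envelope, ripple)]

def dct(x):
    """Orthonormal DCT-II: project onto a cosine basis (a series expansion).
    A naive O(N^2) implementation, for clarity rather than speed."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(xi * math.cos(math.pi * k * (i + 0.5) / n) for i, xi in enumerate(x))
        out.append((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * s)
    return out

def idct(c):
    """Inverse of dct(): weighted sum of the same cosine basis functions."""
    n = len(c)
    return [sum((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
                * ck * math.cos(math.pi * k * (i + 0.5) / n)
                for k, ck in enumerate(c))
            for i in range(n)]

coeffs = dct(spectrum)
truncated = coeffs[:13] + [0.0] * (N - 13)  # keep coefficients 0..12
smoothed = idct(truncated)

# Truncation removed the fast ripple and kept the smooth envelope:
err = max(abs(s - e) for s, e in zip(smoothed, envelope))
print(err)  # ~0: only the slow components survive truncation
```

The ripple lives entirely in basis function 60, well past the cut-off at 12, so zeroing the high coefficients removes it exactly; with real spectra the separation is approximate, but the principle is the same.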
Now, what we just saw there was the true cepstrum, which we got from the original log magnitude spectrum. That's going to inspire one of the processing steps that's coming now in the feature extraction pipeline for Mel Frequency Cepstral Coefficients. So let's get back on track with that. We've got the frequency domain, the magnitude spectrum. We've applied a filter bank to that, primarily to warp the scale to the mel scale, but we can also conveniently choose filter bandwidths that smooth away much of the evidence of F0. And we'll take the log of the output of those filters. So, when we implement this, normally what we have here is the linear magnitude spectrogram: apply the triangular filters and then take the log of their outputs. You can plot those against frequency and draw something that's essentially the spectral envelope of speech. We then take inspiration from the cepstral transform and do a series expansion of that. We'll use cosine basis functions (that's called cepstral analysis, and it will give us the cepstrum), and then we decide how many coefficients we'd like to retain. So we'll truncate that series, typically at coefficient number 12, and that will give us these wonderful coefficients called Mel Frequency Cepstral Coefficients. In other descriptions of MFCCs you might see things other than the log here, some other compressive function such as the cube root. That's a detail that's not conceptually important. 
What's important is that it's compressive. The log is the most well motivated because it comes from the true cepstrum, which we get from the log magnitude spectrum: it's the thing that turns multiplication into addition. This truncation here serves several purposes. It could be that our filter bank didn't entirely smooth away all the evidence of F0, so truncation will discard any remaining fine detail in its outputs. We'll get a very smooth spectral envelope by truncating the series. So it removes any remaining evidence of the source, just in case the filter bank didn't do it completely. That's number one. The second thing it does, which I'm going to explain in a moment and which we've alluded to, is that by expanding into a series of orthogonal basis functions, we find a set of coefficients that don't exhibit much covariance with each other. So we've removed covariance through this series expansion. And third, and just as interesting, we've got a place where we can control how many features we get. 12 is the most common choice, but you could vary that, and we get the 12 most important features: the ones that capture the detail in the spectral envelope up to a certain fineness. There's a well motivated way of controlling the dimensionality of our feature vector. So, did we remove covariance? Well, we could answer that question theoretically, which we'll try and do now. We could also answer that question empirically, by experiment.
We could do experiments where we use multivariate Gaussians with diagonal covariance, once with filter bank features and once with MFCCs, and compare how good each system is. In general, we'll find that MFCCs are better. The theoretical argument is that we expanded into a series of orthogonal basis functions. These are uncorrelated with each other, and that's the theoretical reason why a series expansion gives you a set of coefficients which don't have covariance, or at least have a lot less covariance than the original filter bank coefficients. So finally, we've done cepstral analysis and we've got MFCCs. MFCCs are what you're using in the assignment for the digit recognizer. You're using 12 MFCCs plus one other feature. The other feature we use is just the energy of the frame, and it's very similar to the zeroth cepstral coefficient; it's just computed from the raw frame energy. So you've actually got 13 features there: energy plus 12 MFCCs. But when you go and look in the configurations for the assignment, and you go and look at your trained models, you'll find that you don't have 13. You've actually got 39, and it's part of the assignment to figure out how we got from 13- to 39-dimensional feature vectors. And now it's obvious why we don't want to model covariance in the 39-dimensional feature space: we'd have 39 by 39 covariance matrices, and we'd have to estimate all of those covariances from data. So coming next, we've got the Gaussian. That's going to do all of the work. We've seen that the Gaussian can be seen as a generative model, but a single Gaussian can just generate a single data point at a time. But speech is not that.
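The empirical side of that question can be sketched with simulated data. Here the correlated "log filter bank outputs" are simulated as an AR(1) process across channels (an assumption standing in for overlapping filters and a smooth envelope, not real speech), and the cosine expansion is applied to see how much of the correlation it removes.

```python
import numpy as np

rng = np.random.default_rng(42)
n_channels, n_frames = 26, 5000

# Simulated "log filter bank outputs": neighbouring channels strongly
# correlated (AR(1) across channels; purely illustrative, not real speech).
x = np.zeros((n_frames, n_channels))
x[:, 0] = rng.standard_normal(n_frames)
for c in range(1, n_channels):
    x[:, c] = 0.9 * x[:, c - 1] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n_frames)

# Orthonormal DCT-II basis: the cosine series expansion used for MFCCs.
n = np.arange(n_channels)
basis = np.sqrt(2.0 / n_channels) * np.cos(
    np.pi * n[:, None] * (n[None, :] + 0.5) / n_channels)
basis[0] /= np.sqrt(2)
y = x @ basis.T              # cepstral-style coefficients

def mean_abs_offdiag(corr):
    # Average absolute off-diagonal correlation.
    m = np.abs(corr.copy())
    np.fill_diagonal(m, 0.0)
    return m.sum() / (m.size - len(m))

raw = mean_abs_offdiag(np.corrcoef(x, rowvar=False))
dct = mean_abs_offdiag(np.corrcoef(y, rowvar=False))
print(raw, dct)  # raw is large; the cosine expansion removes most of it
```

The residual correlation is not exactly zero, which matches the lecture's hedge: the expansion gives coefficients with a lot less covariance, not none at all.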
Speech is a sequence of observations, a sequence of feature vectors. So we need a multivariate Gaussian that can generate one observation, one feature vector, for one frame of speech. And we need to put those Gaussians into a model that generates a sequence of observations. That's going to be the Hidden Markov Model, and that's coming up next. That's the model for speech recognition in this course. We've then got to think: well, what are we going to model with each Hidden Markov Model?
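A diagonal-covariance Gaussian generating one feature vector can be sketched directly; the mean and variance values below are arbitrary placeholders, and 39 is the feature dimensionality from the assignment. It also makes the parameter-count argument concrete: a full covariance matrix would need far more parameters than the diagonal one.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 39  # dimensionality of the assignment's feature vectors

# A diagonal-covariance Gaussian: one mean and one variance per dimension
# (arbitrary placeholder values), instead of a full dim x dim covariance.
mean = rng.standard_normal(dim)
var = rng.uniform(0.5, 2.0, dim)

# "Generating" from the model: draw one feature vector for one frame.
observation = mean + np.sqrt(var) * rng.standard_normal(dim)

full_cov_params = dim * (dim + 1) // 2   # symmetric 39 x 39 covariance
diag_cov_params = dim                    # just the diagonal
print(full_cov_params, diag_cov_params)  # -> 780 39
```

One draw gives exactly one observation; generating a whole sequence of frames is what the Hidden Markov Model, coming next, will add around these Gaussians.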