Intro to Generative Models

Introduces the concept of generative modelling.

00:00–02:10 This module has two parts. In the first part, we're going to start building the Hidden Markov Model, but only get as far as deciding to use Gaussian probability density functions as the heart of the model. And that will make us pause for thought and think about the features that we need to use with Gaussian probability density functions. So in this first part, I'm going to introduce a rather challenging concept, and that's the idea of a generative model. And we'll choose the Gaussian as our generative model for the features we have for speech recognition. The main conclusion from that part will be that we don't want to model covariance. That is, we're going to have a multivariate Gaussian. That's a Gaussian in a multidimensional feature space: the space of feature vectors extracted from each frame of speech. And we would like to assume that there's no covariance between any pair of elements in that feature vector. For the features that we've seen so far, filter bank features, that's not going to be true. So those features are not going to be directly suitable for modelling with a Gaussian that has no covariance parameters. We often call such a Gaussian diagonal covariance because only the diagonal entries of its covariance matrix have non-zero values. So we'll conclude that we want diagonal covariance Gaussians. We'll realize that filter bank features are not suitable for that. And we need to do what's called feature engineering to those features, some further steps of processing to remove that covariance. And that will result in the final set of features that we're going to use in this course, which are called Mel Frequency Cepstral Coefficients. And to get to them, we'll need to understand what the cepstrum is. So let's orient ourselves in the course as a whole. We're on a journey towards HMMs. We've talked about pattern matching, and we've seen that already in the form of dynamic time warping. There was no probability there.
We were comparing a template, an exemplar, with an unknown and measuring distance. And the distance measure was a Euclidean distance for one feature vector in the exemplar compared to one feature vector in the unknown. The Euclidean distance is really what's at the core of the Gaussian probability density function.
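That link between Euclidean distance and the Gaussian can be made concrete. As a minimal sketch (not from the lecture), the log density of a Gaussian with identity covariance is, up to a constant, just minus half the squared Euclidean distance to the mean:

```python
import math

def log_gaussian_spherical(x, mu):
    """Log density of a D-dimensional Gaussian with identity covariance.

    log N(x; mu, I) = -(D/2) * log(2*pi) - (1/2) * ||x - mu||^2
    """
    d = len(x)
    sq_euclidean = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * sq_euclidean

# The log likelihood differs from the squared Euclidean distance only by
# a scale and a constant, so ranking candidate means by likelihood gives
# the same ordering as ranking them by distance (smaller distance, higher
# likelihood).
x = [1.0, 2.0]
closer = [1.1, 2.1]    # small distance to x
further = [3.0, 0.0]   # large distance to x
assert log_gaussian_spherical(x, closer) > log_gaussian_spherical(x, further)
```

So choosing the nearest exemplar by Euclidean distance and choosing the most likely mean under such a Gaussian give the same answer; what the Gaussian adds is the machinery to model variance as well.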
02:10–03:03 Our conclusion from doing simple pattern matching was that it fails to capture the variability of speech, the natural variability around the mean. A single exemplar is like having just the mean, but we want also the variability around that. That's why we're moving away from exemplars to probabilistic models. So to bridge between pattern matching and probabilistic generative modelling, we're going to have to think again about features, because there's going to be an interaction between the choice of features and the choice of model. In particular, we're going to choose a model that makes assumptions about the features, and then we'll need to do things to the features to make those assumptions as true as possible. We've already talked enough about probability and about the Gaussian probability density function in last week's tutorials. We've also gone over why we don't want to model covariance. We also looked at human hearing, and we decided to take a little bit of inspiration from how the cochlea operates.
03:03–03:20 The core reason we don't want to model covariance is it massively increases the number of parameters in the Gaussian, and that would require a lot more training data to get good estimates of all those parameters. And we're not going to explicitly model human hearing. We're going to use it as a kind of inspiration for some of the things we do in feature extraction.
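To see the scale of that increase, here is a hypothetical parameter count for a single D-dimensional Gaussian. The 39-dimensional example is an assumption on my part (a typical size for an MFCC-based feature vector), not a figure from the lecture:

```python
def gaussian_param_count(dim, diagonal):
    """Number of free parameters in a dim-dimensional Gaussian.

    Mean vector: dim parameters.
    Full covariance: a symmetric dim x dim matrix -> dim*(dim+1)/2.
    Diagonal covariance: just dim variances.
    """
    mean = dim
    cov = dim if diagonal else dim * (dim + 1) // 2
    return mean + cov

# For a (hypothetical) 39-dimensional feature vector:
print(gaussian_param_count(39, diagonal=True))   # 78
print(gaussian_param_count(39, diagonal=False))  # 819
```

The full-covariance count grows quadratically with the dimension, while the diagonal count grows only linearly, which is why the diagonal assumption saves so much training data.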
03:20–03:38 So without repeating too much material from last week, let's have a very quick recap of some concepts that we're going to need. We're modelling features such as the output of a filter in the filter bank, and that's a continuously valued thing. So we're going to need to deal with continuous variables, but also we're going to need to deal with discrete variables.
03:38–03:42 At some point, we're going to describe speech as a sequence of words. Words are discrete.
03:42–03:56 We can count the number of types, and maybe we're going to decompose the word into a sequence of phonemes, which is also a countable thing, a discrete thing. So we're going to need to deal with both discrete and continuous variables, and these are going to be random variables.
03:56–05:03 They're going to have a distribution over a set of possible values or over a range of possible values. An example of a discrete variable, then, would be the word, and in the assignment, that will be one of 10 possible values. An example of a continuous variable will be the output from one of the filters in the filter bank, or indeed the vector of all the outputs stacked together. That's just a vector space, and that is also a continuous variable in a multidimensional space. And we'll need different types of probability distribution to represent them. But apart from that, everything else that we've learned about probability theory, and in particular this idea of joint distributions, conditional distributions, and the all-important Bayes formula, the theorem that tells us how to connect all of these things, applies to both discrete and continuous variables. It's true for all of them. And in fact, eventually, we're going to write down Bayes formula with a mixture of discrete and continuous variables in it, words and acoustic features. But that won't be a problem, because there's nothing in all of what we learned about Bayes that tells us it only works for one or the other. It works for both discrete and continuous variables.
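For reference, the mixed discrete-and-continuous Bayes formula alluded to here can be sketched as follows, using the conventional (assumed, not from this transcript) ASR notation of a discrete word sequence W and a continuous acoustic observation sequence O:

```latex
% Bayes' rule mixing a discrete variable (word sequence W)
% with a continuous one (acoustic observations O).
% P(.) denotes a probability mass function (W is discrete, countable);
% p(.) denotes a probability density function (O is continuous).
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{p(O \mid W)\, P(W)}{p(O)}
        = \arg\max_{W} p(O \mid W)\, P(W)
```

The denominator p(O) is the same for every candidate W, so it can be dropped when we only want the most likely word sequence.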
05:03–05:06 So I'm not going to go over that material again. I'm going to assume that we know that.
05:06–05:20 So let's proceed. We can use probability distributions to describe data. That's kind of a summary of the data. And when we do that, we're making an assumption about how those data are distributed.
05:20–05:36 For example, if we describe data using a Gaussian distribution, we're assuming that they have a normal distribution. And we might even make some stronger assumptions: that they have no covariance. But before choosing the Gaussian, let's stay a little abstract for a while.
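As a sketch of what "describing data with a no-covariance (diagonal) Gaussian" amounts to in practice: under that assumption, fitting the model reduces to estimating a mean and a variance independently for each dimension. The numbers below are made-up stand-ins, not real filter bank outputs:

```python
def fit_diagonal_gaussian(data):
    """Fit a diagonal-covariance Gaussian to a list of feature vectors.

    Assuming no covariance between dimensions, 'training' is just a
    per-dimension mean and a per-dimension (maximum-likelihood, i.e.
    divide-by-n) variance. No cross-dimension terms are estimated.
    """
    n = len(data)
    dim = len(data[0])
    means = [sum(x[d] for x in data) / n for d in range(dim)]
    variances = [sum((x[d] - means[d]) ** 2 for x in data) / n
                 for d in range(dim)]
    return means, variances

# Toy 2-dimensional data (hypothetical stand-ins for filter outputs):
data = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
means, variances = fit_diagonal_gaussian(data)
print(means)  # [2.0, 12.0]
```

Note what is lost: in this toy data the two dimensions rise and fall together, but a diagonal Gaussian has nowhere to record that. That is exactly why the features themselves must be engineered to have as little covariance as possible.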
05:36–05:55 We're not going to choose any particular model. We're going to introduce a concept, a difficult concept. This will be difficult to understand the first time you encounter it. It's a different way of thinking about what we do when we have a model. What is a model? What is a model for? What can it do? We're going to actually build the simplest possible form of model.
05:55–06:32 So a generative model is actually as simple as it gets, because a generative model can only do one thing. It can generate examples from its own class. It's a difficult concept to learn, but it's necessary because it allows us to build the simplest form of model, the generative model. So here's the conceptual leap. Here are three models of three different classes, A, B, and C. We're keeping this abstract. I'm not saying anything about what form of model we're using. So these are black boxes. And the only thing that a model can do, model A here for example, is emit observations. Let's make it do that.
06:32–06:37 The observations here are balls, and they just have one feature, and that's their colour.
06:37–06:54 So that just emitted a blue ball. And that's all this model can do. We can randomly emit observations. We can generate from the model. So let's generate. There's another one. So we're generating observations. The observations are data points, and so on. And we keep generating.
06:54–07:11 So this is a random generator, and it generates from its distribution. So inside model A is a distribution over all the different colours of balls that are possible. And it randomly chooses from that according to that distribution and emits an observation.
07:11–07:26 So, keeping things abstract and not making any statements about what form the model has (it's not got any particular probability distribution specified), what are we going to say about model A? Well, we could imagine that it essentially contains lots and lots of data from class A, like that.
07:26–08:19 And we might have other models too, for other classes. And we could imagine for now that they just contain a lot of examples of their classes, ready to be generated, ready to be emitted. That's model B and model C. And you can see that it seems likely that model A is going to emit a different distribution of values, of colours, than model B. And model B is going to emit a different distribution to model C. So let's see what we could do with something like that. So we've got three generative models. All they can do is emit observations. Already, keeping things very abstract and without saying anything about what type of model we're using, we can see that this simple set of three models, A, B and C, one for each of the classes, can do classification. Even though all the models can do is emit observations, they don't make decisions, but we can use them to make decisions, to classify.
08:20–08:34 So let's see how a set of generative models can together be used to classify new samples, test samples. So here comes a data point. There it is. It's a ball, and its feature is green.
08:34–08:39 And I would like to answer the question: which of these models is most likely to have generated it?
08:39–08:44 And if I can find the most likely model, I'll label this data point with the class of that model.
08:44–09:17 So we're keeping the models abstract. We're pretending that they just contain big bags of balls. So that's what our models look like inside. So we try and generate from the model, and we make the model generate this observation. One way of thinking about that, if you want to think in a frequentist way, a counting way, is to emit a batch of observations from each model and count how many of them were green. That's the frequentist view. The Bayesian view is just to say: what's the fraction, what's the probability, of emissions of observations from a model that are green?
09:17–09:42 And we can look at model A, and it's pretty obvious that it's never going to emit a green ball. So the probability of this green ball having been emitted from model A is a probability of zero. I'll write it with one decimal place for precision. Look at model B. It's also clear just by looking that model B never emits green balls either. And the probability of that emitting a green is 0.0.
09:42–10:14 Now let's look at model C. You can see that in model C, it emits blue balls and yellow balls and green balls, let's say with approximately equal proportions. So there's about a 0.3 probability of model C emitting a green ball. Clearly, this is the largest probability of the three. We'll classify this as being of class C. Notice that each model could compute only its own probability. It didn't need to know anything about the other models to do that.
10:14–11:28 And each model on its own made no decisions. It merely placed a probability on this observed green ball. We then made the decision by comparing the probabilities. So the classifier is the comparison of the probabilities of these models emitting this observation. That's what makes these models simple: they don't need to know about each other. They're all quite separated. And in particular, when we learn these models from data, we'll be able to learn them from just data labelled with their class. They won't need to see counterexamples. They won't need to see examples from other classes. And that's the simplicity of the generative model. Other models are available, models that might be called discriminative models, that know about where the boundary is between, for example, class A and class B, and can draw that decision boundary and directly make the decision as to whether an observation falls on the A side of that line or the B side of that line, of that classification boundary. But these generative models do not do that. We have to form a classifier by comparing their probabilities. Let's classify another observation. So along comes an observation, and our job is to label it as being most likely class A or class B or class C.
11:28–12:50 We just do the same operation again. We ask model A: model A, what is the probability that you generate a red ball? We can see just visually for model A, maybe it emits red balls about two thirds of the time. So model A says the probability of me emitting a red ball is about 0.6. Ask model B: model B, what's the probability that you emitted a red ball? Well, we should look, and model B emits red balls and yellow balls and blue balls. They're in about equal proportion. So it says, I can emit red balls, and I can do it about a third of the time. Model C, what's the probability that you emitted a red ball? Well, it never emits a red ball. Zero. How do we make the classification decision? We compare these three probabilities, we pick the largest, and we label this test sample with the class of the highest probability model, which is an A. So this is labelled as A. So we've done classification. So you get the idea. Along comes a sample. You ask each of the models in turn: what's the probability that this is a sample from you, from your model, from your internal distribution, whatever that might be? We compute those three probabilities, pick the highest, and label the test sample with that class. So as simple as generative models are, they can do classification simply by comparing probabilities.
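The whole walkthrough can be sketched in a few lines of Python. The bag contents below are invented to roughly match the proportions described in the lecture; the point is the structure: each model scores the observation independently, and the classifier is just the comparison (the argmax) over those scores.

```python
from collections import Counter

# Hypothetical contents of the three 'bags of balls'. Model A mostly
# emits red; B emits red/yellow/blue equally; C emits blue/yellow/green.
models = {
    "A": ["red"] * 6 + ["blue"] * 2 + ["yellow"] * 2,
    "B": ["red"] * 3 + ["yellow"] * 3 + ["blue"] * 3,
    "C": ["blue"] * 3 + ["yellow"] * 3 + ["green"] * 3,
}

def emission_probability(model, colour):
    """P(colour | model): the fraction of this model's bag with that colour.

    Each model computes only its own probability; it knows nothing
    about the other models.
    """
    counts = Counter(models[model])
    return counts[colour] / sum(counts.values())

def classify(colour):
    """Label an observation with the class of the most likely model."""
    return max(models, key=lambda m: emission_probability(m, colour))

print(classify("green"))  # 'C': only model C ever emits green
print(classify("red"))    # 'A': model A emits red most often
```

Note that no model ever sees counterexamples from the other classes, and no decision boundary is ever drawn; the decision emerges only from comparing the three separately computed probabilities.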
