Concept of generative models

One of the biggest challenges in this course is to think in terms of generative models. The concept was introduced in Module 7, so this is a recap.

This video just has a plain transcript, not time-aligned to the video.
So we're going to go from dynamic time warping to the hidden Markov model, and the conceptual leap we're going to have to make in today's lecture is to think not in terms of templates or exemplars: things that look like the thing we're trying to recognise. If we're trying to recognise a word, the template is also a word. We're going to get rid of this template and replace it with a statistical model. The statistical model is going to have a very particular form, which is what we call a generative model. We're going to have to understand why on earth we would try and make a model that can generate things, if what we're really trying to do is classify things. We'll be working in a framework called the generative model framework, and that's the conceptual leap in today's lecture. Once you've got that, then we've got the main point of today's lecture: the hidden Markov model, which is a simple, mathematically elegant and powerful form of generative model. It's nice to work with mathematically and it's computationally efficient, so we're going to develop this generative model, the hidden Markov model.
So let's go straight into the concept. It's going to seem a bit strange, coming straight out of the blue, why we would do this, but we'll see as we go through today's lecture that it's a nice way of thinking about modelling things and then using those models to do things. The thing we're going to do with them is classification, in speech recognition. We could imagine doing other things with these models. For example, we could imagine synthesising speech from these models, and theoretically this model will be able to synthesise speech. If you take next semester's course on speech synthesis, you'll see that by developing the model a bit further we can indeed synthesise speech from these hidden Markov models: we can use them truly as generative models. In this course, we're going to describe these generative models and then build a classifier out of a set of such generative models. So it's a bit of a conceptual leap that we need to make.
So here is a very abstract picture of some generative models. Imagine we've got a very simple classification problem. We've got just two classes, a Class A and a Class B. Maybe they're just two words, 'yes' and 'no', and we're trying to build a very simple speech recognition system: when we speak into it, it tells us whether the person said 'yes' or said 'no'. It always outputs one of those two answers, whatever you say.

So we're going to need two models, one for each of those two classes: a model of Class A, maybe that's the word 'yes', and a model of Class B, maybe that's the word 'no', and we're going to build a classifier like this. The model of 'yes' is going to be a generative model, and it's a model that generates observations. 'Observations' means that's what we see coming out of the model: we observe it, so it's an observation. The observations are always going to be in the domain of feature vectors, or sequences of feature vectors. So for speech they're going to be MFCCs, or sequences of MFCCs. Not actual speech waveforms.
Just these feature vectors. The model of A is going to be really good at generating examples of the word 'yes', and it will randomly generate examples of the word 'yes'. So imagine it's got a button on it, and every time I press the button it spits out an example of the word 'yes', which in this case is going to be a sequence of MFCCs: it will have some duration, and each of these MFCCs describes the spectral envelope as we go through this word 'yes'. Press the button again, and it spits out another example of the word 'yes'. Each time it does that, the duration might be slightly different, within the natural range of durations for this word, and the spectral envelope will naturally vary around the average for the word 'yes'. So it's already doing something more powerful than dynamic time warping: every time we press the button, we don't just get the average, we don't get the same thing every time; we randomly generate from this model. So this model captures not just the average way of saying the word 'yes', but also the allowed variation around that, in terms of duration and in terms of spectral envelope. The model has a concept of the mean (the average) and a concept of the variance (or the standard deviation) around that average, and it's going to learn those from data; you don't have to type them in.
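As a minimal sketch of the "press the button" idea, here is what random generation could look like if a single diagonal Gaussian (just a mean and a variance per dimension) stood in for the whole model; the function and variable names are illustrative only, not from any toolkit, and a real model would generate a whole sequence of MFCC vectors rather than one vector:

```python
import numpy as np

rng = np.random.default_rng()

def press_the_button(mean, var):
    """Generate one random 'example' from a diagonal Gaussian model:
    each press gives something near the mean, but never exactly the
    same thing twice."""
    return rng.normal(loc=mean, scale=np.sqrt(var))
```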
So how would we then build a classifier from such generative models? Well, we'll learn a generative model of each class, and these will be learned separately and independently. For the generative model of the word 'yes', all we need to train that model is lots of examples of the word 'yes'. It doesn't know anything about any other words. It's not deliberately bad at generating examples of the word 'no'; it has just never seen any, so it will probably not generate them very often: only at the extremes of its variation, a long way from the average 'yes', might it end up with something that sounds a bit like 'no'. Likewise, Model B will only ever be trained on examples of its own class, the word 'no', and it will learn to be a good generator of the word 'no', which means it will probably not be very good at generating the word 'yes'. So we can already see some power in this idea of generative models, because we can train these models just from positive training examples. They're not learning to discriminate between two classes; they're just learning to be a model of the distribution of one particular class.
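To make "training from positive examples only" concrete, here is a minimal sketch under a big simplifying assumption: each training token has been reduced to a single fixed-length feature vector, and the model is just a diagonal Gaussian. The name fit_gaussian is a hypothetical choice for illustration:

```python
import numpy as np

def fit_gaussian(examples):
    """Fit a diagonal-covariance Gaussian to positive examples of ONE class.

    examples: array of shape (num_examples, num_features), e.g. one
    fixed-length feature vector per recorded token of the word.
    Only the mean and variance are kept; the individual training
    examples are then thrown away.
    """
    mean = examples.mean(axis=0)
    var = examples.var(axis=0) + 1e-6  # small floor so no dimension gets zero variance
    return mean, var

# Each model only ever sees its own class's data, e.g.
# model_yes = fit_gaussian(yes_examples)
# model_no  = fit_gaussian(no_examples)
```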
So how would we build a classifier from such a setup? It's going to be extremely easy. Let's take something that we've got to classify. Here's a sequence of MFCCs. It's a word, and we don't know whether it's 'yes' or 'no'; we've got to decide: is it more likely that this is a 'yes' or a 'no'?

So we go to Model A and we randomly generate from it. There are two ways of thinking about that random generation. One is that we'll press the random-generate button many, many times, over and over again, generating lots and lots of examples, and we'll count how many of them look like this unknown thing, and use that as a measure of how 'yes'-like this unknown thing is. So you can think of it in this frequentist way: we'll sample millions of times from Model A and count, out of all those millions, how many look pretty much like the thing we're trying to classify.

There's actually a better way of thinking about that, which we'll see when we come on to using a Gaussian in the hidden Markov model to generate. We can actually force Model A to generate precisely this sequence, and in doing so it can calculate the probability of generating that sequence: it can put a probability on it. You can think of that as the proportion of times, when pressing this button, that we would actually generate the thing we're trying to classify. So we press the button on Model A lots and lots of times, and some of those times we end up matching the thing we're trying to classify. We'll do the same for Model B, and maybe for Model B we match much more often, so we'll say this is a Class B: it's the word 'no'. But we don't actually have to do those millions of generations; we can just directly compute the probability, the proportion of times we would have generated it. This will become clear as we develop this model.
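Written as a formula (the notation $\lambda_w$ for "the model of word $w$" is shorthand borrowed from the standard HMM literature, not something used in the video): for an unknown observation sequence $\mathbf{O} = (\mathbf{o}_1, \dots, \mathbf{o}_T)$ of feature vectors, the classifier simply picks

```latex
\hat{w} = \operatorname*{arg\,max}_{w \in \{\text{yes},\,\text{no}\}} P(\mathbf{O} \mid \lambda_w)
```

that is, the word whose generative model assigns the highest probability to having generated $\mathbf{O}$.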
Okay, so that's the generative model framework. It's actually a very common framework in machine learning. We might use it when we really do want to generate new examples, such as in speech synthesis, but more commonly we might use it when we want to classify between things: we'll build generative models for every one of the classes that we're trying to identify, and then we just let the models fight over the test examples. Whichever one is the best at generating a test example, whichever is most likely to have generated it, wins, and we label the test example with that class. So we need this ensemble of models, all competing over who is the best at generating this particular unseen token, and whoever is the best, that's the label.

So individual models are not classifiers at all. All an individual model can tell us is: what's the probability that this unseen test token was generated by this model? In other words, how close is it to my average? How much does it deviate? Is this a likely-sounding 'yes', or is this a pretty unlikely-sounding 'yes'? So they're just going to be probability distributions, and all they can tell us is whether something is like the sort of things they would normally generate, or unlike those things. To do classification, we need multiple models, and we just let them compete. What they'll compete on is the probabilities.
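Here is a minimal sketch of that competition, under the same simplifying assumptions as before (one fixed-length feature vector per token, one diagonal Gaussian per class, illustrative names only); in a real HMM system the score would be the likelihood of a whole sequence of MFCC vectors, not of a single vector:

```python
import numpy as np

def log_likelihood(x, mean, var):
    """Log probability density of feature vector x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """models: dict mapping a class label to its (mean, var) parameters.
    Every generative model scores x; the one most likely to have
    generated it wins, and its label is returned."""
    scores = {label: log_likelihood(x, mean, var)
              for label, (mean, var) in models.items()}
    return max(scores, key=scores.get)

# e.g. classify(unknown_token, {"yes": model_yes, "no": model_no})
```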
There are other frameworks for building classifiers. We could actually solve the classification problem more directly: we could build models that explicitly discriminate between 'yes' and 'no', producing an output saying it's a 'yes' or a 'no', trained on both positive and negative examples. But hidden Markov models are generally not like that. In general, the main way of training them is as generative models, which gives us a much simpler mathematical framework, and you can already see it might be more practical, because each model only needs to see positive examples rather than all of the training data. So we're going to build a generative model, and it's going to have the properties we want to make it a good classifier when combined with generative models for all the other classes.
So all it needs to do is be able to generate anything. Take any sequence, here's some sequence of MFCCs again, the word that we'd like to classify; it's unknown. Our generative model has to be able to generate that sequence, however far it is from the model's average. It can't fail: it must be able to assign a non-zero probability to any sequence. But if the sequence is like the sort of things it was trained on, then it should give a high probability, a high score, and if it's very unlike the things it was trained on, then it should give a low score, a low probability.

So we're effectively building something that compares unseen things with labelled examples in the training corpus. But those training examples in the corpus are not stored and used for direct comparison; they're distilled into a probabilistic model, and that probabilistic model then gives the score. So we're going to generalise: we're going to learn from a lot of training examples and distil them down into a small model with a small number of parameters. And that model can then say, for every unseen thing, how far away it is from the average of the things that we saw during training. If it's very close, then there's a high probability that it's the same class; if it's very far away, then there's a low probability of it being the same class.
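A side note, not stated in the video but consistent with the Gaussian that the lecture hints is coming: one reason a Gaussian is a convenient building block here is that its density (written below for the diagonal-covariance case) is strictly positive everywhere, so it can assign a non-zero probability density to any observation, with a score that decays smoothly the further the observation lies from the mean:

```latex
p(\mathbf{o}) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}}
  \exp\!\left( -\frac{(o_d - \mu_d)^2}{2\sigma_d^2} \right) > 0
  \quad \text{for every } \mathbf{o} \in \mathbb{R}^D .
```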
