Intro to Generative Models

Introduces the concept of generative modelling.

This video just has a plain transcript, not time-aligned to the video.
THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS
This module has two parts.
In the first part, we're going to start building the hidden Markov model, but only get as far as deciding to use Gaussian probability density functions as the heart of the model
and that will make us pause for thought and think about the features that we need to use with Gaussian probability density functions.
So in this first part, we're going to introduce a rather challenging concept.
And that's the idea of a generative model, and we'll choose the Gaussian as our generative model for the features we use in speech recognition.
The main conclusion from that part will be that we don't want to model covariance
that is, we're going to have a multivariate Gaussian - that's a Gaussian in a feature space that's multi-dimensional: feature vectors extracted from each frame of speech.
We would like to assume that there's no covariance between any pair of elements in the feature vector
for the features that we've seen so far (filterbank features) that's not going to be true.
So those features are not going to be directly suitable for modelling with a Gaussian that has no covariance parameters.
We call such a Gaussian "Diagonal Covariance" because only the diagonal entries of its covariance matrix have non-zero values.
We'll conclude that we want diagonal covariance Gaussians.
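To make the idea of a diagonal covariance Gaussian concrete, here's a minimal sketch (mine, not from the lecture): with no covariance between dimensions, the multivariate density factorises into a product of independent univariate Gaussians, one per element of the feature vector.

```python
import math

def diagonal_gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian: because there is no
    covariance between dimensions, the multivariate density is just the
    product of independent univariate Gaussians, one per feature dimension."""
    p = 1.0
    for xi, mi, vi in zip(x, mean, var):
        p *= math.exp(-((xi - mi) ** 2) / (2 * vi)) / math.sqrt(2 * math.pi * vi)
    return p

# Evaluated at the mean of a 2-D standard Gaussian: (1/sqrt(2*pi))^2 = 1/(2*pi)
print(diagonal_gaussian_pdf([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))  # ≈ 0.1592
```

The function and variable names here are just illustrative; the point is that dropping covariance turns one hard multivariate density into a product of easy one-dimensional ones.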
We'll realise that filterbank features are not suitable for that.
And we need to do what's called 'feature engineering' to those features: some further steps of processing to remove that covariance
that will result in the final set of features that we're going to use in this course, which are called Mel Frequency Cepstral Coefficients (MFCCs).
To get to them, we'll need to understand what the cepstrum is.
Let's orient ourselves in the course as a whole.
We're on a journey towards HMMs.
We've talked about pattern matching.
We've seen that already in the form of dynamic time warping
there was no probability there
we were comparing a template (an exemplar) with an unknown and measuring distance.
The distance measure was a Euclidean distance for one feature vector in the exemplar compared to one feature vector in the unknown.
The Euclidean distance is really what's at the core of the Gaussian probability density function
our conclusion from doing simple pattern matching was that it fails to capture the variability of speech: the natural variability around the mean
A single exemplar is like having just the mean.
But we want also the variability around that.
That's why we're moving away from exemplars to probabilistic models.
So to bridge between pattern matching and probabilistic generative modelling, we're going to have to think again about features because there's going to be an interaction between the choice of features and choice of model.
In particular, we're going to choose a model that makes assumptions about the features.
And then we need to do things to the features to make those assumptions as true as possible.
We've already talked enough about probability and about the Gaussian probability density function in last week's tutorials.
We've also covered why we don't want to model covariance.
We also looked at human hearing, and we decided to take a little bit of inspiration from how the cochlea operates.
The core reason we don't want to model covariance is that it massively increases the number of parameters in the Gaussian, and that would require a lot more training data to get good estimates of all those parameters
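As an illustration of that parameter growth (a sketch; the 39-dimensional feature vector is just an example of a typical size, not something fixed by the lecture): a full covariance matrix is symmetric and so needs D(D+1)/2 parameters, whereas a diagonal one needs only D variances.

```python
def gaussian_param_count(D, diagonal):
    """Number of parameters in a D-dimensional Gaussian:
    D for the mean, plus either D variances (diagonal covariance)
    or D*(D+1)//2 free entries of a symmetric full covariance matrix."""
    mean = D
    cov = D if diagonal else D * (D + 1) // 2
    return mean + cov

# For an illustrative 39-dimensional feature vector:
print(gaussian_param_count(39, diagonal=True))   # 39 + 39  = 78
print(gaussian_param_count(39, diagonal=False))  # 39 + 780 = 819
```

The full-covariance model needs roughly ten times as many parameters here, which is why it demands so much more training data for good estimates.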
We're not going to explicitly model human hearing.
We're going to use it as a kind of inspiration for some of the things we're doing in feature extraction.
So without repeating too much material from last week, let's have a very quick recap of some concepts that we're going to need.
We're modelling features such as the output of a filter in the filterbank.
That's a continuously-valued thing.
So we're going to need to deal with continuous variables, but also we're going to need to deal with discrete variables.
At some point, we're going to decide speech is a sequence of words
words are discrete: we can count the number of types
maybe we're going to decompose the word into a sequence of phonemes
that's also a countable thing: a discrete thing.
We're going to need to deal with both discrete and continuous variables, and they are going to be random variables.
They're going to take a distribution over a set of possible values or over a range of possible values.
An example of a discrete variable, then, will be the word, and in the assignment that will be one of the 10 possible values.
An example of a continuous variable will be the output from one of the filters in the filterbank, or indeed, the vector of all the outputs stacked together.
That's just a vector space, and that is also a continuous variable in a multi-dimensional space
we need different types of probability distribution to represent them.
But apart from that, everything else that we've learned about probability theory and in particular this idea of joint distributions, conditional distributions, and the all-important Bayes' formula - the theorem that tells us how to connect all of these things - applies to both discrete and continuous variables.
It's true for all of them.
In fact, eventually we're going to write down Bayes' formula with a mixture of discrete and continuous variables in it: words and acoustic features.
But that won't be a problem because there's nothing in all of what we learned about Bayes that tells us it only works for one or the other.
It works for both discrete and continuous variables
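Written out, the mixed discrete-and-continuous form of Bayes' formula we're heading towards might look like this, where W is a discrete word and O is the continuous acoustic observation (this particular notation is my assumption, not fixed by the lecture):

```latex
P(W \mid O) = \frac{p(O \mid W)\, P(W)}{p(O)}
```

Here P(W) is a discrete probability (a prior over words), while p(O | W) is a probability density, because O is continuous; Bayes' formula holds regardless.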
So I'm not going to go over that material again.
I'm going to assume that we know that.
So let's proceed.
We can use probability distributions to describe data, as a kind of summary of the data
when we do that, we're making an assumption about how those data are distributed.
For example, if we describe data using a Gaussian distribution we assume that they have a Normal distribution
we might even make some stronger assumptions: that they have no covariance.
But before choosing the Gaussian, let's stay a little abstract for a while.
We're not going to choose any particular model.
We're going to introduce a concept: a difficult concept.
This will be difficult to understand the first time you encounter it.
It's a different way of thinking about what we do when we have a model.
What is a model?
What is a model for?
What can it do?
We're going to actually build the simplest possible form of model.
So a generative model is actually as simple as it gets, because a generative model can only do one thing: it can generate examples from its own class
so there's a difficult concept to learn.
But it's necessary because it allows us to build the simplest form of model: the generative model.
So here's the conceptual leap.
Here are three models of three different classes A, B and C.
We're keeping this abstract!
I'm not saying anything about what form of model we're using
These are black boxes
and the only thing that a model can do (for example, Model A here) is emit observations.
Let's make it do that.
The observations here are balls, and they have one feature: the colour.
So that just emitted a blue ball.
Well, that's all this model can do.
We can randomly emit observations: we can generate from the model.
So let's generate.
There's another one generating observations.
The observations are data points
and so on, and we keep generating.
So this is a random generator, and it generates from its distribution.
So inside Model A is a distribution over all the different colours of balls that are possible.
And it randomly chooses from that, according to that distribution, and emits an observation.
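That "choose randomly according to an internal distribution and emit" behaviour can be sketched in a few lines of Python (the class name and the example distribution are illustrative, not from the lecture):

```python
import random

class GenerativeModel:
    """An abstract generative model: the only thing it can do is emit
    observations drawn at random from its internal distribution."""
    def __init__(self, distribution):
        # distribution: dict mapping an observation (a ball colour) to a probability
        self.distribution = distribution

    def emit(self):
        colours = list(self.distribution)
        weights = list(self.distribution.values())
        return random.choices(colours, weights=weights)[0]

# A hypothetical Model A that emits blue and red balls, but never green
model_a = GenerativeModel({"blue": 0.4, "red": 0.6})
print(model_a.emit())  # randomly 'blue' or 'red'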
So keeping things abstract and not making any statements about what form the model has (we haven't yet given it any particular probability distribution), what are we going to say about Model A?
Well, we could imagine it essentially contains lots and lots of data from Class A - like that
and we might have other models, too, for other classes.
And we could imagine for now that they just contain a lot of examples of their classes ready to be generated - ready to be emitted
What about Model B and Model C?
You can see that it seems likely that Model A is going to emit a different distribution of the values (of colours) to Model B
and Model B is going to emit a different distribution to Model C.
So let's see what we could do with something like that.
these are generative models: all they can do is emit observations
already, and keeping things very abstract, without saying anything about what type of model we're using, let's see that this simple set of three models (A, B and C - one for each of the classes) can already do classification.
Even though all the models can do is emit observations.
They don't make decisions, but we can use them to make decisions: to classify.
So let's see how a set of generative models can together be used to classify new samples (test samples)
So in comes a data point.
There it is: a ball, and its feature is 'green'
I would like to answer the question "Which of these models is most likely to have generated it?"
If I can find the most likely model, I will label this data point with the class of that model
we'll keep the models abstract by pretending that they just contain big bags of balls
So that's what the models look like inside.
So we try and generate from the model, and we make the model generate this observation.
One way of thinking about that, if you want to think in a frequentist way (a 'counting' way) is to emit a batch of observations from each model and count how many of them are green - that's the frequentist view.
The Bayesian view is just to say, "What's the fraction / What's the probability of emissions of observations from our model that are green?"
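The frequentist "emit a batch and count" view can be sketched like this (the proportions for Model C are illustrative, matching the roughly-equal blue/yellow/green mix described below):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# A hypothetical Model C that emits blue, yellow and green in equal proportion
model_c = {"blue": 1 / 3, "yellow": 1 / 3, "green": 1 / 3}

# Frequentist view: emit a batch of observations and count how many are green
batch = random.choices(list(model_c), weights=list(model_c.values()), k=10000)
fraction_green = batch.count("green") / len(batch)
print(round(fraction_green, 2))  # close to 1/3
```

With a large enough batch, the counted fraction converges on the probability the Bayesian view reads directly off the distribution.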
Now we can look at Model A
it's pretty obvious that it's never going to emit a green ball.
So the probability of this green ball having been emitted from Model A is a probability of zero.
I'll write it with one decimal place for precision.
Look at Model B.
It's also clear, just by looking, that Model B never emits green balls either.
The probability of it emitting a green ball is 0.0.
Now let's look at Model C.
You can see that model C emits blue balls and yellow balls and green balls - let's say - with approximately equal proportions.
So there's about a 0.3 probability of model C emitting a green ball.
Clearly, this is the largest probability of the three, so we'll classify this as being of class C.
Notice that each model could compute only its own probability.
It didn't need to know anything about the other models to do that
each model made no decisions.
It merely placed a probability on this observed green ball.
We then made the decision by comparing the probabilities.
So the classifier is the comparison of the probabilities of these models emitting this observation.
That's what makes these models simple: that they don't need to know about each other.
They're all quite separated, and in particular when we learn these models from data we'll be able to learn them from just data labelled with their class.
They won't need to see counter-examples.
They won't need to see examples from other classes, and that's the simplicity of the generative model.
Other models are available: models that might be called 'discriminative models', which know where the boundaries between classes are - for example, between Class A and Class B - and can draw that decision boundary and directly make the decision as to whether an observation falls on the A side or the B side of that classification boundary.
These models do not do that.
We have to form a classifier by comparing the probabilities.
Let's classify another observation.
So along comes an observation and our job is to label it as being most likely Class A or Class B or Class C.
We just do the same operation again.
We ask Model A: "Model A, what is the probability that you generate a red ball?"
We can see just visually that Model A emits red balls maybe about two thirds of the time.
So Model A says "The probability of me emitting a red ball is about 0.6."
Ask Model B: "Model B, what's the probability that you emitted a red ball?"
Well, we look, and Model B emits red balls and yellow balls and blue balls in about equal proportion.
It says "I can emit red balls, and I do it about a third of the time."
"Model C, what's the probability that you emitted the red ball?"
Well, it never emits a red ball: 0.0.
How do we make the classification decision?
We compare these three probabilities.
We pick the largest and we label this test sample with the class of the highest probability model, which is an A.
So this is labelled as A
We have done classification.
So you get the idea, along comes a sample.
You ask each of the models in turn: "What's the probability that this is a sample from you / from your model / from your internal distribution (whatever that might be)?"
We compute those three probabilities and pick the highest and label the test sample with that class.
So, as simple as generative models are, we can do classification simply by comparing probabilities.
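The whole procedure - ask each generative model for the probability of the observation, then label with the class of the most probable model - can be sketched as follows; the probabilities are illustrative numbers mirroring the coloured-ball example, not values from the lecture:

```python
# Each generative model is just a distribution over observations
# (illustrative numbers mirroring the coloured-ball example above).
models = {
    "A": {"red": 0.6, "yellow": 0.2, "blue": 0.2},   # never emits green
    "B": {"red": 0.3, "yellow": 0.35, "blue": 0.35}, # never emits green
    "C": {"blue": 0.35, "yellow": 0.35, "green": 0.3},
}

def classify(observation):
    """Ask each model for the probability of emitting the observation,
    then label with the class of the most probable model."""
    probs = {label: dist.get(observation, 0.0) for label, dist in models.items()}
    return max(probs, key=probs.get)

print(classify("green"))  # 'C' - only Model C ever emits green
print(classify("red"))    # 'A' - 0.6 beats 0.3 and 0.0
```

Note that, just as in the lecture, each model computes only its own probability; the decision comes from comparing those probabilities afterwards.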
