Gaussian distributions in generative models

Using Gaussian distributions to describe data and as generative models

This video just has a plain transcript, not time-aligned to the video.
THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS
Let's move on from discrete things, from coloured balls to continuous values because that's what our features for speech recognition are
they are extracted from frames of speech, and so far where we've got with that is filterbank features: a vector of numbers.
Each number is the energy in a filter (in a band of frequencies) for that frame of speech.
So we need a model of continuous values
the model we're going to choose is the Gaussian, so I'm going to assume you know about Gaussians from last week's tutorial.
But let's just have a very quick reminder.
Let's do this in two dimensions.
So I just take two of the filters in the filterbank and draw that two dimensional space.
Perhaps I'll pick the third filter and the fourth filter in the filterbank.
each of the points I'm going to draw is the pair of filterbank energies: a little feature vector.
So each point is a little two dimensional feature vector containing the energy in the third filter and the energy in the fourth filter
so: lots of data points
I would like to describe the distribution of this data with a Gaussian, and it's going to be a multivariate Gaussian
the mean is going to be a vector of two dimensions.
and its covariance matrix is going to be a 2x2 matrix.
I'm going to have here a full covariance matrix, which means I could draw a Gaussian that is this shape on the data.
We've made the assumption here that the data are distributed Normally and so that this parametric probability density function is a good representation of this data.
So I can use the Gaussian to describe data.
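As a rough illustration (not part of the original video), here is how one might fit such a two-dimensional Gaussian to a set of feature vectors with numpy; the data array is invented purely for this sketch.

```python
# Minimal sketch: estimate the mean vector and full covariance matrix
# of some hypothetical 2-D filterbank-energy vectors (invented data).
import numpy as np

rng = np.random.default_rng(0)
# pretend these are 500 frames of [3rd filter energy, 4th filter energy]
X = rng.multivariate_normal([5.0, 4.0], [[1.0, 0.6], [0.6, 0.8]], size=500)

mean = X.mean(axis=0)            # 2-dimensional mean vector
cov = np.cov(X, rowvar=False)    # full 2x2 covariance matrix

print("mean vector:", mean)
print("covariance matrix:\n", cov)
```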
But how would we use the Gaussian as a generative model?
Let's do that.
But let's just do it in one dimension to make things a bit easier to draw.
I've got my three models again
By some means (yet to be determined) I've learned these models - they've come from somewhere
these models are now Gaussian
So this is really what the models look like.
Model A is this Gaussian
It has a particular mean and a particular standard deviation.
along comes an observation.
So these are univariate Gaussians: our feature vectors are one-dimensional feature vectors
So along comes a 1-dimensional feature vector (it's just a number)
the question is "Which of these models is most likely to have generated that number?"
the number is 2.1
Remember that the Gaussian can't compute a probability - that would involve integrating the area between two values.
So, for 2.1 all we can say is, "What's the probability density at 2.1?"
So off we go
2.1 this value ... 2.1 this value ... 2.1 this value.
Compare those three.
Clearly, this one is the highest.
And so we'll say this is an A.
That's how we'd use these three Gaussians as generative models.
We'd ask each of them in turn, "Can you generate the value 2.1?"
For a Gaussian, the answer's always "Yes!", because all values have non-zero probability (density).
So of course, we can generate a 2.1.
What's the probability density at 2.1?
We just read that off the curve because it's a parametric distribution
and compare those three probability densities
so we do classification with the Gaussian.
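To make that concrete, here is a small sketch of the classification step just described; the means and standard deviations of models A, B and C are made up, and scipy is simply one convenient way to evaluate the densities.

```python
# Sketch: classify the observation 2.1 by comparing probability densities
# under three univariate Gaussians (parameters are invented examples).
from scipy.stats import norm

models = {
    "A": norm(loc=2.0, scale=0.5),
    "B": norm(loc=4.0, scale=0.7),
    "C": norm(loc=6.0, scale=1.0),
}

x = 2.1
densities = {label: m.pdf(x) for label, m in models.items()}
best = max(densities, key=densities.get)

print(densities)                 # probability density of x under each model
print("classified as:", best)    # the model with the highest density wins
```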
Let's just draw the three models on top of each other to make it even clearer.
What's the probability of 2.1 being an A or a B or a C?
Well, just go to 2.1 and you've got this probability density, this probability density, and this probability density
Clearly A is the highest and it's an A
By drawing the models on top of each other, we can actually see the implicit decision boundaries between the classes.
It's obvious that up to here A is always the highest value.
So this whole region here will always be labelled A
this region in the middle here, the B probability density function is higher than the other two so everything in that region will be labelled as a B
and for the remainder, going this way, everything will be labelled C
So these three Gaussian generative models - whilst not knowing anything about each other - when we lay them on top of each other, we can see that they do form decision boundaries.
These three Gaussians form a classifier, and it has boundaries here and here
and divides feature space (which is this whole range of the variable) into three parts and labels one as A, one as B, one as C
but those boundaries are never stored.
We never need to know those
they arise simply by comparing the probabilities (in fact, the probability densities) of any observation value
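Continuing the same sketch (same invented model parameters as above), sweeping the observation value along the axis shows those implicit decision regions emerging purely from the density comparison:

```python
# Sketch: the decision regions appear implicitly when we compare densities
# at many observation values (no boundary is ever stored).
import numpy as np
from scipy.stats import norm

models = {"A": norm(2.0, 0.5), "B": norm(4.0, 0.7), "C": norm(6.0, 1.0)}

for x in np.linspace(0.0, 8.0, 17):
    label = max(models, key=lambda k: models[k].pdf(x))
    print(f"x = {x:4.1f}  ->  {label}")
```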
that works in two dimensions.
pick some pair of filters
Let's pick the 4th one and the 5th one
Let's have two classes now, so let's have class green.
It's got the mean here.
That's the mean of class green.
It's got some standard deviation in the two feature directions, so let's draw a Gaussian there.
This has got full covariance
Let's have class Purple
Let's have its mean and its standard deviation in all directions.
Maybe that looks like this: it's much tighter
Now there's going to be - as we move around this feature space, trying all these different points - some points that are more likely to be green and some that are more likely to be purple
there is an implicit decision boundary between the two classes, maybe the decision boundary is going to look something like this, perhaps
everything here has got a higher probability (higher probability density) of coming from the purple distribution and everything here has got a higher probability density of coming from the green distribution.
This classification boundary between the two classes is never drawn out explicitly.
If it were, that would be called a 'discriminative' model.
That's not what we're doing
This is a generative model and we have to find that boundary simply by comparing the probability densities of the two models.
And if there are more models, we will get more complicated decision boundaries.
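The two-dimensional, two-class case can be sketched in the same way; the means and covariances for 'green' and 'purple' below are invented (with 'purple' deliberately much tighter), and the boundary again appears only through the density comparison.

```python
# Sketch: classify 2-D points by comparing the densities of two
# full-covariance Gaussians (all parameters invented for illustration).
from scipy.stats import multivariate_normal

green = multivariate_normal(mean=[4.0, 5.0],
                            cov=[[1.5, 0.8], [0.8, 1.2]])
purple = multivariate_normal(mean=[6.0, 4.0],
                             cov=[[0.3, 0.1], [0.1, 0.3]])  # much tighter

for point in [(4.0, 5.0), (6.0, 4.0), (5.0, 4.5)]:
    label = "green" if green.pdf(point) > purple.pdf(point) else "purple"
    print(point, "->", label)
```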
So that completes the first part of the class.
We want to use Gaussians because they have nice mathematical properties.
We want to use generative models because they're the simplest sort of model.
We know there isn't anything simpler.
We're going to use Gaussians as the generative model of the feature vectors - feature vectors coming from frames of analysis from our speech waveform
we've seen that generative models can be used to classify
ultimately, the problem of speech recognition is one of classification.
It's one of saying: "Which words were said, out of all the possible words? Which ones were most likely?"
So we're going to do that through generative modelling.
Now our Gaussians are going to be multivariate: they're going to be in some high-dimensional feature space - it's going to have tens of dimensions.
At the moment it's the number of filters in the filterbank.
And if we were to model covariance, that would mean a very large covariance matrix
[The number of entries in the covariance matrix would] be proportional to the square of the dimension of the feature vector
That's bad.
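As a rough count (assuming, say, a 26-filter filterbank, which is just an example size and not a figure from the video): a full covariance matrix needs on the order of the square of the dimension in parameters, whereas not modelling covariance leaves only one variance per dimension.

```python
# Rough parameter count for one D-dimensional Gaussian (D = 26 is just an
# example filterbank size, not a value given in the video).
D = 26
full_cov_params = D * (D + 1) // 2   # symmetric full covariance matrix
diag_cov_params = D                  # one variance per dimension if covariance is not modelled
print(full_cov_params, "parameters vs", diag_cov_params)
```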
So we're going to now do something to the features so that we don't need to model covariance.
We're going to do feature engineering.
