Videos

There are rather a lot of videos in this module, but many of them are very short. Total video to watch in this section: 70 minutes

Concept of generative models

One of the biggest challenges in this course is to think in terms of generative models. The concept was introduced in Module 7, so this is a recap.

This video just has a plain transcript, not time-aligned to the video. THIS IS AN UNCORRECTED AUTOMATIC TRANSCRIPT. IT MAY BE CORRECTED LATER IF TIME PERMITS.
So we're going to go from dynamic time warping to the hidden Markov model, and the conceptual leap we're going to have to make in today's lecture is to stop thinking in terms of templates or exemplars: things that look like the thing we're trying to recognise.
If we're trying to recognise a word, the template is also a word.
We're going to get rid of this template.
We're going to replace it with a statistical model.
The statistical model is going to have a very particular form: it is going to be what we call a generative model.
We're going to have to understand why on earth we would try to make a model that can generate things if what we're really trying to do is classify things.
We'll be working in a framework called the generative model framework, and that's the conceptual leap in today's lecture.
If you can get that, then you've got the main point of today's lecture.
The hidden Markov model is a simple, mathematically elegant and powerful form of generative model.
It's nice to work with mathematically and it's computationally efficient, so we're going to develop this generative model, the hidden Markov model.
So let's go straight into the concept, which is going to seem a bit strange coming straight out of the blue.
Why would we do this? We'll see, as we go through today's lecture, that it's a nice way of thinking about modelling things and then using those models to do things, and the thing we're going to do with them is classification, in speech recognition.
We could imagine doing other things with these models.
For example, we could imagine synthesising speech from these models, and theoretically, this model will be able to synthesise speech.
If you take next semester's course on speech synthesis, you'll see that by developing the model a bit further, we can indeed synthesise speech from these hidden Markov models.
We can use them truly as generative models.
In this course, we're going to describe generative models and then build a classifier out of a set of such generative models.
So that's a bit of a conceptual leap that we need to make.
So here is a very abstract picture of some generative models.
So imagine we've got a very simple classification problem.
We've got just two classes: Class A and Class B.
Maybe they're just two words, "yes" and "no", and we're trying to build a very simple speech recognition system that, when we speak into it, says whether the person said "yes" or "no".
It always outputs one of those two answers, whatever you say.
So we're going to need two models, one for each of those two classes.
A model of Class A (maybe that's the word "yes") and a model of Class B (maybe that's the word "no").
I'm going to build a classifier like this.
The model of "yes" is going to be a generative model: it's a model that's going to generate observations.
Observations are what we see coming out of the model: we observe them, hence "observation".
The observations are always going to be in the domain of feature vectors or sequences of feature vectors.
So they're going to be MFCCs, or sequences of MFCCs, for speech.
Not actual speech waveforms, just these feature vectors.
The model of A is going to be really good at generating examples of the word "yes", and it will randomly generate examples of the word "yes".
So imagine it's got a button on it, and every time I press the button, it squirts out an example of the word "yes", which in this case is going to be a sequence of MFCCs.
It's going to have some duration, and each of these MFCC vectors describes the spectral envelope as we go through the word "yes".
Press the button again, and it spits out another example of the word "yes".
And each time it does that, the duration might be slightly different, within the natural range of durations of the word "yes", and the spectral envelope will naturally vary around the average for the word "yes".
So it's doing something more powerful than dynamic time warping already: every time we press the button, we don't just get the average, and we don't get the same thing every time. We randomly generate from this model.
So this model captures not just the average way of saying the word "yes", but also the allowed variation around that, in terms of durations and in terms of spectral envelope variation.
So the model has a concept of the mean (the average) and a concept of the variance, or the standard deviation, around that average, and it's going to learn those from data.
You don't have to type those in.
And how would we then build a classifier from such generative models? Well, we'll learn a generative model of each class, and these will be learned separately and independently.
So for the generative model of the word "yes", all we need to train that model is lots of examples of the word "yes".
It doesn't know anything about any other words.
It's not deliberately bad at generating examples of the word "no"; it has just never seen any, so it will probably not generate them very often.
Random variation that takes it a very long way from the word "yes" might end up sounding a bit like "no", but that's fairly unlikely.
Model B will be trained only on examples of its class, the word "no", and it will learn to be a good generator of the word "no".
That means it's probably not very good at generating the word "yes".
So we can already see there's some power to this idea of generative models, because we can train these models just from positive training examples.
So they're not learning to discriminate between two classes.
They're just learning to be a model of the distribution of one particular class.
So how would we build a classifier from such a setup? It's going to be extremely easy.
Let's take something that we've got to classify.
So here's a sequence of MFCCs.
It's a word; we don't know if it's "yes" or "no", and we've got to decide.
Is it more likely that this is a "yes" or a "no"? So we go to Model A.
We'll randomly generate from Model A.
There are two ways of thinking about that random generation.
One is that we'll press the random "generate" button many, many times, over and over again.
We'll generate lots and lots and lots of examples.
We'll count how many of them look like this unknown thing,
and use that as a measure of how like "yes" this unknown thing is.
So you can think of it in this frequentist way: we'll sample millions of times from Model A and count, out of all those millions, how many look pretty much like the thing we're trying to classify.
There's actually a better way of thinking about that.
We'll see it when we come on to use a Gaussian and a hidden Markov model to generate.
We can actually force Model A to generate precisely this sequence, and in doing so it can calculate the probability of it generating that sequence.
It puts a probability on it.
And you can think of that as the proportion of times of pressing this button that we actually generate the thing we're trying to classify.
So we press this little button here, we press it on Model A lots and lots and lots of times.
And some of those times we end up matching the thing we're trying to classify.
We'll do the same for Model B, and maybe for Model B we match much more often.
So we'll say this is Class B.
It's the word "no". But we don't actually have to do those millions of generations.
We can just directly compute the probability: the proportion of times we would have generated this. This will become clear as we now develop this model.
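
The two routes described above (pressing the button millions of times and counting, versus computing the probability directly) can be compared in a small sketch. It assumes a one-dimensional Gaussian as the model, which is only introduced in the next video; the parameter values, the "close enough" tolerance and the function names are hypothetical, chosen purely for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Height of the Gaussian curve at x (a probability density)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fraction_close(target, mean, std, tolerance, n_samples, rng):
    """Press the 'generate' button n_samples times and return the fraction
    of samples that land within +/- tolerance of the target."""
    hits = 0
    for _ in range(n_samples):
        if abs(rng.gauss(mean, std) - target) <= tolerance:
            hits += 1
    return hits / n_samples


# Hypothetical model parameters and observation, for illustration only.
MEAN, STD = 0.0, 1.0
TARGET, TOLERANCE = 1.0, 0.05

rng = random.Random(0)  # fixed seed so the run is repeatable
simulated = fraction_close(TARGET, MEAN, STD, TOLERANCE, 200_000, rng)

# Direct route: curve height at the target times the width of the window.
# This treats the curve as roughly flat across the small window.
direct = gaussian_pdf(TARGET, MEAN, STD) * (2 * TOLERANCE)

print(f"simulated fraction: {simulated:.4f}")
print(f"direct calculation: {direct:.4f}")
```

The two numbers should agree closely, which is the point: the direct calculation gives the same answer as the simulation without generating anything.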
Okay, so that's just this generative model framework.
It's actually a very common framework in machine learning.
We might use it when we really do want to generate new examples,
such as in speech synthesis, but more commonly we might use it when we want to classify between things.
We'll build generative models of every one of the classes that we're trying to identify.
And then we'll just let the models fight over the test examples.
Whichever one is the best at generating it (that is, whichever is most likely to have generated it) wins, and we label the test example with that class.
So we need this ensemble of models, all competing: who's the best at generating this particular unseen token? Whoever is the best, that's the label we assign.
So individual models are not classifiers.
All an individual model can tell us is: what's the probability that this unseen test token was generated by this model? In other words, how close is it to my average? How much does it deviate? Is this a likely-sounding "yes", or is this a pretty unlikely-sounding "yes"?
So they're just going to be probability distributions, and all a model can tell us is: is it like the sort of things I would normally generate, or is it unlike those things?
To do classification, we need multiple models that compete.
And what they will compete on is the probabilities.
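
As a rough sketch of models competing on probabilities, the code below scores one unknown observation under a "yes" model and a "no" model and picks the winner. It assumes each model is a single one-dimensional Gaussian and that an observation is a single number; both are simplifications, and the parameter values and the `classify` helper are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log of the Gaussian density at x; logs avoid very small numbers."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


# Hypothetical (mean, standard deviation) for each class model.
MODELS = {"yes": (2.0, 1.0), "no": (-1.0, 1.5)}


def classify(observation, models=MODELS):
    """Score the observation under every model; the highest score wins."""
    scores = {
        label: gaussian_log_pdf(observation, mean, std)
        for label, (mean, std) in models.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


for value in (1.8, -1.2):
    label, scores = classify(value)
    print(value, "->", label)
```

Note that no single model makes the decision: each one only reports a score, and the classification comes from comparing the scores.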
There are other frameworks for building classifiers.
We could actually more directly solve a classification problem.
We could build models that directly discriminate between "yes" and "no", that produce an output saying it's a "yes" or a "no", and that are trained on positive and negative examples.
But hidden Markov models are generally not like that.
In general, the main way of training them is as generative models.
That's a much simpler mathematical framework.
As you can already see, it might also be more practical, because each model only sees positive examples of its own class.
So we're going to build a generative model, and it's going to have the properties that we want, which will make a good classifier when combined with generative models for all the other classes.
So all it needs to do is be able to generate anything.
So take any sequence of MFCCs: again, the word that we'd like to classify.
It's unknown.
Our generative model has to be able to generate that sequence.
However far it is from the average, the model can't fail: it must generate it.
It must be able to assign a non-zero probability to any sequence.
But if the sequence is like the sort of things that it was trained on, then it should give a high probability, a high score.
And if it's very unlike the things it was trained on, then it should give a low score, a low probability.
So we're effectively building something that compares unseen things with labelled examples in the training corpus.
But those training examples in the corpus are not stored and used for direct comparison.
They're distilled into a probabilistic model, and that probabilistic model then gives the score.
So we're going to generalise: we're going to learn from a lot of training examples, distilling them down into a small model with a small number of parameters.
And that model can then say, for every unseen thing, how far away it is from the average of the things that we saw during training.
If it's very close, there's a high probability that it's the same class; if it's very far away, there's a low probability of it being in the same class.
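
The idea of distilling training examples into a small number of parameters can be sketched as follows. This assumes a one-dimensional Gaussian model with just two parameters; the training values are made up, and using the population standard deviation (rather than the sample version) is a choice made here, not something stated in the lecture.

```python
import math
import statistics


def train_gaussian(examples):
    """Distil training examples into two parameters: mean and standard deviation."""
    mean = statistics.fmean(examples)
    std = statistics.pstdev(examples, mu=mean)
    return mean, std


def gaussian_pdf(x, mean, std):
    """Height of the Gaussian curve at x (a probability density)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical one-dimensional training values for the class "yes".
yes_examples = [1.9, 2.3, 2.1, 1.7, 2.0]
mean, std = train_gaussian(yes_examples)

near = gaussian_pdf(2.05, mean, std)  # close to the training average
far = gaussian_pdf(3.0, mean, std)    # far from it, but still possible
print(f"mean={mean:.2f} std={std:.2f} near={near:.3g} far={far:.3g}")
```

After training, the examples themselves are no longer needed: only `mean` and `std` are used to score an unseen value, and the score is small but never zero for values far from the mean.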

The Gaussian as a generative model

We can think of the Gaussian generating, or "emitting", observations.

So we'd better have a think about the classes that we're trying to recognise.
In the lab, the classes are going to be just whole words.
They're the digits zero to nine, as whole words, and that's fine.
That's fine if we've got a closed vocabulary: we know that at test time people are not going to walk up to the system and say something outside the vocabulary.
That would work fine for small, closed, fixed vocabularies.
So if we're building a voice dialling application for our phone, we only allow people to say digits.
It will be a closed set.
Whole words would work fine, so this is perfectly reasonable in some applications, but more often than not we're going to build systems that have a large vocabulary.
We'd like to be able to recognise words that we never even saw at training time. It's the same problem as in speech synthesis.
The corpus might not contain the words that we're trying to classify, and therefore we can't have whole-word models, because there'll be nothing to train them on.
So we'll break the words down into smaller units, and we'll probably use something like phonemes.
In fact, we're going to break things down a bit further than phonemes, into something sub-phonetic, and we'll see why as we go through the course.
For now, just think that that's probably because within a single phone, a spoken example of a phoneme, the spectrum changes sufficiently that we might want to model the beginning, the middle and the end of the phone.
We might want to divide it into sub-phonetic units, some smaller units.
So let's now think about the Gaussian distribution, which is why we massaged our features into MFCCs in the first place, and see that this is indeed a form of generative model.
It's not just a way of describing data.
It can generate data, and it has all the properties that we need for our generative model.
In other words, it gives a non-zero probability to everything.
It never goes down to zero. Things near the mean get a high probability.
Things far from the mean get a low probability.
So the Gaussian has all the right properties.
It's mathematically very convenient, so that's our first choice of generative model.
So can we then just use a Gaussian as the generative model, whether the classes are words or phonemes or something smaller than that?
Okay, so dynamic time warping was working fine.
But it has limitations, and the limitations are really that the exemplars need to be stored,
and comparing to single exemplars doesn't really help us generalise across all of those exemplars.
So the history of speech recognition went like this.
We had dynamic time warping with a single exemplar per class. People immediately realised that doesn't capture the variation about that exemplar very well.
So then we had dynamic time warping with multiple templates per class.
So we represented the variance by having 10 or 20 or hundreds of examples per class, and used that to capture the variation, and then realised that that's not a very efficient or effective way of doing things.
So we capture that variance in a mathematical model, a statistical model, and throw away the exemplars.
So we're going to use Gaussians now, instead of the distance measure, the local distance measure.
We then need to worry about how we might build a model of sequences of things from Gaussians.
Dynamic time warping does just two things: it aligns two examples, and it computes local distances and adds them up. Conceptually, that is completely fine.
The local distance measure we looked at was rather naive.
It was a simple geometric measure.
It doesn't account well for variance.
So that's what we're going to fix by using Gaussians instead.
So we replace distance measures with probability distributions.
They're very much like distance measures, but they normalise for the natural variance that's in the training data.
We're only really interested in one probability density function.
That's the Gaussian, because it works well mathematically.
Everything else is harder to work with.
We're now going to think about the Gaussian as a generative model, and build it up into a generative model not just of one frame but of sequences of frames.
But the Gaussian is going to be doing most of the work.
The sequence part is just to make things happen in the right order. So, back to our rather abstract picture of a generative model.
So we've got this model.
It's just a box.
It's a black box (or here, a red box), and it's got a button on it, and we can press the button and it generates an observation.
Press the button: out comes an observation.
Remember, observations are always MFCC vectors.
So press the button: out comes an observation. Fine.
Press the button again: out comes another observation.
And again: another observation.
So a simple way of generating sequences is just to press the button lots of times and then line up the things that were generated, as a sequence.
That sequence is going to have very special statistical properties,
which are actually not going to be ideal for speech, but we'll worry about that later.
How, really, is this Gaussian generating? Well, here's our Gaussian.
Let's draw this.
So that's the coefficient: maybe one element of this MFCC vector.
I can only draw it in one dimension, but it's always going to be multi-dimensional.
There's going to be some coefficient (maybe it's the third cepstral coefficient), and that's the probability density of the third cepstral coefficient.
It has a mean, which we learn from the data (it's just the average of the training samples), and a variance about that mean, or standard deviation.
When we press the button, we just randomly sample a value along this axis: a randomly generated number.
The probability of generating a particular value is just proportional to the height of this curve.
So if we hit this button again and again and again, we'll pretty often sample things quite near the mean, because that's very likely.
Just occasionally we might get a sample down here: one in 1000 times, one in a million times.
It's a generative model, but it likes to generate things near the mean.
So we get lots and lots of samples near the mean, and things far from the mean a lot less often.
And every time we generate from this model, we just independently sample from this Gaussian distribution.
And that means this observation is statistically unconnected to, uncorrelated with, independent of the next one, and the next one, and the next one.
This sequence has a very special property: these samples are independent, which means every time you press the button it independently generates; it doesn't matter what we did at the previous time step.
We don't need to see the past or the future. And they're identically distributed: in other words, they all come from the same Gaussian distribution.
If you want the fancy statistical term, that means they're i.i.d. That's rather naive and simplistic.
It doesn't sound like speech behaves like that, really, because the sound slowly evolves.
That might be a bit of an issue. So a Gaussian can be a generative model.
We press a button on it, and it gives us a random sample.
I can only draw the picture in one dimension, but the random samples are really vectors of MFCCs from this multi-dimensional Gaussian.
And to generate sequences of things, we can just repeatedly generate from the model and generate a sequence that goes through time. So we've got a generative model of sequences of things.
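
Pressing the button repeatedly can be sketched as drawing independent samples from one Gaussian. The sketch below assumes a one-dimensional Gaussian with made-up parameters (real MFCC vectors are multi-dimensional), and `generate_sequence` is a hypothetical helper name.

```python
import random


def generate_sequence(mean, std, n_frames, rng):
    """Press the button n_frames times: each draw is independent of the
    others and comes from the same Gaussian (i.i.d.)."""
    return [rng.gauss(mean, std) for _ in range(n_frames)]


MEAN, STD = 0.0, 1.0  # hypothetical parameters
rng = random.Random(1)  # fixed seed so the run is repeatable
samples = generate_sequence(MEAN, STD, 10_000, rng)

# Count how often we land near the mean and how often far from it.
near_mean = sum(abs(x - MEAN) <= STD for x in samples) / len(samples)
far_out = sum(abs(x - MEAN) > 3 * STD for x in samples) / len(samples)
print(f"within 1 std of the mean: {near_mean:.3f}")
print(f"more than 3 std away:     {far_out:.4f}")
```

For a Gaussian, roughly 68% of draws fall within one standard deviation of the mean and only about 0.3% fall more than three standard deviations away, so the counts should reflect the model "liking" values near the mean.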
Let's think about whether that's going to be good enough to replace our dynamic time warping model.
Now for this idea of generation: every time we generate a vector, the model can, as a by-product, tell us the probability of having done that.
Let's call this vector O, or in this sequence, O1.
The model can tell us the probability of O1 as a by-product of generating it. So we can actually randomly sample from the model,
or we can show it a sequence and say: just tell me the probability that you would have generated this, rather than actually doing the simulation.
So let's make that really concrete: how are we really going to do that?
So we've got this Gaussian, and it's got two parameters, a mean and a standard deviation. The variable here is just x, and when we randomly generate from this Gaussian, we're quite likely to pick things near the mean.
Another way of thinking about that is that, given an observation that has a value x, we can calculate the probability: in other words, how many times on average, say out of 100, would we expect to generate this value? We can just read it off the curve.
So if you tell me a value for x (let's call it x1), all I need to do is go up to the curve and read off the value.
It is just the height of the curve, and that tells me the probability.
So if I try some values, and we ask this Gaussian to generate them, it can tell us immediately what the probability of that happening is.
So let's try generating some values.
Let's generate this value, which is really close to the mean.
So what do we think: will the probability of generating this particular value be high or low? High, right.
In other words, if we pressed the button millions of times, we'd expect to see this value, or values close to it, quite often.
Go up there, read it off, and indeed we get a high value.
We could have a value here.
Is it going to be high or low? High as well.
Of course: it's also very close to the mean.
It's just on the other side, but the curve is symmetrical.
So we go up and get the same value. Just occasionally, maybe we get an observation here.
It's very far from the mean: very low probability.
So this Gaussian can generate this unlikely value here.
It just doesn't like doing it very much, so it will give us a low probability of doing so.
These tails never go to zero.
They go on forever.
They never quite get to zero; they just get really, really small.
And so this probability density function can give a non-zero value to any observation, even crazy things right out here that are clearly not of this class.
It will give a non-zero value, but it will be very, very small.
So a Gaussian is a generative model, and we don't need to do the simulation.
We can just directly read off the probability for any value.
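
Reading a value off the curve amounts to evaluating the Gaussian density formula. The sketch below does that for a value just above the mean, the mirror-image value just below it, and a value far out in the tail. The parameters and test values are hypothetical. Strictly speaking, the height of the curve is a probability density rather than a probability, but it serves the same purpose when comparing models.

```python
import math


def gaussian_pdf(x, mean, std):
    """Height of the Gaussian curve at x:
    exp(-(x - mean)^2 / (2 std^2)) / (std * sqrt(2 pi))."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


MEAN, STD = 0.0, 1.0  # hypothetical parameters

just_above = gaussian_pdf(MEAN + 0.3, MEAN, STD)  # close to the mean
just_below = gaussian_pdf(MEAN - 0.3, MEAN, STD)  # same distance, other side
far_away = gaussian_pdf(MEAN + 6.0, MEAN, STD)    # far out in the tail

print(f"just above the mean: {just_above:.4f}")
print(f"just below the mean: {just_below:.4f}")
print(f"far from the mean:   {far_away:.3g}")
```

The first two values are equal because the curve is symmetrical, and the third is tiny but still greater than zero.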
What is the probability that the Gaussian generated it? In other words, what is the probability that this observation belongs to this class that we're modelling with this Gaussian?
Is it "yes"?
Is it "no"?
Imagine this is the Gaussian for "yes".
Then that is just the probability that this observation belongs to the word "yes", for an individual frame.
And then we can just do that for each of the frames in the sequence.

The Bayesian view

This is a way of understanding probability that will be most useful for understanding HMMs.

At school, maybe you do some basic statistics.
You do a great experiment in maths class: get some coins, toss them, and write down how many times each side comes up.
That fills a lesson very nicely. Then you work out the probability distribution of a coin or a die or anything like that.
So the way of thinking about probability at school level, or maybe at lower undergraduate level, is this idea of events that happen many, many times.
We count them, and we look at the proportion of times things happen: how many times it's heads, how many times it's tails.
That's a perfectly reasonable way of thinking and reasoning about the real world, about probability.
But in terms of modelling, in terms of statistics, there's actually a more powerful way of thinking about it, which is to express these things in a formulaic way that lets us reason with them.
For example, it lets us just take the product of two different probabilities.
That's not thinking in terms of counting random events, which is thinking about how things happen in the real world, or in a simulation.
We're going to take this other view, which is the Bayesian view.
This is a more powerful way of thinking about just the same thing. If I toss this coin now and ask you to guess what I've got here, heads or tails, the frequentist way of saying that would be: well, we'll toss it 100 times.
I'm going to guess that 50 of them will be heads and 50 of them will be tails, on average.
The Bayesian way of thinking about it would be: at the moment, this is half heads and half tails: 0.5 heads, 0.5 tails.
It's in an indeterminate state.
Until we look at it, it's half heads and half tails.
It's in both states at once, and that expresses our degree of belief.
So we've got a 0.5 degree of belief that it's going to be heads and a 0.5 degree of belief that it's going to be tails.
So in our mind, in our belief state, rather than saying it's heads and just being right half the time, we're going to say I half believe it's heads and I half believe it's tails.
I'm going to describe my beliefs not by picking one, and not by counting outcomes over lots of simulation runs; I'm going to have this belief state that represents my uncertainty.
The distribution in my mind, my beliefs about which way up this coin is, is this distribution. Let's draw it in a simple way.
That's going to be the probability of this random variable X, which is the value of the coin.
That's heads (let me spell it right) and that's tails, and they're both going to be 0.5. I've got a distribution over the values.
It's a uniform distribution because I believe the coin is fair.
That's 0.5, 0.5, and that's what I've got in my mind.
That's my model of how this coin behaves.
I don't need to toss it 100 times.
I just believe that that's its distribution.
So if we're learning things from data, we might indeed count things.
But our way of reasoning about things, our way of storing things, the parameters of the model, are going to be in this framework, this Bayesian framework.
The data do exist.
Data are individual points.
There is a number of them.
We can count them, and we can look at their values.
So we're going to have a Bayesian view of the generative model. We've got the model of A.
It generates observations from Class A, and rather than saying "right, we'll generate 1000 things and look at their distribution", inside the model there is just going to be a model of the distribution, and that's going to be a continuously valued distribution.
The distribution is going to be Gaussian.

A generative model of sequences

How to generate a sequence of observations from our model.

So, where are we going?
We're going to get to a model that can generate sequences of Mel-frequency cepstral coefficients, and can therefore assign a probability to any unknown sequence of such MFCCs.
And then we'll compare the probabilities from each of our models for the different classes; whichever one is highest, we'll assign that label.
So far, we've got a Gaussian that can generate a single MFCC vector.
That's okay.
We worked out how to calculate the probability: we just read it off the curve. That works for one dimension, two dimensions, any number of dimensions.
We thought initially about how we might generate a sequence of observations: we just repeatedly generate from the Gaussian. We spotted, or we're just going to state, that these have this special statistical property: they're independent and they have an identical distribution.
So what we need to do now is just formalise that a little, and write it in very simple probability notation.
What is the probability, then, of a sequence of things happening? Can we compute the probability of several things happening based on the probabilities of the individual things? We're going to start with the simplest possible case.
And that's the case where the individual things are statistically independent from each other.
We're going to actually turn out to make this assumption for MFCC vectors, even though it's not true.
We'll fix it up a bit later.
It is just so convenient.
It is such a nice way of thinking about data.
So let's start with some other events that are clearly independent of each other.
Okay, so every morning I get up, and it's always dark. I open my sock drawer and I don't bother looking: I stick my hand in and just pull out a pair of socks.
I only have five different colours of socks.
I'm very boring in my dress sense.
I have a drawer full of socks, and they're uniformly distributed across these five colours.
One of the colours is blue.
Okay, so would anyone like to tell me what the probability is of me wearing blue socks tomorrow? It's dark.
I just reach in: five different colours, randomly sampled.
20%, okay.
Or 1/5.
However we want to say it: one fifth, 0.2, 20%.
Express it how you like.
Okay, that's fine, right? Does that depend on the weather? No: it was dark. So let's think about the chance of it raining tomorrow.
Okay, let's just make a very simple probabilistic model of winter weather in Edinburgh.
It doesn't rain too much of the time.
Let's say it rains one day in every three.
And let's just say the probability of it raining tomorrow has nothing to do with today or the day after: a really naive model of the weather.
So what's the probability of it raining tomorrow? 33%.
One third, or 0.33. Okay, now, what's the probability of both things happening together: that it rains and I wear blue socks? Does one of them depend on the other? No, they're completely independent. So would anyone like to propose an answer?
What's the probability of both things happening, given that there's a one-third chance of it raining and a one-fifth chance that I'll be wearing blue socks? One over 15? Yeah. So we're going to multiply the two things together.
Okay, 1/15. Are people really comfortable with that prediction?
14 out of 15 times, something else will happen.
For example, it won't rain and I'll wear blue socks; it will rain and I won't wear blue socks; it will rain and I'll wear red socks.
We add all of those things together and that will be 14 out of 15 times, but one out of 15 times it will rain and I'll wear blue socks.
Okay, so consider our belief state, in our brains, right now, because it's today, not tomorrow.
Our belief state is that 1/15 of us believes that these two things will happen tomorrow, and the rest of us believes all these other combinations.
So we're maintaining a distribution over all the things that could happen tomorrow.
We don't know, and we don't need to choose between them.
We're just going to maintain that all of them are possible.
The probability of this particular one is 1/15.
So that's the probability.
So it's very intuitive and reasonable.
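
The socks-and-rain arithmetic can be checked by writing out the whole belief state as a joint distribution. This is a small sketch using exact fractions; the lecture only names blue and red socks, so the other three colours below are invented placeholders.

```python
from fractions import Fraction
from itertools import product

# Belief about tomorrow's weather: rain one day in three.
p_weather = {"rain": Fraction(1, 3), "dry": Fraction(2, 3)}

# Five equally likely sock colours ("green", "black" and "grey" are
# placeholder names; only blue and red are mentioned in the lecture).
colours = ["blue", "red", "green", "black", "grey"]
p_socks = {colour: Fraction(1, 5) for colour in colours}

# Independence: the joint probability is the product of the two.
joint = {
    (weather, colour): p_weather[weather] * p_socks[colour]
    for weather, colour in product(p_weather, p_socks)
}

p_rain_and_blue = joint[("rain", "blue")]
print(p_rain_and_blue)        # 1/15
print(1 - p_rain_and_blue)    # 14/15: every other combination
print(sum(joint.values()))    # 1: the beliefs cover all possibilities
```

The dictionary `joint` holds all ten combinations at once, which mirrors the idea of maintaining a distribution over everything that could happen tomorrow rather than choosing one outcome.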
So let's just write that down.
Let's use some notation to get over the fear factor of the notation.
So let's give some notation to these things.
So we use capital letters to denote random variables.
Random variables have outcomes.
So X is the random variable "it will rain tomorrow".
X can take the values "yes, it rains" or "no, it doesn't rain".
And Y is going to be the random variable that's the colour of my socks.
And that can take the colours red, blue, and so on: five different values. Now some notation.
We've already decided that the probability that Y equals blue is one in five: P(Y = blue) = 1/5.
Happy with this notation? Not too scary? Apologies to anyone who's done this before. So we can write expressions like the probability of some random variable; a random variable is a variable that can take multiple values, with some distribution.
You might want to get used to writing things like this.
So capital letters are random variables, which take multiple values.
Small letters are particular instances: particular values they can take.
And so here is what you already intuitively understand; we already decided this.
It's a perfectly reasonable equation.
This notation here, the comma: turn this into a word.
It's "and". Okay, translate it into a sentence.
This is the probability of X and Y.
And if X and Y are independent (in other words, knowing the value of one doesn't tell us anything about the value of the other), it is just the product of the probability of X and the probability of Y. We can plug in the probability of X equals raining and Y equals blue,
multiply them together, and we've got our answer: 1/15.
Okay, we'll do something slightly trickier with probability later.
Not much trickier, but for now, let's go with this one.
This is a nice equation.
This says we can compute the probability of two things happening just by multiplying the independent probabilities of each of them happening.
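
Written out, the equation being described is the product rule for independent random variables, followed by the worked example with the values from the lecture. The last line extends the same rule to a sequence of independent observations; the symbols o_1, ..., o_T for the observations are notation assumed here rather than taken from the slides.

```latex
% Product rule for independent random variables X and Y
P(X = x,\, Y = y) = P(X = x)\, P(Y = y)

% Worked example: rain tomorrow and blue socks
P(X = \text{rain},\, Y = \text{blue})
  = P(X = \text{rain})\, P(Y = \text{blue})
  = \frac{1}{3} \times \frac{1}{5}
  = \frac{1}{15}

% The same rule applied to T independent observations (assumed notation)
P(o_1, o_2, \ldots, o_T) = \prod_{t=1}^{T} P(o_t)
```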
That's great.
That's great.
Simple maths that looks a nice sort of thing that something might want to do.
Computation.
Let's use it to compute the probability of a sequence.
Here's a Gaussian.
I'm drawing the Gaussians in one dimension.
Maybe that's one cepstral coefficient, but they're really multivariate Gaussians.
Remember the big, colourful hill-shaped thing? They really have 13 or 39 dimensions.
We're going to draw them in one dimension because it's the only thing I can draw.
And that's the probability density function of this variable.
And so let's generate a sequence of things with some Gaussians.
And here's some speech.
It's a word.
We're trying to recognise the word. We've already said that speech changes: the spectrum changes as we go through time, and it changes relatively slowly. So a reasonable way of generating a sequence of observations that would correspond to a speech signal
would be that the distribution of the MFCCs (in other words, the spectral envelope) is roughly constant for a little while.
It's from the same statistical distribution.
So maybe up to this point, it's all coming from this one distribution.
Say this is the beginning of a phoneme; then after that point it changes a bit, and it comes from a slightly different distribution.
After that point, it changes again and comes from this other distribution.
So a reasonable model for generating a sequence of things for something like speech, which has a slowly changing spectral envelope through time, would be to use a Gaussian to generate a few frames of speech.
Remember, these frames are just every 10 milliseconds.
So we generate a bit that corresponds to this frame, generate a bit, generate a bit, and then decide it's changed enough that the statistical properties have changed, and then we switch to a second distribution, or a third distribution.
So within short regions of time, it seems reasonable to generate from a single Gaussian.
In other words, the distribution is constant.
These samples are independent and identically distributed.
It's like saying that the average spectrum is constant and varies about that average by the same amount, just for this bit of time.
And then it changes, and it changes, and it changes.
So our questions are going to be: how many frames do we generate from a Gaussian before moving on to a different Gaussian, to a different distribution? And in what order do we go through these Gaussians? And that's what the Markov model is going to do for us.
We also then need to be able to compute the probability of this sequence.
So we've got an observation sequence: that's observation one,
that's observation two, that's observation three.
The whole sequence I'm going to call big O, and I want to get the probability of the whole observation sequence
having been generated by these particular Gaussians. I'm going to make the assumption that we made about my socks and the weather, this radically simplifying assumption that turns the equation into a really simple equation: that the probability of this thing coming from a Gaussian does not depend on any of this other stuff.
So we can just compute it independently at this moment in time, and the probability of this thing doesn't depend on any of this stuff.
We can compute that independently in time.
This is a different thing.
No: the decorrelation earlier was within the vector for a single frame.
Okay, they're related assumptions. We decorrelated within the features because we didn't want to model covariance, because it has a lot of extra parameters. For similar reasons, in the abstract sense,
we're also going to assume that this is not correlated with this, because we don't want to have to put parameters in our model, a model that we're going to train from data, just to deal with that.
In fact, I think those parameters would radically change the model.
It wouldn't be a hidden Markov model any more. We'll see when we get to the hidden Markov model:
it's so simple, so nice, that we're going to go to extreme lengths to be able to use it.
We like it so much. So these things are conditionally independent, given the model generating them.
One thing doesn't depend on the other. From before, we can just write that the probability of O,
which is just the probability of o1 and o2 and o3 and so on (the comma is just turning into the word "and"), is just the probabilities of them all, independently, multiplied together. It was like the socks and the weather.
So we're going to say they're statistically independent.
Is it true for speech? Certainly not, but we're getting familiar with this general theme.
We find a model that is good for computation.
It's easy to learn from data.
It's efficient.
It has nice mathematical properties.
We'll massage our problem until it looks like the sort of thing we can do with this model.
We've already done it with the features.
We chose the Gaussian, and then we did quite a lot of unusual things to the features to make them look Gaussian.
If we weren't using a Gaussian,
we wouldn't need to do that.
So we're going to make this massive simplifying assumption.
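The independence assumption above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's own code: the Gaussian parameters, the observations and the choice of which Gaussian generates each frame are all hypothetical, and everything is one-dimensional. Strictly, a Gaussian gives a probability density rather than a probability, so the product is a density too.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a one-dimensional Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical (mean, std) pairs for three Gaussians used one after another.
gaussians = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]

# Hypothetical observations, and which Gaussian is assumed to generate each.
observations = [0.1, -0.2, 2.9, 3.3, 6.2]
which_gaussian = [0, 0, 1, 1, 2]


def sequence_density(obs, assignment, models) -> float:
    """Product of per-frame densities: one independent term per frame."""
    total = 1.0
    for o, g in zip(obs, assignment):
        mean, std = models[g]
        total *= gaussian_pdf(o, mean, std)
    return total


p_sequence = sequence_density(observations, which_gaussian, gaussians)
print(p_sequence)
```

Here the assignment of frames to Gaussians is simply given. Deciding how many frames each Gaussian generates, and in what order, is the part the Markov model takes care of.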

Duration

Now we have a sequence of observations, we need to model its duration.

So we're almost there:
the hidden Markov model. Speech is a signal that changes through time.
Let's just remind ourselves what some speech looks like.
So there's some speech; it changes through time.
Maybe for this little 25 milliseconds
(that's not 25 milliseconds, but let's pretend it is)
the spectrum is approximately a certain shape, and for the next frame
it's really similar, so those could be generated from the same probability distribution.
But at some point, maybe about here,
things change quite a lot, so we might want to have all of these frames generated by one particular Gaussian.
Maybe that's these guys here.
And then, at some point (this is a word) we may be moving into the next sound in the word. The sounds here are all statistically similar.
They all look like the same distribution.
So they could be generated from a separate Gaussian, and maybe that's these guys here. And then at the end,
maybe all of this stuff has a similar spectral envelope.
It's statistically similar, and that could be generated by this guy here.
Of course, we've got some design decisions to make
about the particular unit we're modelling (is it a phone or a word?) and how many Gaussians in a row we need to generate that. That's something you're going to explore in the lab.
You can vary that.
You can see what the effect of that is on the accuracy of the model.
So we've got to choose. We've got Gaussians
one, two and three.
We could have four, or ten:
any number
we want. This thing that we're modelling
could be a word, which it's always going to be in the lab.
It could be a phone.
It could be anything we want, right? So all we need to do now, given that we're going to generate from a sequence of Gaussians in a row, is make sure that things happen in the right order.
So our model has got three Gaussians in it:
this one, this one and this one.
These are going to be learned from data.
We're going to do the learning of this model at the end of the course because it's the hardest part.
For now, we're just going to pretend we know how to get these Gaussians: by magic.
We're going to give them to you. Given these Gaussians,
our model is just a box: let's call it model A.
It happens to be a model of the word
"yes", and in the box there are some parameters, the parameters of the model, and those are the parameters of these three Gaussians: this one, this one and this one. Remember, all a Gaussian is is just a mean and a standard deviation, or variance.
So we just write numbers down for those things.
Now, what we need to do is this: when we use this model to generate unseen examples, or to assign a probability to an unseen example, we'd better make sure to use these Gaussians in the right order.
So we'd better generate from this one first, then this one, and then this one.
It makes no sense to reorder these things. That would be like doing dynamic time warping where we compare the beginning with the beginning,
then the middle with the end, and then the end with the middle. That never makes sense for speech.
For speech, the ordering of the sequence matters.
So we just need to have a little mechanism to make sure that when we generate from the model, with the three Gaussians in the box, we start with the first one, then the second one, then the third one.
We're just going to do that in the simplest possible way:
a little finite state machine.
The finite state machine has states, a finite number of those states.
In this example there are three. It has transitions that tell us the order we're allowed to go in between the states of the finite state machine.
We could do things like this.
We could go to this state.
We could stay there for a bit, and then we could move on.
Stay there for a bit.
Move on.
Stay in this one for a bit.
Move on.
What we can't do is go backwards.
That would be like doing time
reversal of the speech signal. That never makes sense, except perhaps for silence.
So we're going to pop the Gaussians into the finite state machine just to get this ordering constraint.

Live class

Here is a link to a recording of the class.

Find the recording in the General channel on Teams, or via this link.

The complete model

Finally we put all those pieces together and we have the complete HMM model, which is actually rather simple.

Okay, so we've now got our generative hidden Markov model. Inside the states are the Gaussians.
There's a Gaussian in there,
there's another Gaussian in there, and another Gaussian. The finite state machine says: first go here, take this Gaussian, generate some frames, in other words randomly sample from the Gaussian.
The finite state machine then says: randomly, at each time step,
either stay in the same state or go to the next state. There will be some probabilities on those things, which we'll come on to later.
Then we move on to the next state, and the next state.
Okay, so there's a hidden Markov model.
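The generative procedure just described (sample a frame from the current state's Gaussian, then randomly stay or move on) can be sketched as follows. All parameter values are hypothetical, the Gaussians are one-dimensional for simplicity, and the function name `generate` is invented for this illustration.

```python
import random


def generate(means, stds, self_loops, rng):
    """Sample an observation sequence from a left-to-right HMM.

    Each state emits one frame on arrival; a biased coin then decides
    whether to stay (self-loop) or move on to the next state.
    """
    observations, states = [], []
    for state, (mean, std, stay) in enumerate(zip(means, stds, self_loops)):
        while True:
            observations.append(rng.gauss(mean, std))  # emit a frame
            states.append(state)
            if rng.random() >= stay:                    # coin says: move on
                break
    return observations, states


rng = random.Random(0)  # fixed seed so the sketch is repeatable
obs, state_seq = generate(
    means=[0.0, 3.0, 6.0],       # hypothetical Gaussian means
    stds=[1.0, 1.0, 1.0],        # hypothetical standard deviations
    self_loops=[0.8, 0.8, 0.8],  # hypothetical self-loop probabilities
    rng=rng,
)
print(len(obs), state_seq)
```

The state sequence is printed here only for illustration. When the model is used for recognition, only the observations are available, which is where the word "hidden" comes from.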
Let's just wrap up in the last couple of minutes. It's Markov.
Markov was a statistician, I guess,
but his name has become associated with a particular property of the model:
it is memoryless. In other words, as soon as we make a transition and move on to the next frame,
we completely forget what we did in the past, so the probability of things in this frame is independent of things in the past. That lets us use this equation.
We've got a very simple model of duration in the model: that's this thing here. Imagine the probability of going around here is nearly one.
The probability of moving on is one minus that, so it's highly likely we'll stay in the state for a really long time.
Suppose we had a very biased coin, so that when we tossed it we got heads 999 times out of 1000.
We'd go around here many, many, many times before moving on, and then we'd have a long duration here.
The probability on this transition gives us a duration model,
and that duration model is a simple exponential decay model.
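The exponential decay can be made concrete with a short sketch. It assumes, as in the lecture, that a state emits one frame on arrival and then stays with the self-loop probability or leaves with one minus that; the function names are invented for this illustration. Under that assumption the duration follows a geometric distribution.

```python
def duration_probability(d: int, self_loop: float) -> float:
    """P(exactly d frames in a state) = self_loop**(d - 1) * (1 - self_loop)."""
    if d < 1:
        raise ValueError("a state emits at least one frame")
    return self_loop ** (d - 1) * (1.0 - self_loop)


def expected_duration(self_loop: float) -> float:
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop)


SELF_LOOP = 0.999  # the biased coin from the lecture: 999 heads in 1000

print(expected_duration(SELF_LOOP))        # about 1000 frames on average
print(duration_probability(1, SELF_LOOP))  # the single most likely duration
print(duration_probability(2, SELF_LOOP))  # slightly smaller, and so on
```

Note that the most probable single duration is always one frame, however large the self-loop probability: the distribution only ever decays. That is a consequence of this simple duration model, not a property of speech.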

Conditional independence of observations

The HMM assumes each observation is independent of all the others, given the state that emitted it.

So we've done this very simple sort of probability: if two things are statistically independent, that means that knowing the value of one doesn't tell us anything about the value of the other one.
We can compute the joint probability of these two things happening by just multiplying their individual probabilities.
We talked about my socks and the weather for quite a while.
We decided the weather is not informative about my socks because I get dressed in the dark.
I don't know what the weather is; I just pull a random pair on.
Of course,
the weather doesn't depend on my socks. The weather depends on
lots of other things.
That's a nice form of probability: for two independent events, we can compute the probability of both of them happening by just multiplying their probabilities.
We've already applied that to the hidden Markov model.
The hidden Markov model is going to make that assumption about consecutive observations,
consecutive MFCC vectors.
The MFCC vector at this time is independent of the one at the previous time.
We're going to develop that idea a little bit more: they're actually conditionally independent,
given the model that's generating them.
And so the probability of a whole sequence of observations coming out of a model is just the probability of each of the individual observations multiplied together.
That's beautifully simple maths, and we can see that's going to lead to really nice computational algorithms.
So it's going to make it easy to learn the model from data.
Say I give you some models and I tell you their parameters
(so the means and variances; here they're just one-dimensional Gaussians) and I give you an observation sequence.
Then we can work out the probability that this model generated this observation sequence, and the probability that this other model generated
this observation sequence, compare the two numbers, and announce which model was more likely to have generated the observation sequence. And what we're actually computing there is the probability
(let's give this some notation) of the observation sequence,
O, given one of the models. We're going to look today at
how that notation works: it's going to use this bar. The model is going to model a word, so we say "given the word". Given the word, we can compute the observation sequence
probability, and that probability will depend on which of the models we choose to generate it.
So this observation sequence's probability depends on the model that we use to generate it.
One model will generate it with higher probability than the other.
Okay, so this bar notation is just the word "given". And remember this other notation, the comma notation, that just means "and". It's just notation: it turns into nice English words that mean things.
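As a hedged sketch of comparing two models, the code below computes P(O | W) for two hypothetical word models and announces the more likely one. To keep it short, each "model" is reduced to a single one-dimensional Gaussian, which is a simplification of this sketch and not the lecture's model; the parameter values, the observations and the second word are made up.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood(observations, model):
    """P(O | W): product over frames, given the model for word W."""
    mean, std = model
    p = 1.0
    for o in observations:
        p *= gaussian_pdf(o, mean, std)
    return p


# Hypothetical one-Gaussian "models" for two words: (mean, std).
models = {"yes": (1.0, 1.0), "no": (4.0, 1.0)}
O = [0.8, 1.3, 1.1]  # hypothetical observation sequence

scores = {word: likelihood(O, params) for word, params in models.items()}
best_word = max(scores, key=scores.get)
print(best_word, scores)
```

The same observation sequence gets a different probability under each model, and the classifier simply picks the model under which it is highest.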

What is "hidden" ?

The model can use many possible state sequences to generate any given observation sequence. We don't know which one it used, and we don't care!

Now, it's a hidden Markov model.
This word "hidden" means something very special and very simple. All it is is this:
if the observation sequence has got more items in it than there are states, then there is more than one way that the model could have generated that sequence.
In other words, there's more than one alignment between five things and three things.
There's one alignment, and you can think of lots of other alignments. Because we don't really know which one the model used,
we say it's hidden.
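To see how many alignments there are between five things and three things, the sketch below enumerates them. It assumes a left-to-right model with no skips, in which every state generates at least one observation; the function name is invented for this illustration.

```python
from itertools import combinations


def alignments(num_frames: int, num_states: int):
    """All left-to-right state sequences with no skipped states.

    Every state is used at least once, in order, so a sequence is fixed by
    choosing the frames at which the model moves on to the next state.
    """
    result = []
    for cuts in combinations(range(1, num_frames), num_states - 1):
        bounds = (0, *cuts, num_frames)
        seq = []
        for state in range(num_states):
            seq.extend([state] * (bounds[state + 1] - bounds[state]))
        result.append(seq)
    return result


paths = alignments(5, 3)
print(len(paths))  # 6 alignments of five frames with three states
for p in paths:
    print(p)
```

The count grows quickly with the length of the observation sequence, which is why trying every alignment one by one soon becomes expensive.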

Relating DTW to the HMM

The template in DTW is now replaced by a model, but otherwise the methods are conceptually very similar.

Remember this grid? I can't draw on the screen very well. In this grid,
remember, we were aligning two actual recorded examples of a word.
We called one a template because we made it in advance and labelled it.
And the other one is the unknown word we're trying to recognise.
We had to line them up and then measure local differences between them. We reduced the problem to one of alignment (stretching) and of local distance computation, and we saw that there are lots and lots of different ways of aligning them, and we had to search for the one that was the best: the alignment that gave the lowest total distance.
Exactly
the same thing is going to happen in our hidden Markov model.
But now, instead of a template, we have a model.
We're now going to align the model with this unknown sequence of observation vectors.
So this diagram is also a way of thinking about hidden Markov models.
And if you read the older textbooks, particularly Holmes and Holmes, you'll see pictures like this for doing recognition with hidden Markov models, and it's fine for one model, or one template.
When you start joining models together, you start stacking these grids on top of each other, and it gets pretty messy pretty quickly.
If you were doing this in the 1980s, that was how you were thinking about search for connected speech recognition.
We're going to do it in a much neater, cleverer way, with a nice paradigm that's in one of the readings.
It's called token passing.
So the job then is that we want to compute the probability of O given W, written P(O | W).
So we're going to get a model; it will be trained in the next lecture.
Given that model, compute the probability that it generated this observation sequence.
Now the correct way of doing that, by definition, is to add up all the different ways the model could generate it: take all the different state sequences, add up their probabilities, and get the total probability. That will be expensive, because it would be like trying every path in this grid.
Now we could do it in an efficient way, using dynamic programming, but it's still going to take a bit of time to do that.
So what we're going to do is approximate that and, it turns out empirically
(in other words, by experiment),
this approximation is a very, very good one.
Instead of computing
P(O | W)
by summing together all the ways that the model could generate the observations, we're going to find the single best way: in other words, the single path
that's the single most likely way the model could generate
the observations.
We're going to use that to stand in for the sum, and it's going to work just as well, because when one is the biggest, the other will also be the biggest.
That's an empirical result: most of the probability is going to be on the single most likely path,
so we don't need to compute all the other ones.
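The sum-versus-best-path approximation can be checked by brute force on a toy model. The sketch below is an illustration only: the three-state left-to-right model, its parameters and the observations are hypothetical, and it enumerates every state sequence explicitly, which is exactly the expensive thing a real recogniser avoids.

```python
import math
from itertools import combinations

MEANS = [0.0, 3.0, 6.0]           # hypothetical Gaussian means, one per state
STD = 1.0                         # hypothetical shared standard deviation
STAY = 0.6                        # hypothetical self-loop probability
OBS = [0.2, 0.1, 2.7, 3.1, 6.3]   # hypothetical observation sequence


def pdf(x, mean):
    """Density of a one-dimensional Gaussian with standard deviation STD."""
    z = (x - mean) / STD
    return math.exp(-0.5 * z * z) / (STD * math.sqrt(2.0 * math.pi))


def state_sequences(num_frames, num_states):
    """Every left-to-right state sequence that uses each state at least once."""
    for cuts in combinations(range(1, num_frames), num_states - 1):
        bounds = (0, *cuts, num_frames)
        yield [s for s in range(num_states)
               for _ in range(bounds[s + 1] - bounds[s])]


def path_probability(path):
    """Joint probability of the observations and one state sequence."""
    p = 1.0
    for t, state in enumerate(path):
        if t > 0:
            p *= STAY if path[t - 1] == state else 1.0 - STAY
        p *= pdf(OBS[t], MEANS[state])
    return p * (1.0 - STAY)  # finally, leave the last state


probs = {tuple(p): path_probability(p) for p in state_sequences(len(OBS), 3)}
total = sum(probs.values())            # sum over all state sequences
best_path = max(probs, key=probs.get)  # single most likely state sequence
best = probs[best_path]
print(best_path, best / total)
```

For these made-up numbers the best path carries most of the total probability. How good the approximation is in general is an empirical matter, as the lecture says.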

The Viterbi algorithm is dynamic programming

When we apply dynamic programming to an HMM, we call it the Viterbi algorithm.

We're going to
now develop an algorithm, and it's so special
it's got a name.
It's just dynamic programming: dynamic programming applied to this particular form of model, the hidden Markov model, and it's called the Viterbi algorithm.
Viterbi
is a person.
He exists.
I think he's at the University of Southern California, and there's an actual building named after him: the Viterbi School of Engineering.
It's kind of weird for a living person, but that's what goes on.
Although he didn't really invent it.
This is the algorithm that carries his name, as applied to speech recognition,
but it's only dynamic programming.
We'll see later on that some algorithms have generic names like expectation maximisation, and then they have specific names when we apply them to a particular model.
So when we apply that to hidden Markov models, we might call it forward-backward.
We might even call it something more specific, after people's names: the Baum-Welch algorithm.
Okay, so if you're in the game early enough,
you get to have the algorithm named after you, but they are just specific instances of
very general algorithms that are well known, not really invented by these people.
They're just applied to these models.
This is more evidence for why we really, really like using hidden Markov models: because these generic algorithms, like dynamic programming and forward-backward, apply nicely and cleanly to the model.
So that's why we're using HMMs, not because we think they're a good model of speech.

Numerical issues

Probabilities can get very small, so we must take care when storing and computing with them. The most common operation on probabilities is multiplying them. That turns into addition of log probabilities - how convenient!

A quick practical point, which will come up in the practical.
All of this probability stuff seems to involve a lot of multiplying:
one probability by another, a likelihood by a prior. Probabilities are numbers less than one. Multiply together
lots and lots of numbers
less than one and you get smaller and smaller and smaller numbers, and then on the computer
that's a problem, because the number of decimal places is essentially fixed: we have some sort of precision, and the number will just fall off the end.
It'll just become zero.
We'll get something called underflow.
So to avoid that, and to avoid precision problems (because it's hard to precisely represent very tiny numbers), we just work in terms of the log of the probability.
So instead of writing down these tiny probabilities, we write the log of them.
Because a probability is a number less than one, its log is a negative number.
That's okay.
Multiplication in the probability domain just turns into what in the log domain? Addition. Good, people know logs.
That's why logs were invented: to turn hard multiplications into easy additions.
And so all those numbers coming out of HTK are log probabilities.
That's why there are weird things like minus 3605,
and that's something like having nought point, then three thousand zeros.
Okay, you see why we use logs? Try it yourself.
Get your pocket calculator out.
Just type 0.1 and multiply it by itself enough times, and eventually it will just fall off the bottom.
Okay, if you took the log of 0.1 and just added it to itself, you could keep going much, much longer.
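The pocket-calculator experiment can be repeated in Python, which uses double-precision floating point. This is a small sketch; the choice of 400 multiplications is arbitrary, picked to be comfortably past the point where a double underflows.

```python
import math

p = 0.1   # a probability less than one
n = 400   # how many times to multiply it by itself

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p             # multiply probabilities directly
    log_sum += math.log(p)   # add log probabilities instead

print(product)  # underflows to 0.0 in double precision
print(log_sum)  # a perfectly ordinary negative number
```

Once the product has become zero, all information about how probable the sequence was is lost, whereas the sum of logs can still be compared with the score of a competing path.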

The Viterbi criterion

At the core of the Viterbi algorithm is this simple dynamic programming operation: discard all paths that cannot win.

Let's just state the Viterbi criterion for HMMs, compare it to this picture here, and see that they're exactly
the same thing; then we'll just work it through with the algorithm, token passing. In the hidden Markov model,
we know that "hidden" means there's more than one state sequence through the model that could generate any particular observation sequence.
That's the hidden thing: we don't really know which one
the model took. We're imagining that the speech signal we're recognising was actually generated by an HMM.
It wasn't; it was generated by a person, but we're pretending. That's our model.
At the point of generation, we don't really know what state sequence
the model took, inside a black box, basically.
We just have to say: well, it took all possible state sequences, all with different probabilities, and we just sum them together.
And then we've said summing them together is a bit
expensive.
We'll just think about the single best one as a proxy, an approximation, for that.
That's the hidden part.
The Markov part means the model has no memory. It's the fact that we're going to make this assumption,
which is not really true, that the probability of our big observation sequence is just the probability of the first observation times the probability of the second observation, and on and on.
There are no probabilities conditioned on other observations in there.
We're saying that the probability of the first observation does not depend on the second or the third observations.
They're statistically conditionally independent.
Knowing one is not informative about the other one.
It's clearly untrue.
If I show you a little bit of spectrogram and ask you to guess the spectrum either side,
it definitely helps you guess what's happening just before and just after.
It's not random.
However, we're going to make this rather strong conditional independence assumption,
and that's because we're so in love with the Markov
model and all its algorithms.
This is a bad assumption, but it's really hard to find a model that doesn't make the assumption and works better than the Markov model. That independence is the Markov part of the hidden Markov model.
You know what "hidden" is. For "Markov", just imagine saying "memoryless". What that means is that as we go through the model (so here's a bit of model)...
Okay, we take a little walk through the model to generate. Generation is like this. Okay, here we are.
We'll take a little walk through the model.
So off we go into state one.
When we arrive in a state, we emit an observation vector according to the Gaussian in that state.
And then we make a random choice between going around the self-transition and going to the next state.
There are numbers on these.
So we toss a biased coin whose probabilities are the probabilities on these arcs.
Maybe we go around here, and then we emit the next observation.
But the probability of emitting that next observation only depends on the fact that we're in that state: we've forgotten even how we got there.
Memory's wiped, like a goldfish going around the bowl.
Everything is new again.
Okay.
It's a very strong assumption, but it greatly simplifies the maths and the algorithms, both for learning the model and for decoding speech with it.
And what that means, then, is that as we explore the paths through this model, we can explore them in parallel.
So back to this diagram here, on the projector.
See these two paths, A and B.
We're doing parallel exploration of all the possible paths.
Right? Imagine we're trying to find the shortest way from here to the lab, as a class, and what we're going to do is all set off together.
We'll take different routes, right? And then we'll occasionally bump into each other and compare notes.
Whoever has taken longer just retires from the race.
Eventually, one person arrives at the lab, and perhaps one or two other people; they compare notes, and there's one winner. Every time two paths bump into each other,
we only need to keep the best one at that point, because we know that the future is independent of the past, given the present. That's the diagram for dynamic time warping that's showing that: the present means that point in the grid, the future means all the different ways of getting to the end, and the past means how we got there.
We can state exactly the same thing for the hidden Markov model.
The present is the state. The state is everything.
The state, in fact, is all we need to know in order to generate an observation.
So I don't need to know anything else when I'm generating this observation here.
All I need to say is: I'm in this state, give me the Gaussian that belongs to the state, and I'll generate an observation from the Gaussian. That's all we need to know.
We do not need to remember how we got there.
So we don't remember what we did in the previous time steps, and we have no idea what we're going to do in
the future.
So the present is the state.
The future is the transition we're going to take next.
The past is where we came from:
the preceding state, or the preceding state sequence.
And none of those things matter.
The future and the past don't matter.
Only the present matters.
That's the Markov property.
It's saying you just need to know what state you're in, that's all.
You don't remember anything else.
So imagine we're doing this parallel search through the hidden Markov model.
We'll see it in a minute.
One way of arriving at this state at this particular time is this way.
The other way is that we happened to be in this state earlier and we got here through this transition. Those are two ways of arriving in the state, and they're going to have different probabilities because they're different ways of aligning the observations with the model. Before we even generate the next observation,
we can just compare those two paths and say: look, sorry, but I could get here with a better probability than you can.
You can just retire from the race now.
Okay, that's the Viterbi criterion, and it's just a dynamic programming criterion that says you can eliminate paths as soon as
you already know there's no way they can win.
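A minimal sketch of the Viterbi criterion, working in log probabilities, might look like the following. It is an illustration under stated assumptions rather than the course's implementation: the Gaussians are one-dimensional, the model is assumed to start in its first state and finish in its last, the exit transition is ignored, and all parameter values are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, std):
    """Log density of a one-dimensional Gaussian evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def log_or_neg_inf(p):
    """Log of a probability, with log(0) represented as minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(obs, means, stds, log_trans):
    """Log probability of the single best state sequence.

    log_trans[i][j] is the log probability of moving from state i to j.
    """
    n = len(means)
    best = [NEG_INF] * n
    best[0] = log_gauss(obs[0], means[0], stds[0])
    for o in obs[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            # Viterbi criterion: of all the paths arriving in state j now,
            # keep only the most probable one and discard the rest.
            arrive = max(best[i] + log_trans[i][j] for i in range(n))
            if arrive > NEG_INF:
                new[j] = arrive + log_gauss(o, means[j], stds[j])
        best = new
    return best[-1]


# Hypothetical left-to-right model: stay with 0.6, move on with 0.4.
TRANS = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 0.6]]
LOG_TRANS = [[log_or_neg_inf(p) for p in row] for row in TRANS]
OBS = [0.2, 0.1, 2.7, 3.1, 6.3]
score = viterbi(OBS, [0.0, 3.0, 6.0], [1.0, 1.0, 1.0], LOG_TRANS)
print(score)
```

Only one number per state is kept at each time step, so the cost grows with the number of frames times the number of transitions, instead of with the number of possible state sequences.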

Token passing

This is a really nice way to understand, and to implement, dynamic programming in HMMs.

Okay, so let's work that through now with an algorithm called token passing. We're going to introduce things called tokens; tokens are just partial paths.
So path A could be a token.
Okay, so hopefully you've played some board games that have tokens: checkers, draughts, something like that. Imagine
little tokens.
That's going to be a conceptual thing.
It might even be an object in the code called Token.
A token is a partial path.
It makes a little journey.
So at the point where A and B meet,
imagine two little tokens bumping into each other.
They're just going to have a fight, and the one with the best probability is going to win, and the other one's just going to be deleted.
So tokens are partial paths.
And what do they need to know? They just need to know their log probability, so they can be compared with other tokens: a little number stored on them, and that number is updated every time they make a transition
and every time they emit an observation. How is it updated? Well, it just multiplies by the probability of the transition and multiplies by the probability of the emission. Because we're in logs, we actually just add in those log probabilities.
And if we want to be able to look at a token that pops out of the end and say, "Which model did you come through?", then we might also record the sequence of states or models that the token went through.
It might actually record the path. For a single, isolated model,
we don't even need that.
So here's token passing.
We need a model.
So let's just draw a little model.
So we're ready to go with token passing. Here's a hidden Markov model.
It's a bit different to the ones in the lab, and
we can draw pictures of it.
That's the best way of picturing it.
And there are these transition probabilities. Hopefully you've already had a look inside the model files that you're training in the lab, and they have those transitions stored as a matrix. This model
has this transition matrix.
It's got a couple of funny ones here.
We wouldn't normally use these.
That's just to show that we've got extra ones.
That one is this number here, and that one is this number here.
So we've got a model.
It's got probabilities on all the transitions, and in the states there are multivariate Gaussian distributions,
okay, with the same dimension as the observations.
They're represented by a mean and a standard deviation.
There's our complete model. What we're going to do is
put a dummy start state on the beginning and a dummy end state on the end.
We'll see why that's very useful
later, when we try to connect models together to make models of sequences of words, or models of sequences of phonemes, and we're going to do the algorithm.
So here the animation starts. Put a token at the beginning, in this dummy start state, and the token says: my path is empty.
I've not done anything yet.
And my probability is... I'm the only token there ever was. The only thing possible at this point is me, and I've got a probability of one.
Okay, it's extremely simple.
Every time frame handle, we just turn this handle blindly on the handle.
Does this first thing we do is all the tokens make one step forwards wherever they are.
I have to go along on a lark.
So there's only one art for this guy to go along We'll go here and end up here now.
Time, time, counter.
This is the time.
But let's say there's this time minus one before things that really happened.
This is time zero have been computer scientists, right? I'm going to start from zero, not one one.
This is the first real time step we met.
Observation.
That's observation that Time t zero this art here.
Since it's the only arc leaving, let's go have a probability of one on it.
So multiply our tokens.
Probability by one it's still one we get here.
This state has got a Gaussian in it.
It's a multivariate Gaussian, because of the dimension of the observation vector.
We're given this observation, this thing which we're trying to recognise.
So we just look it up and read off the probability; remember how to do that.
Multiply the token's probability by the probability that that Gaussian generated this observation vector.
The token's probability is now less than one.
Good, and then we turn the handle again.
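
That first turn of the handle can be sketched in a few lines. This assumes a Gaussian described by one mean and one standard deviation per dimension, and it works with log probabilities (so multiplying becomes adding), which is an implementation choice and not something stated above. The function name `log_gaussian` and the numbers are made up for illustration. Strictly, what we read off the Gaussian is a density, not a probability.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of x under a Gaussian with a separate mean and
    standard deviation for each dimension (dimensions treated independently)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )


# The token starts in the dummy start state with probability one.
token_log_prob = math.log(1.0)

# Step along the only arc out of the start state (probability one).
token_log_prob += math.log(1.0)

# Arrive in the first emitting state and generate the observation at t = 0.
observation = [0.3, -1.2]               # hypothetical 2-dimensional frame
mean, std = [0.0, -1.0], [1.0, 0.5]     # hypothetical state parameters
token_log_prob += log_gaussian(observation, mean, std)
```

With these numbers the token's log probability ends up below zero, that is, its probability is now less than one.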
This token has to leave the state, so it can either go this way or go this way, and what we're going to do is explore all paths in parallel.
We're going to clone the token into two copies of itself and send one along one arc and one along the other.
Okay, the two possibilities are two parallel ways of doing things.
It's like here: as we leave a point in the grid we could go up, diagonally or across, so we just pass copies of ourselves.
One copy goes each of the three possible ways.
Here, a copy of the token goes each of the possible ways.
Here there are two.
And off we go along those arcs.
So one copy of the token goes this way.
It's going to end up back in this state.
Another copy of the token goes this way and ends up in the next state.
And now the time counter has ticked on: the time counter is one, after one turn of the handle, and we now generate observation vector o at t = 1.
Now we generate the same observation vector from this state and from this state; it's the same one, at the same time.
In general, the Gaussians in those two states will be different, and there will be different probabilities on these arcs, so these tokens now have a different probability from each other.
This one represents the probability that the model generated the first two observation vectors both from the first emitting state.
This token represents the probability that the first vector was emitted by the first emitting state and the second vector by the second emitting state.
Those are two different ways this model could generate the same observation sequence.
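
Cloning a token along every arc that leaves its state might look like the sketch below. The `Token` class, the `step_forward` function and the arc probabilities are hypothetical, chosen only to illustrate the idea; emission probabilities are left out here so that only the cloning step is shown.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    log_prob: float  # log probability of the partial path so far
    path: tuple      # states visited so far


def step_forward(token, state, arcs):
    """Send one copy of the token along every arc that leaves `state`.

    `arcs` maps (from_state, to_state) to a non-zero transition probability.
    Returns a dict: destination state -> cloned token.
    """
    return {
        dst: Token(token.log_prob + math.log(p), token.path + (dst,))
        for (src, dst), p in arcs.items()
        if src == state
    }


# Hypothetical arcs out of the first emitting state: self-loop and move on.
arcs = {(1, 1): 0.6, (1, 2): 0.4}
token = Token(log_prob=math.log(0.2), path=(1,))
clones = step_forward(token, 1, arcs)
```

After this step `clones[1]` carries the path (1, 1) and `clones[2]` carries the path (1, 2): two different explanations of the same two observations.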
What we do is just turn the handle over and over again.
So now we've got the tokens in these two states, and we'll generate that; unfortunately I used a different counting system there.
That's time two.
And we just generate our whole observation sequence.
Just keep doing that.
And every time two tokens meet (tokens are partial paths), one token comes in along here and another token comes in along here.
We compare them.
Who's the winner? We eliminate everyone except the winner.
There's no point sending the losing tokens on through the model.
It's just a waste of computation, because whatever one token can do, the other could do just as well.
The observation sequence is given; we've got that in advance, it's a given.
It's big O.
That's time, and that's given; this handle is the clock.
It's a little digital clock.
Every time we turn it around, we go forward in time one frame, one time step.
So all the tokens first try and generate this one.
Turn the handle, and then they all try and generate this one.
Eventually, we've generated the whole observation sequence.
So the clock of the algorithm, the handle going round, is this discrete time clock ticking through the frames.
But what the tokens are saying is that there are many different ways that this model could have got up to, say, frame 56, and all the tokens that are present at any one time are all those possibilities.
Okay, so in terms of token passing, tokens bump into each other.
It looks exactly the same as this: path A and path B.
A and B bump into each other.
We only need to keep the most probable one.
Eliminate whoever is not the most probable.
There might be more than two, if there are more than two ways of arriving in this state, and we just keep on doing that, and eventually, out of the end of the model pops a token.
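
The "keep only the winner" step is small enough to sketch directly. The function name `keep_winners` and the tuple layout are assumptions for illustration; the scores are made-up log probabilities.

```python
def keep_winners(arrivals):
    """Given every token arriving this frame as (state, log_prob, path),
    keep only the most probable token in each state."""
    best = {}
    for state, log_prob, path in arrivals:
        if state not in best or log_prob > best[state][0]:
            best[state] = (log_prob, path)
    return best


# Two tokens bump into each other in state 2; a third is alone in state 3.
arrivals = [
    (2, -7.5, (1, 1, 2)),  # path A
    (2, -4.0, (1, 2, 2)),  # path B, more probable, so B survives
    (3, -9.0, (1, 2, 3)),
]
survivors = keep_winners(arrivals)
```

It works the same way however many tokens arrive in a state: only the maximum is kept.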
Now, we turn the handle on this model, and the model's got three emitting states.
We'll start getting tokens popping out after frame three, but there are still more observations to generate.
So any token that pops out too soon is premature: it has failed to generate the whole observation sequence, so it just evaporates and doesn't count.
Keep turning the handle.
Tokens keep going through and popping out of the end.
Just as we've generated exactly all of the observation sequence, a token pops out of the end, and that's the winner.
It's successfully gone through the whole model and generated all of the observation sequence.
Tokens that are still back in state three, having generated all of the observation sequence, lose.
That's like popping out of the sides of these grids: popping out of this edge you lose, or popping out of the other edge you lose.
So either you spent too long early on in the model and never made it to the end, or you got to the end too soon.
Neither of those things is valid: you have to arrive in the top right-hand corner.
You have to get through the model and generate exactly all of the observation sequence.
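
Putting the pieces together, the whole procedure for a single model can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation used in any particular toolkit: the function names, the state numbering (0 = dummy start, N+1 = dummy end), per-dimension Gaussians and the use of log probabilities are all choices made here. Only arcs with non-zero probability are listed.

```python
import math


def log_gaussian(x, mean, std):
    """Log density under a Gaussian with per-dimension mean and std."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mean, std)
    )


def token_passing(observations, means, stds, arcs):
    """Return (log_prob, path) of the winning token, or None if no token
    generates exactly all of the observations and then reaches the end."""
    end = len(means) + 1
    tokens = {0: (0.0, ())}  # one token in the dummy start state, probability one
    for obs in observations:  # each turn of the handle is one frame
        arrivals = {}
        for state, (log_prob, path) in tokens.items():
            for (src, dst), p in arcs.items():
                if src != state or dst == end:
                    continue  # popping out before the last frame: discarded
                score = (log_prob + math.log(p)
                         + log_gaussian(obs, means[dst - 1], stds[dst - 1]))
                # When tokens meet in a state, keep only the most probable.
                if dst not in arrivals or score > arrivals[dst][0]:
                    arrivals[dst] = (score, path + (dst,))
        tokens = arrivals
    # Only tokens that can now step into the dummy end state are valid.
    finals = [
        (log_prob + math.log(arcs[(state, end)]), path)
        for state, (log_prob, path) in tokens.items()
        if (state, end) in arcs
    ]
    return max(finals) if finals else None


# Hypothetical two-state model over 1-dimensional observations.
means, stds = [[0.0], [5.0]], [[1.0], [1.0]]
arcs = {(0, 1): 1.0, (1, 1): 0.5, (1, 2): 0.5, (2, 2): 0.5, (2, 3): 0.5}
result = token_passing([[0.1], [0.2], [4.9], [5.1]], means, stds, arcs)
too_short = token_passing([[0.1]], means, stds, arcs)
```

Here `result[1]` is the state sequence of the winning token, and `too_short` is `None` because with one frame no token can get through both emitting states.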

Pruning

Even with the approximation to P(O|W) made by the Viterbi algorithm, recognition might be too slow. So, we must speed things up further.


Now, we can make this thing go arbitrarily fast.
So if someone tells you that their speech recogniser is really fast, that is not impressive.
It's impressive if it's still accurate when it's fast.
Think about what's happening in token passing: here, there are only two tokens alive.
What happens at the next stage? This one clones around here and here.
This one goes on here and here.
This one goes around here, and tokens are arriving here eventually.
If there's a big model and a long enough observation sequence, there might be tens or hundreds or thousands of tokens alive in this big network of hidden Markov models, and some of them are going to be really unlikely.
Imagine we've got a really long observation sequence, hundreds of frames long.
One of the tokens just went here.
It just went round and round and round here, trying to generate everything from this state, on and on and on.
Okay, so its probability is going to be pretty low.
It's pretty unlikely that it then suddenly makes a dash for the end and wins.
That's like saying that all of this sounds like the beginning of the word.
Okay, so we're saying the word "one" with the first sound going on all the way, and then the end very quickly.
That's really unlikely.
And so we can make things go fast by doing additional discarding of tokens.
There are two distinct things happening: when two tokens meet, we can eliminate all but the winner, and that's perfect.
It's not an approximation.
There's absolutely no point keeping the second best.
They will never win.
That's Viterbi.
We will always do that.
We could go further, even when we've got a winner.
So imagine this one competed with all the tokens arriving there, and it was the winner, but it still looks really unlikely.
By some measure, we could just eliminate it outright.
Okay, and that's called pruning.
Think of gardening and plants.
It's to do with cutting branches off.
It's like all these paths going through here.
When we look at some of them, we say, "OK, locally you're the best that ever arrived here, but it looks pretty unlikely that you're going to win", and we prune it.
By pruning those paths, those unlikely paths, we can reduce the number of tokens and therefore reduce the amount of computation.
We can make everything go faster.
And we need less memory, because the tokens have to live in memory.
Okay, so Viterbi is not really pruning.
Think of pruning as an additional thing on top of Viterbi.
It involves discarding tokens that were even locally the best.
But they're not as good as the tokens somewhere else in the network at that time.
Okay, that's something you can explore in the practical: you could do pruning.
You could make a system go faster.
Eventually, you'll make the accuracy worse, because sometimes, accidentally, you'll prune the thing that would have won.
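
One way to decide which tokens look "really unlikely" is to compare every token with the best token alive at that frame, and discard any that fall too far behind. This is usually called beam pruning. The lecture only says "by some measure", so the particular measure used here, the function name `beam_prune` and the numbers are assumptions for illustration.

```python
def beam_prune(tokens, beam_width):
    """Discard tokens whose log probability is more than `beam_width`
    below the best token alive in this frame.

    `tokens` maps state -> (log_prob, path), after Viterbi has already
    kept only the local winner in each state.
    """
    if not tokens:
        return {}
    best = max(log_prob for log_prob, _ in tokens.values())
    return {
        state: token
        for state, token in tokens.items()
        if token[0] >= best - beam_width
    }


# Hypothetical tokens alive at one frame, each the local winner in its state.
alive = {
    1: (-10.0, (1, 1, 1)),
    2: (-12.0, (1, 1, 2)),
    3: (-50.0, (1, 2, 3)),  # locally the best in state 3, but far behind overall
}
pruned = beam_prune(alive, beam_width=5.0)
```

A narrow beam keeps fewer tokens and runs faster, but risks discarding the token that would eventually have won; a very wide beam prunes nothing.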