Training – Viterbi training

Iteratively re-aligning the model with the data and updating the model parameters based on the single best alignment.

Now we're going to improve this model.
I'm just going to state that there isn't a single-step solution: there's no equation that immediately gets us to the true model.
All we can do is take a model that we've already got and try to make it a little bit better.
So in other words, we're going to use an iterative method.
So one analogy that might or might not be helpful: what if we want an algorithm that's just trying to find the highest mountain in Britain? We don't have a helicopter and we don't have a map.
We can't look at a picture of Britain and just pick it out; we don't have that oracle knowledge.
All we have is the local surroundings where we are, so a simple algorithm would be just to keep walking uphill.
Eventually we get to the top of a hill.
It's not guaranteed to be the biggest hill, but it will be a local maximum.
If you take very small steps, you might take a long time to get to the top of the hill, but you'll find the top of the hill very precisely, and it might be just the local hill.
So if we set off from here and start walking uphill, we're just going to end up wherever we end up: Arthur's Seat, perhaps, if we start around here.
But that's certainly not the biggest mountain.
We might therefore try to think of a better way of finding the biggest hill.
One way of doing that: we take giant steps all over Britain, always uphill, but very large, very crude steps.
Pretty quickly we'd find there's a bigger hill, but we wouldn't be very good at finding its exact top, because we'd keep going past it and down the other side.
Zigzagging gets us roughly to a place where there are bigger hills, but it won't find us exactly the top of one of them.
We might then switch back to our slow algorithm, taking small steps again, to home in and get to the very top.
So often in machine learning we'll do algorithms like this: we'll have a fast and dirty algorithm that gets us to the right region quickly, and then switch to a slow algorithm that fine-tunes and gets us to the top of that region.
But none of these algorithms makes any guarantees.
When you do get to the top of a hill and you converge, meaning there are no steps you could take that go up, there's no promise that that really is the biggest mountain.
There might be another one we never explored, that we never got to.
All we can say is that it's bigger than anything in the immediate vicinity.
So these iterative methods have the potential of finding you a solution that's not the globally optimal solution.
And you'll never know whether it is the global optimum, because you don't know what that solution is.
You can just empirically compare one solution with another and see if it's better; you never know more than that.
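As a toy illustration of that coarse-to-fine idea (this is purely illustrative and not part of HMM training; the function, starting point, and step sizes here are all invented for the sketch):

```python
import numpy as np

def hill_climb(f, x, step):
    """Greedily move uphill in 1-D until neither neighbouring step improves f."""
    while True:
        best = max((x - step, x + step), key=f)
        if f(best) <= f(x):
            return x  # converged: a local maximum at this step size
        x = best

# Two hills: a small one near x=0 and a bigger, broader one near x=8.
f = lambda x: np.exp(-x ** 2) + 2 * np.exp(-0.05 * (x - 8) ** 2)

print(round(hill_climb(f, x=-1.0, step=0.01), 2))  # small steps only: stuck near 0
x = hill_climb(f, x=-1.0, step=2.0)                # big, crude steps: right region
print(round(hill_climb(f, x, step=0.01), 2))       # then small steps: top, near 8
```

Small steps alone converge very precisely, but only onto the nearest hill; the crude pass first gets us into the right region, and still with no guarantee it contains the global maximum.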
So the true HMM has parameters which maximise the likelihood of the training data.
We don't know what they are.
We can't know what they are.
All we can do is find our best local guess at that. So far we have just this very crude algorithm, where we linearly (uniformly) segmented the data and assigned it to the model.
That's clearly not very good.
Can we do better than that? Of course we can.
We already know how to do one thing that's better than that: it's what we do during recognition.
We find the single most likely state sequence that generates the data.
Okay, that's the Viterbi algorithm.
Now, to do that, the model has to have some parameters.
We have to start with a model that has parameters.
With this method, the states might have no Gaussians at the beginning; it doesn't matter, because they're not involved in making the uniform segmentation.
So this works for a blank, empty model: it immediately gives the model some parameter values.
They're not very good, but they're our first guess.
Everything after that needs a model to start with.
We're going to have to do this first, just to get a guess at the parameters of the model, or make some other guess: randomise the parameters of the model, set them to the global mean and variance, or some other guess.
But there have to be some parameters.
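As a rough sketch of that first guess, uniform segmentation might look like this in Python, assuming one diagonal-covariance Gaussian per emitting state (the function name and array layout are illustrative, not HTK's actual implementation):

```python
import numpy as np

def uniform_segment_init(observations, n_states):
    """First guess at the model: divide the T frames equally among the
    emitting states, then set each state's mean and variance from the
    frames assigned to it."""
    T = len(observations)                  # observations: (T, dim) array
    bounds = np.linspace(0, T, n_states + 1).astype(int)
    means, variances = [], []
    for i in range(n_states):
        segment = observations[bounds[i]:bounds[i + 1]]
        means.append(segment.mean(axis=0))      # mean of this state's frames
        variances.append(segment.var(axis=0))   # diagonal variance
    return np.array(means), np.array(variances)
```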
Once we've got parameters, we can use the Viterbi algorithm, actually implemented as token passing, to find a better alignment between observations and states.
When we do that, it will still be a hard alignment: each observation will belong to exactly one state, so it will be a forced, hard alignment, and then we can just do this sort of thing again.
For the observations associated with a state, we just take their mean and update the parameters of that state.
This is quick and dirty and doesn't give a very good model, but at least it has parameters.
Given that crude model, we'll realign it with the data.
So, for example, we might perform token passing.
Now the model that we got from the crude alignment aligns itself with the data in this way: this is the single most likely state sequence that generates this observation sequence.
We start here, as always.
We go here and emit an observation, then we go on to the next state and emit this observation, go around the self-transition and emit this observation, go around again and emit this observation, then move on and emit this observation, and go around once more and emit this one.
So the state sequence is going to go, let's use HTK numbering: 2, 3, 3, 3, 4, 4.
So this one belongs to state 2, these ones belong to state 3, and these ones belong to state 4.
This is the single most likely way that this model could have generated the observations, and now we'll update the model parameters on that basis.
Remember, the mean we currently have in this state was actually the mean of these two observations from the first step.
We now say that it's actually more likely that this state also generated this other observation here: we steal it from the state next door and shuffle the alignment around.
Now we've got a slightly better alignment than the uniform one, and then we're just going to take all of these guys, add them up, divide by three, and that gives us this mean.
Then all of these guys over here; this state only got one observation, which won't work very well, but in general the sequences will be longer.
We take their mean and variance and update the mean here, and update the mean here.
We update the model parameters, and now they're slightly better than they were before.
So take those, take the mean and variance (or standard deviation), and update the model parameters.
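A minimal sketch of that update step, assuming a single Gaussian per state and the hard alignment written as a state sequence like the 2, 3, 3, 3, 4, 4 example above (the function is my own illustration, not library code):

```python
import numpy as np
from collections import defaultdict

def update_parameters(observations, state_sequence):
    """Re-estimate each state's mean and variance from the frames that the
    hard alignment assigned to it, e.g. state_sequence = [2, 3, 3, 3, 4, 4]."""
    frames = defaultdict(list)
    for obs, state in zip(observations, state_sequence):
        frames[state].append(obs)               # group frames by state
    means, variances = {}, {}
    for state, obs_list in frames.items():
        arr = np.array(obs_list)
        means[state] = arr.mean(axis=0)         # e.g. state 3's mean over 3 frames
        variances[state] = arr.var(axis=0)      # zero if only one frame: see below
    return means, variances
```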
Now the model's changed, and so with these new model parameters, maybe this is no longer the most likely alignment between observations and model.
You can see what's happening here.
We have a model with not quite the right parameters, from which we can find an alignment.
We're then going to change the model parameters, so we need to find the alignment again.
The alignment might change, so we need to change the model parameters again.
We're going to go around that loop until it converges: for example, until the alignment stops moving about, or more generally until the likelihood of the data stops increasing.
So the model's slightly different; we update the model parameters, realign the data again, and things shuffle around.
Now what's happened is that this state here is still good at generating these two, but this state has stolen this one and is now taking these.
Okay, so we've decided that, with this model, the best way of modelling the data is for the first state to have quite a short duration, the second state to have a longer duration and generate these two, and this one to generate these three.
Are we happy so far? Any questions? So we can see that every time we go around this algorithm, we change the alignment and therefore we need to update the model parameters, and because we've updated the model parameters, that might change the alignment.
So we're just going to iterate backwards and forwards between those two things until we can't do any better.
We'll just measure the likelihood of the training data: as the tokens go round and the winning token pops out at the end, we'll look at its likelihood and remember it.
Next time we go around, hopefully the likelihood is better.
We'll make a little plot of that, and when it stops getting better, we'll stop.
Or maybe we'll just give up after a fixed number of iterations.
So we keep doing this, going around, updating the parameters again and again.
We go around until we converge in terms of likelihood.
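Putting the pieces together, the whole loop might be sketched as below. Here viterbi_align() stands in for the token passing pass and is assumed to return the best state sequence plus its log likelihood; it, and the model's set_parameters() method, are hypothetical placeholders rather than real library calls:

```python
def viterbi_train(observations, model, max_iters=20, tol=1e-4):
    """Alternate alignment and parameter update until the likelihood of
    the training data stops increasing (or we give up)."""
    prev_loglik = float("-inf")
    for _ in range(max_iters):
        # Align: single best state sequence under the current parameters.
        state_sequence, loglik = viterbi_align(observations, model)
        # Update: re-estimate means and variances from that hard alignment.
        means, variances = update_parameters(observations, state_sequence)
        model.set_parameters(means, variances)
        if loglik - prev_loglik < tol:  # converged in terms of likelihood
            break
        prev_loglik = loglik
    return model
```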
Now, let's just fix a few problems with that.
This state here just generates a single observation, so computing its mean and variance is a problem.
The mean is just equal to the observation, and the variance is zero.
That's no good.
So in reality, we can't reliably train a model on a single training example, because we might get just a single frame associated with a state.
You could try that as an experiment: see if you can train a model on a single observation sequence and whether things go wrong. But in general we don't have just one training example for each model.
We have many, so we'll do this.
This is for the first recording of, let's say, the word "eight": we'll find this alignment and remember it.
Then we'll pop in our second recording of the word "eight".
Maybe this one's a bit longer; it's got a few extra frames.
This is our second recording: find that alignment and remember it.
And then, when we update the state parameters, we'll just pool the observations across the different recordings; we pool them all together.
So, across the multiple examples in the training set, we find all the observations that were associated with state 2: there'll be at least one from each of the recordings, and possibly a sequence.
We add them all together and divide by how many there are.
So this generalises trivially to multiple training examples: you just do the alignment separately for each one, pool everything together, and then do your state updates.
That implies you need to pass through all the training data once, then update the model parameters, and go around this loop.
So that's looking okay.
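A sketch of that generalisation: one alignment per recording, then a single pooled update, using the same assumed viterbi_align() placeholder as before:

```python
import numpy as np
from collections import defaultdict

def pooled_update(recordings, model):
    """One pass through all the training examples: one alignment per
    recording, one pooled parameter update at the end."""
    frames = defaultdict(list)
    for observations in recordings:
        state_sequence, _ = viterbi_align(observations, model)
        for obs, state in zip(observations, state_sequence):
            frames[state].append(obs)      # pool frames across recordings
    means, variances = {}, {}
    for state, obs_list in frames.items():
        arr = np.array(obs_list)           # several frames per state now, so
        means[state] = arr.mean(axis=0)    # the variance is no longer zero
        variances[state] = arr.var(axis=0)
    return means, variances
```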
That will give us a reasonable model.
The uniform segmentation is instant, but it doesn't give a great model.
Viterbi training is going to be fast, because the Viterbi algorithm is extremely efficient, and it gives us quite a lot better model.
In HTK, both of those are folded together into one tool called HInit: it initialises the model, which is what the name means, and it just does these two things.
So HInit will print out iterations, and those iterations are iterations of this Viterbi-style training.
It will produce a trained model and save it to your hmm0 directory.
You could do recognition with that model.
It will work.
It might not be as good as the model we're about to make, but you could compare that.
So this is a roughly trained model, but it's fast.
