Training – Baum-Welch (no maths)

By summing over all possible state sequences we marginalise away the state sequence: we don't care what value it takes.

Right, but we need to do the right thing.
The right thing is not to take the single most likely state sequence and just pool the data from that.
It's to consider all of the other state sequences too, to allow them to contribute to the estimates of the state parameters: the means and the variances.
Okay, we're going to do this kind of longhand, and then we're going to quickly see that there's a much more efficient way of implementing it, called the Baum-Welch algorithm.
So, let's say this is the single most likely state sequence for a three-state model generating six observations.
Let's just write the states out, so we're really clear about it, and we'll use HTK numbering.
The state sequence is 2 3 3 4 4 4.
That's your number-one best state sequence.
If you're going to pick one, this is the best.
But there are lots and lots of other state sequences.
For example, we could have gone 2 2 3 4 4 4, and so on, and so on: many, many other state sequences.
All of them are a little bit less likely than this one, but maybe not that much less likely. We should be taking those into account, because we should be summing over all possible state sequences.
That's the definition of a hidden Markov model.
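In the usual textbook notation (nothing in this lecture derives it; this is just the standard HMM formulation), that definition is

$$ P(O \mid \lambda) \;=\; \sum_{Q} P(O, Q \mid \lambda) $$

where $O$ is the observation sequence, $\lambda$ is the model, and the sum runs over every possible state sequence $Q$.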
So we're going to have to find some algorithm that does that.
Great.
So what we're going to do is consider every possible state sequence that the model could have used inside this black box: all of them, simultaneously.
They all exist at once; up to this moment we've just picked the most likely one and used that, but now we're going to consider all of them.
We'll weight those by their associated probabilities.
The most likely one is going to make a strong contribution to the model parameters, but it's going to combine with all of the other ones as well.
Lots of different state sequences: each of them has an associated probability.
We happen to know that this one was the highest, but there are other ones we want to take into account as well, so we need to sum over all of them.
Okay, so we're going to introduce an algorithm.
We're not going to derive it formally; we're just going to think about the concepts of it and introduce terminology that you'll see in textbooks.
The general idea of this next algorithm is a bit like Viterbi training: it has two parts to it.
In Viterbi training, there's one part which is finding the alignment, and a second part which is updating the model parameters given that alignment; and that was just the single most likely alignment.
What we're going to do now is consider all possible alignments, and then, given those, we're going to update the model parameters.
There are still two parts to it, and all it will do is slightly improve the model.
The only guarantee we have is that the model won't get any worse on the training data.
We have no promises about test data: that's up to us to engineer.
So, still two parts to it; the first is finding the alignment between states and observations.
What we're going to do is actually average over all possible state alignments, weighted by their individual probabilities.
The fancy word for this averaging in probabilities is expectation: that's the expectation part. Then, given these alignments, all these different alignments, we're going to turn the knobs on all the model parameters to maximise the likelihood of the training data.
That's the maximisation step, and it's just a simple equation that says the mean of a state equals the weighted sum of the observations that we've aligned with it.
The general algorithm is of a type called expectation maximisation, and it's so common that it's often just called EM.
If you're taking other courses in machine learning or NLP, or the speech recognition course, you'll see that there are lots of different ways this concept can be applied: it applies whenever we don't have a single-step solution, and all we can do is take a step towards a better solution.
So there's some averaging part, and then there's some parameter-updating part: expectation maximisation.
When it's applied to hidden Markov models, it gets a more specific name: the Baum-Welch algorithm.
Okay, it's named after two people: Baum and Welch.
Okay, let's rewrite Viterbi training in what seems like a slightly funny way, and then you'll see why we did that.
Think about this alignment.
The mean here equals the sum of this thing plus this thing, divided by two.
Let's rewrite that in a different way: as the sum of everything, weighted by some weights, where the weights are the probability of aligning that observation with the state.
It's going to be exactly the same thing here: zero times this thing, because it never aligned, plus one times this, plus one times this, plus zero times this, zero times this, and zero times this.
So we'll take those things and sum them up, weighted by hard numbers: zeros and ones.
And then we'll divide by the sum of the weights; that will be one over two here.
You can see that it's exactly the same thing.
It's just a more general form.
So the mean of the state is the sum of all the observations, each weighted by something, where the weights are the probability of aligning that state with that observation.
That general form is kind of nice: it would be a good way to implement it, and Viterbi gives us the weights.
It says the weight for this is zero (it didn't align), the weight for this is one, and so on.
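To make that concrete, here's a minimal sketch of the rewrite being described, with made-up scalar numbers standing in for the observations (in reality they are vectors, but scalars show the idea):

    # Viterbi's hard 0/1 weights: the mean of a state as a weighted sum.
    # Weight 1 where Viterbi aligned this state with the observation,
    # weight 0 everywhere else. All numbers here are made up.
    observations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    weights = [0, 1, 1, 0, 0, 0]

    mean = sum(w * o for w, o in zip(weights, observations)) / sum(weights)
    # identical to (observations[1] + observations[2]) / 2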
Let's write it down in this general form.
The mean of a particular state is a sum over all the observations (this sum is now over all the observations, the entire training data), where each observation gets weighted by some value: some probability, some weight.
And then we need to normalise, so that things behave like probabilities that sum correctly, and that normalising factor is just the sum of the weights.
In that example it was one over two: the normalising factor was two.
What we're going to do now is let these weights go soft: instead of hard weights of zero or one, they're going to be soft weights.
Okay, so what is this P? It's the probability that a particular state generated a particular observation.
With Viterbi, that's binary: either it absolutely did (probability one) or it absolutely didn't (probability zero).
But if we look across all the different state sequences and take these averages, sometimes that observation will be aligned with state two, and sometimes it will be aligned with another state.
So those weights now become values between zero and one, not exactly zero or exactly one.
This probability is going to be something like the probability of being in a state at a particular time, which is exactly the same thing as saying that state generated that observation.
So this probability here is going to be a value that's not zero or one: it's going to be our belief, the probability that, when this HMM generates the observation sequence, this observation (observation six in the sequence) came from state three.
In some state sequences it will have done, and in some it won't.
So what probability comes out of that? Okay, with this EM algorithm, which becomes Baum-Welch, let's just see if we can work through very quickly how that might work.
So let's write all the state sequences down.
We've got 2 2 2 2 3 4, then 2 2 2 3 3 4, all the way down here, until eventually we get to this one, 2 3 4 4 4 4.
We've got six things in the observation sequence, and we've got three emitting states, called 2, 3 and 4, and these are all the state sequences we could fill in.
We fill them all in, and one of them was the most probable one.
I already forgot which one it was... it was 2 3 3 4 4 4. So somewhere in here there's the 2 3 3 4 4 4 one: this one here.
This is the one Viterbi finds for us: it finds the single most likely one.
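As an aside, here's a minimal sketch of listing those sequences, assuming a strict left-to-right topology with no skips, in which every one of the three emitting states (HTK numbering: 2, 3, 4) is visited at least once:

    from itertools import combinations

    def all_state_sequences(states=(2, 3, 4), T=6):
        # A sequence is fixed by the two time steps at which we advance
        # to the next state; enumerate every such pair of change points.
        for a, b in combinations(range(1, T), 2):
            yield (states[0],) * a + (states[1],) * (b - a) + (states[2],) * (T - b)

    for seq in all_state_sequences():
        print(seq)   # ten sequences; (2, 3, 3, 4, 4, 4) is among them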
And then from that we get the weights.
The weights say that, when we compute the new mean for state two, it just takes the first observation weighted by one, plus all the other observations weighted by zero.
When we compute the new mean for state three, it takes the first observation weighted by zero, plus the second one weighted by one, plus the third one weighted by one, plus the others weighted by zero.
That's what this alignment tells us, and the same goes for the probability of that state sequence, which I'm going to write out as if it was a natural probability (really it's a probability density).
Let's just say that 0.36 is the probability of that one.
All the others have probabilities too, and they're all lower than that: maybe this one's a bit less likely, and this one's a bit less likely still, and so on for all the different probabilities of these particular state sequences.
So the Baum-Welch algorithm is going to do something quite simple.
When we update the mean of state two, instead of only looking here and saying state two takes the first observation with weight one and the rest with weight zero, it says: okay, we'll take the first observation (that's this first slot here) with this weight, plus the first observation with this weight, plus the second observation with this weight, plus all of these with this weight, plus all of these with this weight.
So we write out that the mean of state two equals a sum over all of the observations, all six of them, and the weights are these weights: one for every time that state two had the possibility of generating that observation.
Okay, it's a slightly complex idea, but think of it in comparison to Viterbi: it's just a soft version of Viterbi.
It considers all possible state sequences and then just does a weighted sum of the observations.
Okay, so it's a pretty tricky concept, and that brings us to the end of the lecture in a few minutes.
So we've gone from uniform segmentation, which we knew was wrong, but because we didn't have any parameters in the model there was actually nothing else we could do, other than randomly initialise them, or just initialise them all to zeros and ones.
That gave us a model that's not very good.
We could use that model to find an alignment of the data, use that to update the model, and the model would get a little bit better.
We go around that loop until the model stops getting better, and then we flip into the real thing, and the real thing is to do this.
So, this list of all possible state sequences could be quite long, which implies this might be computationally more expensive.
To do this, for every possible state sequence we'll compute its probability (its likelihood), and we'll use that as the weight when we do the weighted sum of the observations.
Viterbi does a weighted sum in which the weights are zeros and ones; Baum-Welch does a weighted sum in which the weights are soft.
They're numbers somewhere between zero and one, effectively.
Okay, wait: if you think about things going backwards and forwards, that's a little bit of a misleading way of thinking about algorithms like the Viterbi algorithm.
The Viterbi algorithm appears to operate in a left-to-right fashion, but it could equally go right to left, or it could start in the middle and go outwards; we'd just find the same answer.
All it guarantees to find is the single most likely state sequence, so the direction doesn't matter.
You could reverse the speech and reverse the model, and you would get the same answer.
And will it keep improving? Well, it will, because we'll update the model parameters, and that will change the alignments.
That's the iteration: the model parameters change, then the alignment changes.
The same thing is going to happen in Baum-Welch.
So we'll take the weighted sum.
Let's just compute the mean of state two. Here we go: it's 0.21 times observation one, plus 0.36 times observation one, plus 0.12 times observation one plus 0.12 times observation two, plus... the weights are getting very small now, so they're making small contributions... plus 0.04 times observation one plus 0.04 times observation two, and so on.
So it's just these weights and these observations that get summed up.
These very unlikely state sequences up here have a very low weight, so those alignments make a very small contribution to the new value of the mean of the state.
The one that makes the biggest contribution, obviously, is the single most likely one, with its big 0.36 weight. Then we just do the same, updating the other states.
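Here's a toy numerical sketch of that update. The 0.36 weight for the best path comes from the example above; the other sequence probabilities, and the observation values, are my own made-up illustrations:

    # Baum-Welch style mean update for state 2 over a few state sequences.
    observations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up 1-D observations

    # (state sequence, probability): only a few of the sequences are shown.
    sequences = [
        ((2, 3, 3, 4, 4, 4), 0.36),   # the single most likely sequence
        ((2, 2, 3, 4, 4, 4), 0.21),   # illustrative smaller probabilities
        ((2, 2, 3, 3, 4, 4), 0.12),
        ((2, 2, 2, 3, 4, 4), 0.04),
    ]

    state = 2
    numerator = 0.0     # weighted sum of observations aligned with state 2
    denominator = 0.0   # sum of the weights: the normalising factor
    for seq, prob in sequences:
        for s, o in zip(seq, observations):
            if s == state:
                numerator += prob * o
                denominator += prob

    new_mean = numerator / denominator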
So what happens? We've got some model from somewhere, perhaps from Viterbi training or random initialisation, anything we want.
We'll find all possible state alignments between that model and our observations; in general there will be a lot of them.
Each of them will have a probability, and we'll use that as a weight.
Then we'll pop that weight into this formula: the new mean of a particular state is just the sum over all the observations in the entire data set, each weighted by the probability, across all of these different ways of aligning the model with the observations, that that state generated that observation. In some of the alignments it did, with some weight, and in some it didn't: weight zero.
We're just going to give this weight a rather impressionistic name: the state occupancy, the probability of being in that state at that time.
That's what the alignment means.
And then there's some normalising factor to make sure everything is properly normalised, and that normalising factor is just the sum of the weights.
Okay, so we don't want the maths for that for this course. But if you prefer to understand it through the maths, do that.
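For those who do, the update being described takes the standard textbook form, writing $\gamma_t(j)$ for the state occupancy (the probability of being in state $j$ at time $t$):

$$ \hat{\mu}_j \;=\; \frac{\sum_{t=1}^{T} \gamma_t(j)\, o_t}{\sum_{t=1}^{T} \gamma_t(j)} $$

The numerator is the weighted sum of the observations, and the denominator is the sum of the weights, exactly as in the longhand example.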
Let's just finish off by saying a few more things about this.
So, it's called the Baum-Welch algorithm; please note the spelling: 'Welch', not 'Welsh'.
The Baum-Welch algorithm is just a form of expectation maximisation, specialised to hidden Markov models, and in the little manual example we worked through, I explicitly listed all possible state sequences.
It was a bit like what we did in the early days, before we discovered dynamic programming, when we said: well, let's just enumerate all of them, calculate them all separately, and then pick the most likely one.
And then we realised that they've got an awful lot of things in common.
Lots of them start by staying in this state for three frames, so they can all share that computation.
That's what dynamic programming tells us: all these different paths have got all sorts of subsequences in common, and that bit of computation can be done once and shared amongst all the paths that share that state subsequence.
That's easiest to think about for common prefixes, but it actually works for any common subsequence anywhere: it's not just the prefixes, it's all the bits in the middle too.
Without going into the details, there are diagrams in Jurafsky & Martin and other books that try to explain this and, I think, don't quite succeed: it's a bit complicated.
We can see, impressionistically, that for Baum-Welch all of those state sequences I just listed out have also got lots of things in common, so we can share the computation between all of these state sequences too.
We don't need to explicitly write them all out one by one and compute them separately: compute their probabilities and their state occupancies and do the weighted sum.
We can compute it all in one go, with something like dynamic programming.
But instead of choosing the max at each time, we do a sum at each time, and we compute it on some sort of matrix or lattice, a bit like dynamic programming.
We don't need to understand that for this course; we can just say that this is actually also quite efficient.
It's a bit more computation than picking the single most likely path with max and throwing everything else away (here we need to sum, so we need to visit the whole grid), but it can still be done much more efficiently than the dumb way of just writing them all out longhand, because we share the computation for all possible subsequences.
There are diagrams that try to explain that, but they're beyond the scope of this course.
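Here's a minimal sketch of that 'sum instead of max' idea: the forward recursion over a lattice, written with discrete emissions for simplicity (the lecture's models are Gaussian). The quantities init, trans and emit are placeholder arrays, not anything from a real toolkit:

    import numpy as np

    def total_likelihood(obs, init, trans, emit):
        # Sums over ALL state sequences in one left-to-right pass.
        # Swapping np.sum for np.max would give the Viterbi score instead.
        T, N = len(obs), len(init)
        alpha = np.zeros((T, N))
        alpha[0] = init * emit[:, obs[0]]          # first observation
        for t in range(1, T):
            for j in range(N):
                # sum over all predecessor states, not just the best one
                alpha[t, j] = np.sum(alpha[t - 1] * trans[:, j]) * emit[j, obs[t]]
        return alpha[-1].sum()                     # P(observations | model)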
Okay, so when do we stop training? Well, we stop when the likelihood of the training data stops increasing, and you can see that coming out of HTK.
The Baum-Welch algorithm in HTK is computed by a program called HERest, where 'Rest' means re-estimation: in other words, it implies that the model needs to already have some parameters, and it will just update them.
All HERest promises to do is not make the model any worse and, hopefully, to make it better, where 'better' means that it has a higher probability of generating the training data.
We never see the test data, so there are no promises about the test data, no guarantees about that at all.
That's your job as an engineer: to engineer the training data so that you predict the model will be good at generating the test data.
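As a sketch of that stopping rule, where initial_model, reestimate() and log_likelihood() are hypothetical placeholders standing in for one pass of a tool like HERest:

    model = initial_model   # e.g. from Viterbi training (also a placeholder)
    prev = float("-inf")
    while True:
        model = reestimate(model, training_data)     # one Baum-Welch pass
        ll = log_likelihood(model, training_data)    # guaranteed not to decrease
        if ll - prev < 1e-4:                         # stopped increasing: done
            break
        prev = ll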
