HMM training with the Baum-Welch algorithm

HMM training using the Baum-Welch algorithm. This gives a very high-level overview of forward and backward probability calculation for HMMs, and of Expectation-Maximisation as a way to optimise model parameters. The maths is in the readings (but not examinable).

We're going to get to the Baum-Welch algorithm via two much simpler training methods. And if you only ever understand these simple training methods, that's fine, because the second of them, Viterbi training, works pretty well. Ideally, for the course, I want you to get at least a conceptual understanding of how to get from Viterbi training up to the full Baum-Welch algorithm, following the maths from the readings if you can. But if the maths is beyond you, at least try to get the concept that we're moving from one best path to all paths: from one best state sequence to all state sequences.

These two simple methods are used, and in fact both of them run one after the other in the assignment, in the tool called HInit. Now, you won't be able to tell in the assignment, because the data set is so small and the models are very small, but HInit is very fast. These very fast algorithms, which aren't quite perfect, are used to initialise the models, and then Baum-Welch is used as a follow-on to that: we run Baum-Welch in HRest, where re-estimation runs the Baum-Welch algorithm. We run them in that order.
These simple methods make a hard alignment between observations and states. In other words, any one observation in our training data is aligned with precisely one state in the model, and it contributes only to updating the Gaussian of that state when we update the model parameters. One way to describe the Baum-Welch algorithm, in a slightly vague way, is that it makes a soft or probabilistic alignment between observations and states. In other words, an observation in the training data could potentially have been emitted by one of quite a number of states, potentially any state in the model. It can therefore contribute to updating the parameters of all of those states to varying degrees, depending on how likely it was that it was emitted by each state.
Let's start with the simple methods: two simple training methods. The first one is called uniform segmentation. Here's a sequence of six observations. We're going to consider training a model on just one observation sequence for now, and we'll worry about multiple sequences in a moment. We've got six observations and three emitting states. The obvious, rather naive thing to do is just to divide them up evenly: take the mean of these two observations and write that mean here, take their variance, and that's this state's variance. These ones are used to update this state, and these ones are used to update this one. Done. There's no point repeating that, because you'd get the same answer every time. That's called uniform segmentation.
That's pretty easy to understand on this finite state picture. But let's do it on the lattice, because that will help us get to the Baum-Welch algorithm, which would be just too hard to draw on the finite state model. What does uniform segmentation look like there? We've got a sequence of 13 observations and five states. It doesn't divide up exactly uniformly, but there are going to be about three observations per state. So we take the path through the lattice which is the closest to diagonal that we can. We have to do some rounding, so maybe we give the first three observations to this state, then three, two, three and two to the others. That's just like saying that these three observations are going to contribute to this Gaussian. So that's uniform segmentation on the lattice: just take the most diagonal path that you can, given that you have an integer number of observations and an integer number of states. There's only one such path, so you just take that one and update the model.
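To make that concrete, here's a minimal sketch of uniform segmentation in Python. It's only an illustration, not what HInit actually does, and it assumes single diagonal-covariance Gaussian states in a left-to-right model: chop the observation sequence into roughly equal contiguous chunks, one per state, and set each state's mean and variance from its own chunk.

```python
import numpy as np

def uniform_segmentation(observations, num_states):
    """Chop the frames into roughly equal contiguous chunks, one per state,
    and set each state's mean and variance from its own chunk."""
    T = len(observations)
    boundaries = np.linspace(0, T, num_states + 1).round().astype(int)
    means, variances = [], []
    for j in range(num_states):
        chunk = observations[boundaries[j]:boundaries[j + 1]]
        means.append(chunk.mean(axis=0))
        variances.append(chunk.var(axis=0))
    return np.array(means), np.array(variances)

# 13 frames of made-up 2-dimensional observations and 5 emitting states,
# as in the lattice example: roughly three frames end up in each state.
obs = np.random.randn(13, 2)
means, variances = uniform_segmentation(obs, num_states=5)
```

With 13 observations and 5 states, the rounding gives chunks of two or three frames each, which is exactly the most-diagonal path through the lattice described above.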
Let's go back to the finite state model view for a minute. We can always find a better alignment between the states and these observations than the uniform one, and that's by doing what we do in recognition and using the Viterbi algorithm. Now, the Viterbi algorithm is only going to work if the model already has parameters. Uniform segmentation can work even when the model has no parameters, because the model parameters have no influence on the alignment: it's just the number of states and the number of observations that matter. So we use uniform segmentation to give the model some parameters. Given those parameters, we run the Viterbi algorithm to find an alignment between states and observations, possibly this one; in general, this will be different from the uniform segmentation. We use this alignment to update the model's parameters. The model's parameters have now changed, and therefore if we run the Viterbi algorithm again we may make a different alignment. So maybe next time we'll get this alignment, and we'll update the model's parameters again. We repeat that until the model stops getting better. We could think of that as repeating until the alignment settles down to something stable, but in fact what we measure is the probability of the observations given the model, and when that stops getting better we terminate the algorithm. So this is iterative: given the model parameters we find an alignment; we use that to update the model; the model has changed, so we find the next alignment, which should be better because the model has got better; and we repeat that until it stops getting better. What does that look like on the lattice?
Well, now, rather than just finding the most diagonal path through the lattice, we actually do dynamic programming. So off we go and do dynamic programming, and maybe it finds a path through the lattice like this. We use that alignment to assign these observations to this state, these observations to this state, all of these observations to this state, and so on. Then we just repeat that until the model stops getting better.
So those are two very simple methods of training the model. Uniform segmentation is not iterative, because you'd get the same answer every time, so there's no point running it twice. It's fast and very approximate, but your model gets some parameters that are better than nothing: better than, for example, giving all the states the same uniform parameters. Now you've got an approximate model, and you then use that as the starting point for Viterbi training. You do a few iterations of Viterbi training, realigning the model with the data, until the model stops getting better. And then you've got a trained model, trained with the Viterbi algorithm.
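Here's a minimal sketch of Viterbi training under some simplifying assumptions (single diagonal-covariance Gaussians, a strictly left-to-right model, and transition probabilities ignored so that only the emission probabilities drive the alignment). It's an illustration of the idea rather than what HInit actually does: align by dynamic programming, re-estimate from the hard alignment, and stop when the probability of the observations given the model stops improving.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def viterbi_align(obs, means, variances):
    """Best left-to-right state sequence (hard alignment) by dynamic programming."""
    T, J = len(obs), len(means)
    logb = np.stack([log_gaussian(obs, means[j], variances[j]) for j in range(J)], axis=1)
    delta = np.full((T, J), -np.inf)      # best partial path score ending at (t, j)
    back = np.zeros((T, J), dtype=int)    # which state the best path came from
    delta[0, 0] = logb[0, 0]              # must start in the first emitting state
    for t in range(1, T):
        for j in range(J):
            prev = [j] if j == 0 else [j, j - 1]           # stay, or move on from j-1
            best = max(prev, key=lambda i: delta[t - 1, i])
            delta[t, j] = delta[t - 1, best] + logb[t, j]
            back[t, j] = best
    states = np.zeros(T, dtype=int)
    states[-1] = J - 1                    # must end in the last emitting state
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states, delta[-1, -1]

def viterbi_training(obs, means, variances, max_iters=10):
    """Iterate: align, re-estimate, stop when the log probability stops improving."""
    prev = -np.inf
    for _ in range(max_iters):
        states, loglik = viterbi_align(obs, means, variances)
        if loglik <= prev:                # the model stopped getting better
            break
        prev = loglik
        for j in range(len(means)):       # hard alignment: each frame updates one state only
            aligned = obs[states == j]
            if len(aligned) > 0:
                means[j] = aligned.mean(axis=0)
                variances[j] = aligned.var(axis=0) + 1e-3   # variance floor
    return means, variances

# Seed with rough parameters (in practice, from uniform segmentation), then refine.
obs = np.random.randn(13, 2)
means = obs[np.linspace(0, len(obs) - 1, 5).astype(int)].copy()
variances = np.ones_like(means)
means, variances = viterbi_training(obs, means, variances)
```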
Before we move up to the Baum-Welch algorithm, I'm going to talk about something else, and that's how we use multiple training examples. Let's stick with Viterbi training, where we find an alignment between a model and an observation sequence and then update the model, and let's assume that I've used uniform segmentation to initialise my model. This picture is the same model drawn six times, aligning itself with six training examples. So let's imagine this is a model of the word "three", and we've got six recordings of the word "three": one model, and we want to estimate this model on these six recordings. For each training example (that's training examples 1, 2, 3, 4, 5 and 6: we've recorded it six times) we'll perform Viterbi alignment to align the model with the observations.
That one was easy; this one maybe would find this alignment, and so on. So we run the Viterbi algorithm to find these alignments. Imagine that's what my Viterbi algorithm finds for the six different recordings of "three", but the one same model. And now I update the model by averaging across all six repetitions. So this state aligned with this observation, this one, this one, these two, this one and this one: that's 1, 2, 3, 4, 5, 6, 7, so there are seven observation vectors there. We just sum them up, divide by seven, and that becomes the mean of that state. We'll do the same for the variance, and for the means of all the other states. When we train on multiple training examples, multiple observation sequences, we simply do the summing and the averaging over all of them, rather than just within one sequence. The extension is very straightforward.
So for the rest of the explanation, I'm going to revert to training on a single observation sequence, because it makes the explanation much easier. But the extension to multiple observation sequences is easy: any summation we did within an observation sequence, we just extend to a summation across observation sequences, each of which gets aligned separately like this, and each of which, as we can see here, might have a different duration. These observation sequences are three, four or five frames long. That's just another level of summation; that's not difficult.
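Here's a minimal sketch of that extra level of summation, assuming each training sequence has already been given a hard alignment (for example by the Viterbi sketch above). We pool the observations aligned to each state across all sequences, sum them, and divide by the total count, just as in the "seven observation vectors, divide by seven" example.

```python
import numpy as np

def update_means(sequences, alignments, num_states):
    """Pool the observations aligned to each state across every training sequence,
    sum them up, and divide by the total count (e.g. 7 frames -> divide by 7)."""
    dim = sequences[0].shape[1]
    totals = np.zeros((num_states, dim))
    counts = np.zeros(num_states)
    for obs, states in zip(sequences, alignments):
        for j in range(num_states):
            aligned = obs[states == j]
            totals[j] += aligned.sum(axis=0)
            counts[j] += len(aligned)
    return totals / counts[:, None]

# Six made-up recordings of "three", of different lengths, each with a hard
# alignment to the same 3-state model (here just the most-diagonal alignment).
sequences = [np.random.randn(T, 2) for T in (4, 5, 4, 6, 5, 4)]
alignments = [np.arange(len(s)) * 3 // len(s) for s in sequences]
means = update_means(sequences, alignments, num_states=3)
```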
What is more difficult to conceptualise is this idea of many alignments. Alignment and state sequence are two ways of saying the same thing. So here is this three-state model and this four-observation sequence; it's the same observation sequence, just drawn three times, because I'm now going to show that there are only three ways in which this model can align with this observation sequence. There they all are; there aren't any other possible alignments or state sequences. Knowing that, we can now start to think about how to use this information about the many different alignments to update, for example, this state's mean. This state's mean should be computed by averaging all of the observations that it emitted in the training data: the observations that it aligned with. Start with the easy case: observation one. It's the same observation, just drawn three times, and it always aligns with this first emitting state; there's no other way for the model to emit this observation sequence. So the probability that this state emitted observation one, o at time t = 1, is one: it always does. So the state occupancy probability of this state, the purple state here, at time one is one.
Now let's look at observation two. We can see that at time two, in only one of these cases does that state emit the observation; it doesn't emit it in the other two alignments, the other two state sequences. So, informally, we can say that observation two should make some contribution to the mean of this first emitting state, because sometimes that state emits it, but not all the time. It should make a partial contribution, and the amount by which it contributes should be weighted by the probability of being in that state at time two. The quantity we need to compute, then, is the probability of being in a particular state at a particular time: the state occupancy probability. That will be used as a weight on the observation from that particular time, when we do a weighted sum of those observations to update the mean of the model.
This is getting increasingly hard to draw and to understand on this finite state picture of the model; we really are going to have to use the lattice to talk about this. Let's zoom in now on this particular state: it's the second emitting state of this HMM, at this particular time, time three. I would like to compute an updated parameter for this model: what's the new estimate of the mean of this state, given this observation sequence? Let's draw out the lattice. The reachable parts of this lattice are these; these cannot be reached, so pretend they don't exist.

How much should this observation contribute to this mean? Well, it's proportional to the probability of being in that state at that time. Of course, we could also be in this state at this time, and we'd be updating that mean as well; or we could be in this state at this time, and we'd be updating that mean too. So this observation makes a contribution to all three of those means, and the proportion by which it contributes is going to be weighted by the state occupancy probability in these three parts of the lattice. One of these will be more probable, another will be less probable; we'll just use those weights. Essentially, all of these states could emit this observation. The state that's most likely to emit it gets the biggest contribution from it to its new, updated mean, and the others get less contribution. It's just like a soft version of Viterbi training: the weight will be the probability of being in that state at that particular time.
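Here's a minimal sketch of that soft update for the means, assuming the state occupancy probabilities (usually written as gamma) have already been computed. The hypothetical soft_mean_update function just forms the weighted average described above: every observation contributes to every state's new mean, weighted by the probability of being in that state at that time.

```python
import numpy as np

def soft_mean_update(obs, gamma):
    """New mean for every state: a weighted average of all the observations,
    weighted by gamma[t, j] = P(being in state j at time t)."""
    # numerator: sum_t gamma[t, j] * o_t ; denominator: sum_t gamma[t, j]
    return (gamma.T @ obs) / gamma.sum(axis=0)[:, None]

# 4 observations, 3 emitting states, illustrative occupancy probabilities:
obs = np.random.randn(4, 2)
gamma = np.array([[1.0, 0.0, 0.0],    # observation 1 can only be in the first state
                  [0.7, 0.3, 0.0],    # observation 2: mostly state one, partly state two
                  [0.1, 0.6, 0.3],
                  [0.0, 0.0, 1.0]])   # the last observation must be in the last state
new_means = soft_mean_update(obs, gamma)
```

Notice that if each row of gamma were all zeros except for a single one, this would reduce exactly to the hard averaging of Viterbi training.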
How do we compute the probability of being in this state at this particular time? Well, we have to start here and sum along all possible paths of getting there; those are all the possible paths in this case. Then we have to emit the observation from the Gaussian that the model currently has, the old model parameters, and then we have to get to the end, so we sum along those paths too. So we can see that to compute the probability of being in this state at this particular time, we have to get there, emit an observation, and then get to the end. Now, there's nothing special about left and right on this diagram, or up and down; we could draw this picture any way we like and it would still be valid, and so we don't have to do the computation left-to-right either. So these computations here, conceptually, we could think of computing from the end, coming backwards. Just conceptually, then, the probability of being in this state at this particular time is some forward probability (getting there from the beginning), the observation probability, and a backwards probability, and hence the name of the algorithm: the forward-backward algorithm.
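In the notation you'll find in the readings, this quantity is usually written as the state occupancy probability gamma. One common form (in the textbook convention where the forward probability alpha already includes emitting the observation at time t, and beta is the backward probability of getting from there to the end) is:

```latex
\gamma_t(j) = P(q_t = j \mid O, \lambda) = \frac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \lambda)}
```

Exact definitions vary slightly between books, depending on whether the emission probability is folded into the forward term or kept separate, as in the description above; the concept is the same.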
Now, that looks like it could be computationally expensive. For every state at every time, we have to compute a forward probability and a backward probability and combine them to find the state occupancy probability, then remember that, and later use it as a weight in a weighted sum of observations to update the parameters of the model in the later step, called the M step. Here, in this step, the E step (the expectation step), we're computing these statistics, the occupancies. It would be very inefficient to compute this forward probability and this backward probability for this state at this time, throw it all away, then do the same for this one, and the same for this one, and then for all the rest of the lattice, repeatedly computing these forward probabilities and backward probabilities. But we don't have to do that, because we know about dynamic programming. We know that we can compute all the forward probabilities by sharing computation across the lattice and keeping the partial results, and we can share all the backward probabilities in the same way.
We're going to stop at that point and not do the maths; if you want the maths, it's in the readings. Do try to understand it if you can, but it's not going to be directly examinable in that mathematical form, only in the conceptual form: to compute the state occupancy probability, we combine a forward probability, the emission probability, and a backward probability.
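For the conceptual form, here's a minimal runnable sketch of how the forward and backward probabilities can be shared by dynamic programming and then combined into occupancy probabilities, for a small left-to-right model with made-up emission probabilities. This is an illustration of the idea, not what HRest actually does, and it ignores numerical issues such as underflow.

```python
import numpy as np

def occupancy_probabilities(b, a):
    """b[t, j] = probability of emitting observation t from state j
       a[i, j] = transition probability from state i to state j
       Returns gamma[t, j] = P(in state j at time t | the whole observation sequence)."""
    T, J = b.shape
    alpha = np.zeros((T, J))              # forward: getting to state j and emitting o_t
    beta = np.zeros((T, J))               # backward: getting from (t, j) to the end
    alpha[0, 0] = b[0, 0]                 # must start in the first emitting state
    for t in range(1, T):                 # dynamic programming: reuse the previous column
        alpha[t] = (alpha[t - 1] @ a) * b[t]
    beta[-1, -1] = 1.0                    # must end in the last emitting state
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[t + 1] * beta[t + 1])
    gamma = alpha * beta                  # combine the forward and backward parts
    return gamma / gamma.sum(axis=1, keepdims=True)

# 4 observations, 3 emitting states, left-to-right transitions, made-up emissions.
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
b = np.random.rand(4, 3)
gamma = occupancy_probabilities(b, a)     # unreachable parts of the lattice get probability 0
```

The gamma matrix this produces is exactly the set of weights used in the soft mean update sketched earlier.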
In the books, you'll see pictures that look like this, and that's because they are deriving the general case, where the HMM could have an arbitrary topology: anything could be connected to anything, and so we have to consider all incoming paths and all outgoing paths. That's just the general case.
You'll also see pictures in the books that consider a pair of states, because we're interested in, for example, this transition probability, or this self-transition here, and then we have to do something like this: we look at the probability of a forward part, of taking this transition at this particular time, and then a backward part. I'm not going to say anything more about transition probability estimation, but it works in just the same way as for the Gaussians: we zoom in on a particular transition in the model, we count how many times we took it, compare that to how many times we took all the other transitions out of that state, and normalise.
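As a sketch of that counting-and-normalising idea, here is the hard-alignment (Viterbi) version, where transitions are simply counted from the state sequences. The Baum-Welch version replaces the hard counts with expected counts computed from the forward and backward probabilities, but the normalisation step is the same.

```python
import numpy as np

def estimate_transitions(alignments, num_states):
    """Count how many times each transition was taken in the hard alignments,
    then normalise by the total number of transitions leaving each state."""
    counts = np.zeros((num_states, num_states))
    for states in alignments:
        for i, j in zip(states[:-1], states[1:]):   # each consecutive pair is one transition
            counts[i, j] += 1
    counts += 1e-8                                  # avoid dividing by zero for states never left
    return counts / counts.sum(axis=1, keepdims=True)

# Two hard alignments of a 3-state, left-to-right model:
alignments = [np.array([0, 0, 1, 1, 1, 2]), np.array([0, 1, 1, 2, 2])]
transitions = estimate_transitions(alignments, num_states=3)
```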
So we're done with the course. Let me give you a couple of tips for things you could do next, if you wanted to follow on with this subject. There's a speech synthesis course next semester that picks up directly where we left off in this course. It assumes that we've covered almost everything we need about text processing, about the front end, and it goes much more deeply into waveform generation: it does unit selection properly and in depth, and then moves forward from that into statistical parametric speech synthesis and neural networks, which are the current state of the art. There is also a course in Informatics called Automatic Speech Recognition, which picks up where this course finished. We had HMMs of phonemes joined together with language models; it takes that forward into context-dependent models of phonemes, called triphones, into clustering them, and then on to the state of the art, which is also neural networks. That's all.
