Hidden state sequences and alignment

Hidden state sequence, Trellis (i.e. lattice) representation of an HMM, aligning observations to states

00:0000:03 A Hidden Markov Model comprises a set of states.
00:0300:09 In each state is a multivariate Gaussian, and that Gaussian can emit observations.
00:0900:14 Each time we arrive in a state of the model, we emit an observation, and then we must leave the state.
00:1400:20 And we must leave along one of a set of transitions that dictates what sequences of states are allowed.
00:2000:25 The model we just saw was a very simple model with a topology, a shape called left to right.
00:2500:28 So the transitions, the arcs between the states govern that.
00:2800:35 And because there are transitions from a state to itself, we can end up back in the same state and thus emit more than one observation from the same state.
00:3500:38 That's a very simple model of duration.
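This "very simple model of duration" can be made concrete: a self-transition with probability a implies a geometric distribution over how many observations a state emits. The sketch below is illustrative only; the value of a is made up, not taken from the model in the video.

```python
# Sketch: the implicit duration model of an HMM self-transition.
# With self-transition probability a, the probability of emitting exactly
# d observations from one state is geometric: stay (d - 1) times, leave once.

def duration_prob(a, d):
    """P(exactly d observations emitted from a state with self-loop prob a)."""
    return a ** (d - 1) * (1.0 - a)

a = 0.7  # hypothetical self-transition probability
print(duration_prob(a, 1))   # probability of emitting exactly one observation
print(sum(duration_prob(a, d) for d in range(1, 200)))  # sums to ~1
```

The geometric shape (monotonically decreasing, most probable duration is always 1) is what makes this a weak duration model, but it is the one the HMM gives us for free.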
00:3800:39 So here's the model we just invented.
00:3900:42 This one's now got five emitting states.
00:4200:48 I've drawn a one-dimensional Gaussian in each state, just to show you that there is a Gaussian there, but of course they will be multivariate.
00:4800:56 So let's have this model emit, or generate, a sequence of observation vectors.
00:5600:58 These are our observation vectors.
00:5801:01 For example, our MFCCs.
01:0101:06 These lines indicate emission, generation, two words for the same thing.
01:0601:14 And this picture shows a five-state Hidden Markov Model emitting a sequence of observations in a particular way.
01:1401:18 The first emitting state emits the first observation in the sequence.
01:1801:22 And then we go around the self-transition and emit the second.
01:2201:25 And then we go around the self-transition again and emit the third.
01:2501:27 And then we leave.
01:2701:29 We arrive in the second emitting state.
01:2901:32 And then we generate the next observation in the sequence.
01:3201:35 We go around the self-transition and emit the next one.
01:3501:38 We go around the self-transition and emit the next one.
01:3801:40 And then we leave.
01:4001:41 And then we emit the next observation in the sequence.
01:4101:46 So every time we arrive in a state we must emit an observation and then leave.
01:4601:51 So this time we leave along the self-transition, and emit another one from that state, and then we leave.
01:5101:54 We emit this one, we leave.
01:5401:57 Back in the same state for one more time.
01:5701:59 Back in the same state again, another observation.
01:5902:00 And then we leave.
02:0002:03 We emit an observation.
02:0302:06 Go around, emit an observation, and we're done.
02:0602:14 Hopefully it's already immediately obvious that there might have been another way of going through the model to generate this observation sequence.
02:1402:19 What I just drew there was one particular state sequence.
02:1902:24 For this particular state sequence, we can compute the probability of this observation sequence given the model.
02:2402:28 And it's just going to be the product of all the emission probabilities.
02:2802:38 So the probability of this Gaussian generating the first observation, multiplied by the probability of this Gaussian generating the next observation, multiplied all the way through the sequence.
02:3802:42 And we could also put some probabilities on the transitions.
02:4202:46 So as we go around the transition, there might be a probability on that.
02:4602:48 Let's write one.
02:4802:51 There's quite a high probability of going down this transition.
02:5102:53 So we'll multiply by all the transition probabilities.
02:5303:00 Transition probabilities leaving any one state have to sum up to one because it's a probability distribution over all the subsequent states.
03:0003:13 So the probability of this model emitting this observation sequence and using this particular state sequence is just the product of all the emission probabilities and all the transition probabilities.
03:1303:23 And it's that very simple product because the model assumes that the probability of emitting this observation sequence only depends on the state that you emit it from, and that's all.
03:2303:36 So the probability of this state emitting this observation and then emitting this observation and then emitting the next observation is just computed independently as the product of those three probabilities.
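The computation just described, the product of all the emission probabilities and all the transition probabilities along one known state sequence, can be sketched in code. This is a minimal illustration with a toy three-state left-to-right model; all the means, variances and transition probabilities are made up, and it uses 1-D Gaussians as in the drawing, where a real system would use multivariate ones.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density (real systems would use multivariate Gaussians)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_prob(obs, state_seq, means, variances, trans):
    """P(O, Q | model) for one known state sequence Q: the product of all
    emission probabilities and all transition probabilities.
    trans[i][j] is the probability of taking the transition i -> j."""
    p = gaussian_pdf(obs[0], means[state_seq[0]], variances[state_seq[0]])
    for t in range(1, len(obs)):
        i, j = state_seq[t - 1], state_seq[t]
        p *= trans[i][j] * gaussian_pdf(obs[t], means[j], variances[j])
    return p

# Hypothetical left-to-right topology: self-loops plus forward transitions.
trans = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}, 2: {2: 1.0}}
means = [0.0, 2.0, 4.0]
variances = [1.0, 1.0, 1.0]
obs = [0.1, 0.2, 2.1, 3.9]
print(joint_prob(obs, [0, 0, 1, 2], means, variances, trans))
```

Notice that the Markov assumption is what makes this a simple running product: each factor depends only on the current state (and, for transitions, the previous one), never on earlier observations.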
03:3603:41 The model doesn't know, it doesn't remember what it did in the previous time step.
03:4103:50 And that lack of memory, that independence of one observation in the sequence from the next given the HMM state is the Markov assumption.
03:5003:53 That's where Markov in the word Hidden Markov Model comes from.
03:5303:55 This model has no memory.
03:5504:03 Now that's a very strong assumption, and it might not be entirely true about this particular observation sequence, but that's the assumption the model makes.
04:0304:06 Now this picture is the most common way of drawing a picture of a Hidden Markov Model.
04:0604:11 It's the one you'll see in textbooks, but it's actually potentially really confusing.
04:1104:14 That's a model and that's an observation sequence.
04:1404:23 Quite clearly, this is time in frames, for example, 10 milliseconds apart.
04:2304:24 That much is clear.
04:2404:27 This is not time.
04:2704:41 This particular model happens to have a topology, a shape, in other words, a set of allowable transitions that is left to right, so we're clearly going to move through the model in that order, but this is a perfectly valid model.
04:4104:44 This is a completely valid Hidden Markov Model.
04:4404:53 Now it might not be particularly useful for doing speech recognition with, but it's a valid model, and so any algorithms that we develop must work for this model as well as for this model.
04:5305:03 So this picture is potentially extremely confusing because the model on the top and the sequence of observations on the bottom are not both along the same time axis.
05:0305:05 Only the observations have a time axis.
05:0505:07 The model is just a finite state model.
05:0705:10 We could draw the model on its side, upside down, back to front.
05:1005:12 It would be the same model.
05:1205:14 The layout on the page has no meaning.
05:1405:16 We should draw a picture that's much clearer.
05:1605:18 The model is a model.
05:1805:20 It always exists.
05:2005:28 It exists now, exists in the past, and exists in the future, and it exists every time step of the observation sequence.
05:2805:34 So let's draw a picture that makes that completely explicit.
05:3405:39 I'm going to draw a copy of the model at every time step.
05:3905:42 This is time in frames.
05:4205:44 So we're stepping through that one frame at a time.
05:4405:46 So I'm just going to draw out a copy of the model.
05:4605:51 I'm going to duplicate it or unroll it over time.
05:5105:53 I'll just draw the states.
05:5306:01 That's the model at time one, at time two, at time three, and so on.
06:0106:05 I'll draw a copy of this model at every time frame.
06:0506:08 This is just a statement that the model always exists.
06:0806:09 It's always there.
06:0906:12 For this particular model, this is the starting state.
06:1206:20 And so the first observation in my observation sequence is always going to have to be emitted by that state because that's where we are at time one.
06:2006:24 And so that says we must be here.
06:2406:28 And we can't be in any of these other states because we haven't had time to go down the transitions to get there.
06:2806:31 I'm going to also assume that we must finish in this state.
06:3106:34 So I'm going to define that as the end state of the model.
06:3406:38 When the model has generated an observation sequence, it must always end up in that state and no other state.
06:3806:43 And that means that that state must always generate the last observation in the sequence.
06:4306:47 That's a statement that we must be here and not here.
06:4706:51 And so we've got to get from here to here.
06:5106:59 We've got to get through the model following the allowable transitions as we go through the observation sequence, and the model generates the observation sequence while we do that.
06:5907:04 This is starting to look quite like dynamic time warping, and that's deliberate.
07:0407:12 So our generative model of sequences is the Hidden Markov Model, which is made of a finite state machine and those states contain Gaussians.
07:1207:14 And now we need to deal with a problem of this model.
07:1407:23 And it's the same problem we had in dynamic time warping, that the alignment between the model and the observation sequence is not known to us.
07:2307:25 Here's the potentially confusing version of the picture again.
07:2507:29 We've got the model on the top, we've got the observation sequence along the bottom.
07:2907:39 I showed you earlier one particular state sequence, one particular way of going through the model and generating this observation sequence, but there are many, many more ways of doing that.
07:3907:44 Just like there are many, many possible alignments between a template and an unknown in dynamic time warping.
07:4407:47 So we can draw an alignment on this picture.
07:4707:54 Let's just make one up.
07:5408:02 And if I number these states, let's just number them this way, 1, 2, 3, 4, and 5.
08:0208:14 Now I've drawn a state sequence that is 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5.
08:1408:18 But you can go away and work out all the other state sequences that are possible.
08:1808:24 They have to start with 1, and they have to end with 5, but there are many, many permutations there.
08:2408:29 Remember what we're doing when we're using a Hidden Markov Model to do pattern recognition.
08:2908:43 We're assuming that one of our Hidden Markov Models that we've stored did generate this observation sequence, and our only task is to work out which one can do it with the highest probability, which was the most likely model to have generated it.
08:4308:49 It will be one of the models, but we don't know what state sequence that model used to generate this observation sequence.
08:4908:55 And so we say that the state sequence of the Hidden Markov Model is hidden from us.
08:5508:58 And that's the definition of hidden in Hidden Markov Model.
08:5809:06 It's that many possible state sequences could have been used to generate the observation sequence, and we don't know which one it was.
09:0609:09 On the left is what I really want to compute.
09:0909:21 I want to compute the probability of an observation sequence given a particular model, and I'm going to repeat that for all the models I've got, and pick the model that gives me the highest value, and use that to label the observation sequence O.
09:2109:24 On the right is what we've already seen how to compute.
09:2409:30 We can compute the joint probability of an observation sequence and a particular state sequence Q.
09:3009:33 So Q is a random variable whose value is a state sequence.
09:3309:41 For example, Q might take the value 1, 1, 1, 2, 2, 2, 3, 4, 5.
09:4109:45 And if we know the state sequence, we can compute the probability of the observation sequence in that state sequence.
09:4509:49 So we can compute this joint probability here.
09:4909:58 So how can we get P of O given the model, from the joint probability of O and Q, both conditioned on this particular model?
09:5810:06 Well, there's an easy way to do that, and we just sum over all the possible values of Q.
10:0610:08 That's a simple rule of probability.
10:0810:14 This operation is called marginalising, or summing away Q.
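Marginalising over Q can be written out by brute force for a tiny example. The sketch below enumerates every state sequence of a toy three-state left-to-right model; to keep it short it uses made-up discrete emission probabilities rather than Gaussians, and all the numbers are invented for illustration.

```python
import itertools

# P(O | model) = sum over all state sequences Q of P(O, Q | model).
# Toy left-to-right model; all probabilities are made up.
trans = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}, 2: {2: 1.0}}
emit = {0: {'a': 0.8, 'b': 0.2},
        1: {'a': 0.3, 'b': 0.7},
        2: {'a': 0.5, 'b': 0.5}}

def joint(obs, q):
    """P(O, Q | model): product of emissions and transitions along q.
    Disallowed transitions contribute probability zero."""
    p = emit[q[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[q[t - 1]].get(q[t], 0.0) * emit[q[t]][obs[t]]
    return p

obs = ['a', 'a', 'b', 'b']
# Sum over every sequence that starts in the first state and ends in the last.
total = sum(joint(obs, q)
            for q in itertools.product([0, 1, 2], repeat=len(obs))
            if q[0] == 0 and q[-1] == 2)
print(total)  # P(O | model), with Q summed away
```

The number of sequences grows exponentially with the length of the observation sequence, which is exactly why this brute-force enumeration is only viable for toy examples.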
10:1410:18 Now I just told you that there might be a lot of possible values of Q.
10:1810:20 Q is the state sequence.
10:2010:24 And the longer the model gets, and the longer the observation sequence gets, the more values of Q are possible.
10:2410:27 It could be a very, very large number.
10:2710:29 So this summation could be a very large summation.
10:2910:32 It could be computationally expensive to do this summation.
10:3210:35 But this is the correct thing to do for the Hidden Markov Model.
10:3510:43 If we really want to compute the probability of an observation sequence given that model, we must sum, over all possible state sequences, the joint probability of the observation sequence and that state sequence.
10:4310:47 So we must sum it away, marginalise it out.
10:4710:53 Because that computation is potentially very expensive, we very often make an approximation to it.
10:5311:01 And we say that it's approximately equal to finding the best Q, the best single value of Q.
11:0111:08 So we just take the maximum over all possible values of Q of this thing here.
11:0811:18 So pick one state sequence, the one that gives you the highest probability, and assume that that's so much better than all the rest that that is a good enough approximation to the sum.
11:1811:22 That turns out to be empirically a very reasonable thing to do.
11:2211:31 This works well in practice: we could do the summation, or we could replace it with this maximisation operation, and our speech recogniser will work just as well.
11:3111:34 It'll give just as low a word error rate.
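The approximation can be seen side by side with the exact computation on a very small example. This is a sketch with a made-up two-state model; it enumerates the (few) valid state sequences directly, which is only feasible because the example is tiny.

```python
import itertools

# The Viterbi approximation: replace the sum over state sequences Q
# with a max over them.  Toy two-state model; all numbers are invented.
trans = {0: {0: 0.7, 1: 0.3}, 1: {1: 1.0}}
emit = {0: {'x': 0.9, 'y': 0.1}, 1: {'x': 0.2, 'y': 0.8}}

def joint(obs, q):
    p = emit[q[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[q[t - 1]].get(q[t], 0.0) * emit[q[t]][obs[t]]
    return p

obs = ['x', 'x', 'y']
seqs = [q for q in itertools.product([0, 1], repeat=len(obs))
        if q[0] == 0 and q[-1] == 1]
exact = sum(joint(obs, q) for q in seqs)   # marginalising over Q
approx = max(joint(obs, q) for q in seqs)  # the single best state sequence
print(exact, approx)  # approx is a lower bound on exact
```

The max is always a lower bound on the sum, but when one state sequence dominates all the others, as tends to happen with speech, the two are close enough that word error rates are unaffected.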
11:3411:40 How do you find the Q that maximises this probability?
11:4011:43 We've got to search amongst all the values of Q.
11:4311:47 But we don't want to enumerate them out because then we might as well just do the summation.
11:4711:58 We want a very efficient algorithm for exploring all the different values of Q, the state sequence, and minimising the amount of computation needed to find the best state sequence.
11:5812:01 Well, that sounds awfully familiar.
12:0112:10 So we defined the Hidden Markov Model, and we said it has this property of a state sequence: the journey we need to take through the model to emit an observation sequence.
12:1012:15 And we don't know what state sequence our model took when it generated a particular observation sequence.
12:1512:21 And so the correct thing to do is to treat the state sequence as a random variable, and to marginalise it away, to sum over it.
12:2112:38 The summation could be very expensive, so we'll make an approximation, we'll take a shortcut, and we'll just say, just find the most likely single state sequence, and compute the probability of that state sequence generating the observations, and we'll take that as the probability that this model generated this observation sequence.
12:3812:47 So we need a data structure on which to do that computation, that computation of finding a single most likely state sequence.
12:4712:52 Whenever I say state sequence, you could also hear alignment between the model and the data.
12:5212:57 And then you could think, that's just the same as dynamic time warping.
12:5713:10 We're going to now create a data structure for doing that computation, and it's going to have to generalise to any shape of Hidden-Markov model, any topology, any pattern of transitions of connections between the states.
13:1013:18 It mustn't be restricted to the case of very simple models that move from left to right, although for speech, almost all our models will be like that.
13:1813:25 We don't want an algorithm that only works in the restricted case; we want a general algorithm, and we want a data structure on which we can run that algorithm.
13:2513:27 So that's what we're going to invent now.
13:2713:29 So we'll start again with dynamic time warping.
13:2913:41 In dynamic time warping, we had an algorithm, called dynamic programming, and to store the computations whilst doing the algorithm, we needed a data structure, and we made a data structure called the grid.
13:4113:51 That's the grid, and each cell in the grid stores the partial path distance from the beginning to this point that's the best so far.
13:5114:00 We saw that the core computation in dynamic time warping was to find all the different ways of arriving in a cell on the grid, and to only keep the best one.
14:0014:06 So the core operation here was a min over distances, to take the minimum.
14:0614:10 Well, minimum over distances and maximum over probabilities, that's the same thing.
14:1014:13 Just think of one as being one over the other.
14:1314:23 So min over distances and max over probabilities are the same thing, and so this suggests the sort of data structure that we might use for a Hidden Markov Model might look something like the grid.
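One convenient way to make the "min over distances equals max over probabilities" equivalence exact is to work with negative log probabilities, so that maximising a product of probabilities becomes minimising a sum of costs. A tiny sketch, with made-up numbers:

```python
import math

# Min over distances and max over probabilities pick the same winner
# when distance is taken to be the negative log probability.
probs = [0.5, 0.1, 0.4]                  # hypothetical path probabilities
costs = [-math.log(p) for p in probs]    # the corresponding "distances"

best_by_max = max(range(len(probs)), key=lambda i: probs[i])
best_by_min = min(range(len(costs)), key=lambda i: costs[i])
print(best_by_max == best_by_min)  # True: same path wins either way
```

Working in the log domain also avoids numerical underflow, since a product of many probabilities smaller than one quickly becomes too small to represent as a floating-point number.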
14:2314:30 The grid is a bit too restrictive, because it requires us to go left to right through this sequence, and left to right through this sequence.
14:3014:37 That's fine for a template and an unknown, but we're going to replace this with a finite state network that could have any topology.
14:3714:42 It might not necessarily go left to right.
14:4214:45 So the data structure we're going to use is called a lattice.
14:4514:51 So on the right is a lattice for this Hidden Markov Model here with three emitting states.
14:5114:57 I'm now going to start drawing my Hidden Markov Models in a way that matches the way that HTK implements them.
14:5715:02 It's going to turn out to be extremely convenient later when we need to join models to other models.
15:0215:08 So I'm going to add non-emitting start and end states, and I'm going to start numbering that as number one.
15:0815:11 The first emitting state is number two, and so on through the model.
15:1115:16 We're going to switch now to HTK style models.
15:1615:20 This is an implementational detail, but it's going to make something later much easier.
15:2015:26 So I must start in my non-emitting state in the lattice, and I must end in my final non-emitting state.
15:2615:29 So they're the definitions of where we must start and end.
15:2915:35 In dynamic time warping, we had a rule, a rule that we had to invent about how you can move through the grid.
15:3515:37 We said that the rule was this one.
15:3715:38 Other rules are available.
15:3815:47 If you go and read the old literature or rather old textbooks, you'll find that there are other possible patterns of moving through a grid, but this is the one we stuck with, the simplest.
15:4715:58 The counterpart of that in the Hidden Markov Model is that the model topology connections and the lack of connections between certain pairs of states tells us how we can move through the lattice.
15:5816:02 So this model says that from the non-emitting start state, you must always go to the first emitting state.
16:0216:04 That's that one there.
16:0416:08 When you arrive there, you must emit the first observation.
16:0816:09 That's done.
16:0916:11 And then time advances one step.
16:1116:17 We must take a transition at every time step, and then you arrive at a state and generate an observation.
16:1716:25 The topology of this model says that you could stay in the same state when time moves forward, or you could move to the next state when time moves forward.
16:2516:30 So you could stay in the same state, or you could move to the next state.
16:3016:35 So it's impossible to get to this state to emit the first observation, or to this one.
16:3516:38 So these will never be visited in the lattice.
16:3816:40 And then we generate this observation.
16:4016:44 Now we could generate it from this state or this state.
16:4416:48 So we do both and we store both computations separately.
16:4816:51 Those are two partial paths through the lattice.
16:5116:55 We don't know which one is best yet, they're in different states, so we store them both.
16:5516:57 We have two parallel paths.
16:5716:59 So we're doing computation in parallel.
16:5917:03 We're exploring all possible alignments between model and observation sequence.
17:0317:09 And this says that there's two different paths alive at the moment, and we can't get there.
17:0917:13 And we've generated this observation sequence, first one and the second one.
17:1317:15 And then we move forward in time again.
17:1517:22 The topology of the model says that from this state, you can stay in the same state or move forward.
17:2217:29 And if you're still here, then you can move forward or stay in the same state.
17:2917:30 And so on.
17:3017:50 So we just use the topology of the model to fill up the lattice.
17:5017:52 Now we have to end up here.
17:5217:56 So these paths that leave the model too early are no good.
17:5217:56 That's equivalent to coming out of the end of the model without having generated all of the observations in the sequence.
18:0118:02 So you haven't accounted for them all.
18:0218:05 So that's not a valid path through the model.
18:0518:06 These ones we already said you can't even get there.
18:0618:09 And we can see that this one is a dead end.
18:0918:11 There's no way to get to the end in time.
18:1118:14 And so is this one.
18:1418:17 And so the lattice is this picture here.
18:1718:21 And this lattice is the exact counterpart to the grid in dynamic time warping.
18:2118:24 And we would do computation on it in exactly the same way.
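The computation on the lattice can be sketched as code: one copy of the model's states per frame, with each cell keeping only the best partial path that arrives there (the max over predecessors, the counterpart of DTW's min). This is a minimal illustrative Viterbi, not HTK's implementation; it works in log probabilities, uses a toy three-emitting-state left-to-right model with invented parameters, and enforces the start-in-first-state, end-in-last-state constraint from the video by how it initialises and reads out the lattice.

```python
import math

LOG0 = float('-inf')  # log probability of an impossible path

def viterbi(obs, n_states, log_trans, log_emit):
    """Best log P(O, Q | model) over state sequences Q that start in the
    first emitting state and end in the last.  log_trans[i][j] gives the
    log transition probability i -> j; missing entries are disallowed."""
    # delta[j] = best partial-path log prob ending in state j at this frame
    delta = [LOG0] * n_states
    delta[0] = log_emit(0, obs[0])           # must start in the first state
    for t in range(1, len(obs)):
        new = [LOG0] * n_states
        for j in range(n_states):
            # all the ways of arriving in this cell; keep only the best
            best = max(delta[i] + log_trans[i].get(j, LOG0)
                       for i in range(n_states))
            new[j] = best + log_emit(j, obs[t])
        delta = new
    return delta[-1]                          # must end in the last state

# Hypothetical left-to-right topology and 1-D unit-variance Gaussians.
log_trans = [{0: math.log(0.6), 1: math.log(0.4)},
             {1: math.log(0.5), 2: math.log(0.5)},
             {2: 0.0}]
means = [0.0, 2.0, 4.0]
def log_emit(j, x):
    return -0.5 * (x - means[j]) ** 2 - 0.5 * math.log(2.0 * math.pi)

print(viterbi([0.1, 1.9, 2.1, 4.2], 3, log_trans, log_emit))
```

Cells that cannot be reached in time, or that cannot reach the end state in time, simply stay at log probability minus infinity, which is the code's version of pruning the dead ends from the lattice picture.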
