Hidden state sequences and alignment

Hidden state sequence, Trellis (i.e. lattice) representation of an HMM, aligning observations to states

00:0000:03 A Hidden Markov Model comprises a set of states.
00:0300:09 In each state is a multivariate Gaussian, and that Gaussian can emit observations.
00:0900:14 Each time we arrive in a state of the model, we emit an observation, and then we must leave the state.
00:1400:20 And we must leave along one of a set of transitions that dictates what sequences of states are allowed.
00:2000:25 The model we just saw was a very simple model with a topology, a shape called left to right.
00:2500:28 So the transitions, the arcs between the states govern that.
00:2800:35 And because there are transitions from a state to itself, we can end up back in the same state and thus emit more than one observation from the same state.
00:3500:38 That's a very simple model of duration.
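This "very simple model of duration" can be made concrete: a self-transition with probability a implies a geometric distribution over how many observations a state emits. The sketch below is illustrative only; the value of a is made up, not taken from the model in the video.

```python
# Sketch: the implicit duration model of an HMM self-transition.
# With self-transition probability a, the probability of emitting exactly
# d observations from one state is geometric: stay (d - 1) times, leave once.

def duration_prob(a, d):
    """P(exactly d observations emitted from a state with self-loop prob a)."""
    return a ** (d - 1) * (1.0 - a)

a = 0.7  # hypothetical self-transition probability
print(duration_prob(a, 1))   # probability of emitting exactly one observation
print(sum(duration_prob(a, d) for d in range(1, 200)))  # sums to ~1
```

The geometric shape (monotonically decreasing, most probable duration is always 1) is what makes this a weak duration model, but it is the one the HMM gives us for free.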
00:3800:39 So here's the model we just invented.
00:3900:42 This one's now got five emitting states.
00:4200:48 I've drawn a one-dimensional Gaussian in each state, just to show you that there is a Gaussian there, but of course they will be multivariate.
00:4800:56 So let's have this model emit, or generate, a sequence of observation vectors.
00:5600:58 These are our observation vectors.
00:5801:01 For example, our MFCCs.
01:0101:06 These lines indicate emission, generation, two words for the same thing.
01:0601:14 And this picture shows a five-state Hidden Markov Model emitting a sequence of observations in a particular way.
01:1401:18 The first emitting state emits the first observation in the sequence.
01:1801:22 And then we go around the self-transition and emit the second.
01:2201:25 And then we go around the self-transition again and emit the third.
01:2501:27 And then we leave.
01:2701:29 We arrive in the second emitting state.
01:2901:32 And then we generate the next observation in the sequence.
01:3201:35 We go around the self-transition and emit the next one.
01:3501:38 We go around the self-transition and emit the next one.
01:3801:40 And then we leave.
01:4001:41 And then we emit the next observation in the sequence.
01:4101:46 So every time we arrive in a state we must emit an observation and then leave.
01:4601:51 So this time we leave along the self-transition, and emit another one from that state, and then we leave.
01:5101:54 We emit this one, we leave.
01:5401:57 Back in the same state for one more time.
01:5701:59 Back in the same state again, another observation.
01:5902:00 And then we leave.
02:0002:03 We emit an observation.
02:0302:06 Go around, emit an observation, and we're done.
02:0602:14 Hopefully it's already immediately obvious that there might have been another way of going through the model to generate this observation sequence.
02:1402:19 What I just drew there was one particular state sequence.
02:1902:24 For this particular state sequence, we can compute the probability of this observation sequence given the model.
02:2402:28 And it's just going to be the product of all the emission probabilities.
02:2802:38 So the probability of this Gaussian generating the first observation, multiplied by the probability of this Gaussian generating the next observation, multiplied all the way through the sequence.
02:3802:42 And we could also put some probabilities on the transitions.
02:4202:46 So as we go around the transition, there might be a probability on that.
02:4602:48 Let's write one.
02:4802:51 There's quite a high probability of going down this transition.
02:5102:53 So we'll multiply by all the transition probabilities.
02:5303:00 Transition probabilities leaving any one state have to sum up to one because it's a probability distribution over all the subsequent states.
03:0003:13 So the probability of this model emitting this observation sequence and using this particular state sequence is just the product of all the emission probabilities and all the transition probabilities.
03:1303:23 And it's that very simple product because the model assumes that the probability of emitting this observation sequence only depends on the state that you emit it from, and that's all.
03:2303:36 So the probability of this state emitting this observation and then emitting this observation and then emitting the next observation is just computed independently as the product of those three probabilities.
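The computation just described, the product of all the emission probabilities and all the transition probabilities along one known state sequence, can be sketched in code. This is a minimal illustration with a toy three-state left-to-right model; all the means, variances and transition probabilities are made up, and it uses 1-D Gaussians as in the drawing, where a real system would use multivariate ones.

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density (real systems would use multivariate Gaussians)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_prob(obs, state_seq, means, variances, trans):
    """P(O, Q | model) for one known state sequence Q: the product of all
    emission probabilities and all transition probabilities.
    trans[i][j] is the probability of taking the transition i -> j."""
    p = gaussian_pdf(obs[0], means[state_seq[0]], variances[state_seq[0]])
    for t in range(1, len(obs)):
        i, j = state_seq[t - 1], state_seq[t]
        p *= trans[i][j] * gaussian_pdf(obs[t], means[j], variances[j])
    return p

# Hypothetical left-to-right topology: self-loops plus forward transitions.
trans = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}, 2: {2: 1.0}}
means = [0.0, 2.0, 4.0]
variances = [1.0, 1.0, 1.0]
obs = [0.1, 0.2, 2.1, 3.9]
print(joint_prob(obs, [0, 0, 1, 2], means, variances, trans))
```

Notice that the Markov assumption is what makes this a simple running product: each factor depends only on the current state (and, for transitions, the previous one), never on earlier observations.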
03:3603:41 The model doesn't know, it doesn't remember what it did in the previous time step.
03:4103:50 And that lack of memory, that independence of one observation in the sequence from the next given the HMM state is the Markov assumption.
03:5003:53 That's where Markov in the word Hidden Markov Model comes from.
03:5303:55 This model has no memory.
03:5504:03 Now that's a very strong assumption, and it might not be entirely true about this particular observation sequence, but that's the assumption the model makes.
04:0304:06 Now this picture is the most common way of drawing a picture of a Hidden Markov Model.
04:0604:11 It's the one you'll see in textbooks, but it's actually potentially really confusing.
04:1104:14 That's a model and that's an observation sequence.
04:1404:23 Quite clearly, this is time in frames, for example, 10 milliseconds apart.
04:2304:24 That much is clear.
04:2404:27 This is not time.
04:2704:41 This particular model happens to have a topology, a shape, in other words, a set of allowable transitions that is left to right, so we're clearly going to move through the model in that order, but this is a perfectly valid model.
04:4104:44 This is a completely valid Hidden Markov Model.
04:4404:53 Now it might not be particularly useful for doing speech recognition with, but it's a valid model, and so any algorithms that we develop must work for this model as well as for this model.
04:5305:03 So this picture is potentially extremely confusing because the model on the top and the sequence of observations on the bottom are not both along the same time axis.
05:0305:05 Only the observations have a time axis.
05:0505:07 The model is just a finite state model.
05:0705:10 We could draw the model on its side, upside down, back to front.
05:1005:12 It would be the same model.
05:1205:14 The layout on the page has no meaning.
05:1405:16 We should draw a picture that's much clearer.
05:1605:18 The model is a model.
05:1805:20 It always exists.
05:2005:28 It exists now, exists in the past, and exists in the future, and it exists every time step of the observation sequence.
05:2805:34 So let's draw a picture that makes that completely explicit.
05:3405:39 I'm going to draw a copy of the model at every time step.
05:3905:42 This is time in frames.
05:4205:44 So we're stepping through that one frame at a time.
05:4405:46 So I'm just going to draw out a copy of the model.
05:4605:51 I'm going to duplicate it or unroll it over time.
05:5105:53 I'll just draw the states.
05:5306:01 That's the model at time one, at time two, at time three, and so on.
06:0106:05 I'll draw a copy of this model at every time frame.
06:0506:08 This is just a statement that the model always exists.
06:0806:09 It's always there.
06:0906:12 For this particular model, this is the starting state.
06:1206:20 And so the first observation in my observation sequence is always going to have to be emitted by that state because that's where we are at time one.
06:2006:24 And so that says we must be here.
06:2406:28 And we can't be in any of these other states because we haven't had time to go down the transitions to get there.
06:2806:31 I'm going to also assume that we must finish in this state.
06:3106:34 So I'm going to define that as the end state of the model.
06:3406:38 When the model has generated an observation sequence, it must always end up in that state and no other state.
06:3806:43 And that means that that state must always generate the last observation in the sequence.
06:4306:47 That's a statement that we must be here and not here.
06:4706:51 And so we've got to get from here to here.
06:5106:59 We've got to get through the model following the allowable transitions as we go through the observation sequence, and the model generates the observation sequence while we do that.
06:5907:04 This is starting to look quite like dynamic time warping, and that's deliberate.
07:0407:12 So our generative model of sequences is the Hidden Markov Model, which is made of a finite state machine and those states contain Gaussians.
07:1207:14 And now we need to deal with a problem of this model.
07:1407:23 And it's the same problem we had in dynamic time warping, that the alignment between the model and the observation sequence is not known to us.
07:2307:25 Here's the potentially confusing version of the picture again.
07:2507:29 We've got the model on the top, we've got the observation sequence along the bottom.
07:2907:39 I showed you earlier one particular state sequence, one particular way of going through the model and generating this observation sequence, but there are many, many more ways of doing that.
07:3907:44 Just like there are many, many possible alignments between a template and an unknown in dynamic time warping.
07:4407:47 So we can draw an alignment on this picture.
07:4707:54 Let's just make one up.
07:5408:02 And if I number these states, let's just number them this way, 1, 2, 3, 4, and 5.
08:0208:14 Now I've drawn a state sequence that is 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5.
08:1408:18 But you can go away and work out all the other state sequences that are possible.
08:1808:24 They have to start with 1, and they have to end with 5, but there are many, many permutations there.
08:2408:29 Remember what we're doing when we're using a Hidden Markov Model to do pattern recognition.
08:2908:43 We're assuming that one of our Hidden Markov Models that we've stored did generate this observation sequence, and our only task is to work out which one can do it with the highest probability, which was the most likely model to have generated it.
08:4308:49 It will be one of the models, but we don't know what state sequence that model used to generate this observation sequence.
08:4908:55 And so we say that the state sequence of the Hidden Markov Model is hidden from us.
08:5508:58 And that's the definition of hidden in Hidden Markov Model.
08:5809:06 It's that many possible state sequences could have been used to generate the observation sequence, and we don't know which one it was.
09:0609:09 On the left is what I really want to compute.
09:0909:21 I want to compute the probability of an observation sequence given a particular model, and I'm going to repeat that for all the models I've got, and pick the model that gives me the highest value, and use that to label the observation sequence O.
09:2109:24 On the right is what we've already seen how to compute.
09:2409:30 We can compute the joint probability of an observation sequence and a particular state sequence Q.
09:3009:33 So Q is a random variable whose value is a state sequence.
09:3309:41 For example, Q might take the value 1, 1, 1, 2, 2, 2, 3, 4, 5.
09:4109:45 And if we know the state sequence, we can compute the probability of the observation sequence in that state sequence.
09:4509:49 So we can compute this joint probability here.
09:4909:58 So how can we get P of O given the model, from the joint probability of O and Q, both conditioned on this particular model?
09:5810:06 Well, there's an easy way to do that, and we just sum over all the possible values of Q.
10:0610:08 That's a simple rule of probability.
10:0810:14 This operation is called marginalising, or summing away Q.
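Marginalising over Q can be written out by brute force for a tiny example. The sketch below enumerates every state sequence of a toy three-state left-to-right model; to keep it short it uses made-up discrete emission probabilities rather than Gaussians, and all the numbers are invented for illustration.

```python
import itertools

# P(O | model) = sum over all state sequences Q of P(O, Q | model).
# Toy left-to-right model; all probabilities are made up.
trans = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}, 2: {2: 1.0}}
emit = {0: {'a': 0.8, 'b': 0.2},
        1: {'a': 0.3, 'b': 0.7},
        2: {'a': 0.5, 'b': 0.5}}

def joint(obs, q):
    """P(O, Q | model): product of emissions and transitions along q.
    Disallowed transitions contribute probability zero."""
    p = emit[q[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[q[t - 1]].get(q[t], 0.0) * emit[q[t]][obs[t]]
    return p

obs = ['a', 'a', 'b', 'b']
# Sum over every sequence that starts in the first state and ends in the last.
total = sum(joint(obs, q)
            for q in itertools.product([0, 1, 2], repeat=len(obs))
            if q[0] == 0 and q[-1] == 2)
print(total)  # P(O | model), with Q summed away
```

The number of sequences grows exponentially with the length of the observation sequence, which is exactly why this brute-force enumeration is only viable for toy examples.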
10:1410:18 Now I just told you that there might be a lot of possible values of Q.
10:1810:20 Q is the state sequence.
10:2010:24 And the longer the model gets, and the longer the observation sequence gets, the more values of Q are possible.
10:2410:27 It could be a very, very large number.
10:2710:29 So this summation could be a very large summation.
10:2910:32 It could be computationally expensive to do this summation.
10:3210:35 But this is the correct thing to do for the Hidden Markov Model.
10:3510:43 If we really want to compute the probability of an observation sequence given that model, we must sum, over all possible state sequences, the joint probability of the observation sequence and that state sequence.
10:4310:47 So we must sum it away, marginalise it out.
10:4710:53 Because that computation is potentially very expensive, we very often make an approximation to it.
10:5311:01 And we say that it's approximately equal to finding the best Q, the best single value of Q.
11:0111:08 So we just take the maximum over all possible values of Q of this thing here.
11:0811:18 So pick one state sequence, the one that gives you the highest probability, and assume that that's so much better than all the rest that that is a good enough approximation to the sum.
11:1811:22 That turns out to be empirically a very reasonable thing to do.
11:2211:31 This works well in practice: we could do the summation, or we could replace it with this maximisation operation, and our speech recogniser will work just as well.
11:3111:34 It'll give just as low a word error rate.
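The approximation can be seen side by side with the exact computation on a very small example. This is a sketch with a made-up two-state model; it enumerates the (few) valid state sequences directly, which is only feasible because the example is tiny.

```python
import itertools

# The Viterbi approximation: replace the sum over state sequences Q
# with a max over them.  Toy two-state model; all numbers are invented.
trans = {0: {0: 0.7, 1: 0.3}, 1: {1: 1.0}}
emit = {0: {'x': 0.9, 'y': 0.1}, 1: {'x': 0.2, 'y': 0.8}}

def joint(obs, q):
    p = emit[q[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[q[t - 1]].get(q[t], 0.0) * emit[q[t]][obs[t]]
    return p

obs = ['x', 'x', 'y']
seqs = [q for q in itertools.product([0, 1], repeat=len(obs))
        if q[0] == 0 and q[-1] == 1]
exact = sum(joint(obs, q) for q in seqs)   # marginalising over Q
approx = max(joint(obs, q) for q in seqs)  # the single best state sequence
print(exact, approx)  # approx is a lower bound on exact
```

The max is always a lower bound on the sum, but when one state sequence dominates all the others, as tends to happen with speech, the two are close enough that word error rates are unaffected.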
11:3411:40 How do you find the Q that maximises this probability?
11:4011:43 We've got to search amongst all the values of Q.
11:4311:47 But we don't want to enumerate them out because then we might as well just do the summation.
11:4711:58 We want a very efficient algorithm for exploring all the different values of Q, the state sequence, and minimising the amount of computation needed to find the best state sequence.
11:5812:01 Well, that sounds awfully familiar.
12:0112:10 So we defined the Hidden Markov Model, and we said it has this property of a state sequence: the journey we need to take through the model to emit an observation sequence.
12:1012:15 And we don't know what state sequence our model took when it generated a particular observation sequence.
12:1512:21 And so the correct thing to do is to treat the state sequence as a random variable, and to marginalise it away, to sum over it.
12:2112:38 The summation could be very expensive, so we'll make an approximation, we'll take a shortcut, and we'll just say, just find the most likely single state sequence, and compute the probability of that state sequence generating the observations, and we'll take that as the probability that this model generated this observation sequence.
12:3812:47 So we need a data structure on which to do that computation, that computation of finding a single most likely state sequence.
12:4712:52 Whenever I say state sequence, you could also hear alignment between the model and the data.
12:5212:57 And then you could think, that's just the same as dynamic time warping.
12:5713:10 We're going to now create a data structure for doing that computation, and it's going to have to generalise to any shape of Hidden-Markov model, any topology, any pattern of transitions of connections between the states.
13:1013:18 It mustn't be restricted to the case of very simple models that move from left to right, although for speech, almost all our models will be like that.
13:1813:25 We don't want an algorithm that only works in the restricted case; we want a general algorithm, and we want a data structure on which we can run that algorithm.
13:2513:27 So that's what we're going to invent now.
13:2713:29 So we'll start again with dynamic time warping.
13:2913:41 In dynamic time warping, we had an algorithm, called dynamic programming, and to store the computations whilst doing the algorithm, we needed a data structure, and we made a data structure called the grid.
13:4113:51 That's the grid, and each cell in the grid stores the partial path distance from the beginning to this point that's the best so far.
13:5114:00 We saw that the core computation in dynamic time warping was to find all the different ways of arriving in a cell on the grid, and to only keep the best one.
14:0014:06 So the core operation here was a min over distances, to take the minimum.
14:0614:10 Well, minimum over distances and maximum over probabilities, that's the same thing.
14:1014:13 Just think of one as being one over the other.
14:1314:23 So min over distances and max over probabilities are the same thing, and so this suggests the sort of data structure that we might use for a Hidden Markov Model might look something like the grid.
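One convenient way to make the "min over distances equals max over probabilities" equivalence exact is to work with negative log probabilities, so that maximising a product of probabilities becomes minimising a sum of costs. A tiny sketch, with made-up numbers:

```python
import math

# Min over distances and max over probabilities pick the same winner
# when distance is taken to be the negative log probability.
probs = [0.5, 0.1, 0.4]                  # hypothetical path probabilities
costs = [-math.log(p) for p in probs]    # the corresponding "distances"

best_by_max = max(range(len(probs)), key=lambda i: probs[i])
best_by_min = min(range(len(costs)), key=lambda i: costs[i])
print(best_by_max == best_by_min)  # True: same path wins either way
```

Working in the log domain also avoids numerical underflow, since a product of many probabilities smaller than one quickly becomes too small to represent as a floating-point number.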
14:2314:30 The grid is a bit too restrictive, because it requires us to go left to right through this sequence, and left to right through this sequence.
14:3014:37 That's fine for a template and an unknown, but we're going to replace this with a finite state network that could have any topology.
14:3714:42 It might not necessarily go left to right.
14:4214:45 So the data structure we're going to use is called a lattice.
14:4514:51 So on the right is a lattice for this Hidden Markov Model here with three emitting states.
14:5114:57 I'm now going to start drawing my Hidden Markov Models in a way that matches the way that HTK implements them.
14:5715:02 It's going to turn out to be extremely convenient later when we need to join models to other models.
15:0215:08 So I'm going to add non-emitting start and end states, and I'm going to start numbering that as number one.
15:0815:11 The first emitting state is number two, and so on through the model.
15:1115:16 We're going to switch now to HTK style models.
15:1615:20 This is an implementational detail, but it's going to make something later much easier.
15:2015:26 So I must start in my non-emitting state in the lattice, and I must end in my final non-emitting state.
15:2615:29 So they're the definitions of where we must start and end.
15:2915:35 In dynamic time warping, we had a rule, a rule that we had to invent about how you can move through the grid.
15:3515:37 We said that the rule was this one.
15:3715:38 Other rules are available.
15:3815:47 If you go and read the old literature or rather old textbooks, you'll find that there are other possible patterns of moving through a grid, but this is the one we stuck with, the simplest.
15:4715:58 The counterpart of that in the Hidden Markov Model is that the model topology connections and the lack of connections between certain pairs of states tells us how we can move through the lattice.
15:5816:02 So this model says that from the non-emitting start state, you must always go to the first emitting state.
16:0216:04 That's that one there.
16:0416:08 When you arrive there, you must emit the first observation.
16:0816:09 That's done.
16:0916:11 And then time advances one step.
16:1116:17 We must take a transition at every time step, and then you arrive at a state and generate an observation.
16:1716:25 The topology of this model says that you could stay in the same state when time moves forward, or you could move to the next state when time moves forward.
16:2516:30 So you could stay in the same state, or you could move to the next state.
16:3016:35 So it's impossible to get to this state to emit the first observation, or to this one.
16:3516:38 So these will never be visited in the lattice.
16:3816:40 And then we generate this observation.
16:4016:44 Now we could generate it from this state or this state.
16:4416:48 So we do both and we store both computations separately.
16:4816:51 Those are two partial paths through the lattice.
16:5116:55 We don't know which one is best yet, they're in different states, so we store them both.
16:5516:57 We have two parallel paths.
16:5716:59 So we're doing computation in parallel.
16:5917:03 We're exploring all possible alignments between model and observation sequence.
17:0317:09 And this says that there's two different paths alive at the moment, and we can't get there.
17:0917:13 And we've generated this observation sequence, first one and the second one.
17:1317:15 And then we move forward in time again.
17:1517:22 The topology of the model says that from this state, you can stay in the same state or move forward.
17:2217:29 And if you're still here, then you can move forward or stay in the same state.
17:2917:30 And so on.
17:3017:50 So we just use the topology of the model to fill up the lattice.
17:5017:52 Now we have to end up here.
17:5217:56 So these paths that leave the model too early are no good.
17:5217:56 That's equivalent to coming out of the end of the model without having generated all of the observations in the sequence.
18:0118:02 So you haven't accounted for them all.
18:0218:05 So that's not a valid path through the model.
18:0518:06 These ones we already said you can't even get there.
18:0618:09 And we can see that this one is a dead end.
18:0918:11 There's no way to get to the end in time.
18:1118:14 And so is this one.
18:1418:17 And so the lattice is this picture here.
18:1718:21 And this lattice is the exact counterpart to the grid in dynamic time warping.
18:2118:24 And we would do computation on it in exactly the same way.
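The computation on the lattice can be sketched as code: one copy of the model's states per frame, with each cell keeping only the best partial path that arrives there (the max over predecessors, the counterpart of DTW's min). This is a minimal illustrative Viterbi, not HTK's implementation; it works in log probabilities, uses a toy three-emitting-state left-to-right model with invented parameters, and enforces the start-in-first-state, end-in-last-state constraint from the video by how it initialises and reads out the lattice.

```python
import math

LOG0 = float('-inf')  # log probability of an impossible path

def viterbi(obs, n_states, log_trans, log_emit):
    """Best log P(O, Q | model) over state sequences Q that start in the
    first emitting state and end in the last.  log_trans[i][j] gives the
    log transition probability i -> j; missing entries are disallowed."""
    # delta[j] = best partial-path log prob ending in state j at this frame
    delta = [LOG0] * n_states
    delta[0] = log_emit(0, obs[0])           # must start in the first state
    for t in range(1, len(obs)):
        new = [LOG0] * n_states
        for j in range(n_states):
            # all the ways of arriving in this cell; keep only the best
            best = max(delta[i] + log_trans[i].get(j, LOG0)
                       for i in range(n_states))
            new[j] = best + log_emit(j, obs[t])
        delta = new
    return delta[-1]                          # must end in the last state

# Hypothetical left-to-right topology and 1-D unit-variance Gaussians.
log_trans = [{0: math.log(0.6), 1: math.log(0.4)},
             {1: math.log(0.5), 2: math.log(0.5)},
             {2: 0.0}]
means = [0.0, 2.0, 4.0]
def log_emit(j, x):
    return -0.5 * (x - means[j]) ** 2 - 0.5 * math.log(2.0 * math.pi)

print(viterbi([0.1, 1.9, 2.1, 4.2], 3, log_trans, log_emit))
```

Cells that cannot be reached in time, or that cannot reach the end state in time, simply stay at log probability minus infinity, which is the code's version of pruning the dead ends from the lattice picture.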
