Total video to watch in this section: 51 minutes (divided into 3 parts for easier viewing)
Here are the slides for these videos.
Assumptions & references made in these videos
You need to be comfortable with the concepts of independence and conditional independence, and that making independence assumptions allows us to use simpler models. For example, the Module 8 lecture should have helped you understand what independence assumption we are making when we choose the multivariate Gaussian with diagonal covariance as our model.
We've been on a journey towards the Hidden Markov Model, and finally we've arrived. We started by talking about pattern matching, and we did that with an algorithm called dynamic time warping, in which we stored a template, which is just a recorded example of some speech we'd like to recognize, perhaps an isolated word, called an exemplar, and we matched that against some unknown speech that we would like to label. And we measured distance between the stored template and the unknown. We then realized that this lacks any way of modeling the natural variability of speech. The template is just a single stored example, it doesn't express how much we can vary from that example and still be within the same class. So we decided to solve that by replacing the exemplar with a probabilistic model, and we looked in our toolbox of probabilistic models and found the Gaussian as the only model that we can use. It has the right mathematical properties and it's very simple. We're going to use the Gaussian as the core of our probabilistic model, and that led us into a little bit of a detour to worry about the sort of features that we could model with a Gaussian. And we spent some time doing feature engineering to come up with features called MFCCs that are suitable for modeling with diagonal covariance Gaussians. So now, armed with the knowledge of dynamic time warping and knowing its shortcomings, and knowing that we've now got these excellent features called Mel Frequency Cepstral coefficients, we're ready to build a probabilistic generative model of sequences of MFCC vectors, and that model is going to be the Hidden Markov Model. We're going to have to understand what's hidden about it and what's Markov about it. So to orient ourselves in our journey, this is what we know so far. We talked about distances between patterns. 
We saw that there's an interesting problem there of aligning sequences of varying length. We found a very nice algorithm for doing that, called dynamic programming, and an instantiation of that algorithm, called dynamic time warping. We saw that we could perform the algorithm of dynamic programming on a data structure called a grid. And that's just a place to store partial distances. We're now going to invent the Hidden Markov Model. We'll understand what's hidden about it. It's its state sequence, and that's connected to the alignment between templates and unknowns. The state sequence is a way of expressing the alignment. And then we'll develop an algorithm for doing computations with the Hidden Markov Model that deals with this problem of the hidden state sequence, and the data structure for that algorithm is going to be called a lattice, and it's very similar to the grid in dynamic time warping. Think of it as a generalization of the grid. And so we're now going to perform dynamic programming on a different data structure, on the lattice, and that will be doing dynamic programming for Hidden Markov Models, and the algorithm gets a special name again. It's called the Viterbi algorithm. That's the connection between dynamic time warping, as we know it so far, and Hidden Markov Models as we're about to learn them. So we know that the single template in dynamic time warping doesn't capture variability. There were many solutions proposed to that before someone invented the HMM, for example, storing multiple templates to express variability. Having a variety of natural exemplars is a reasonable way to capture variability. The HMM is going to leverage that idea too, but it's not going to store multiple exemplars. It's going to summarize their properties in statistics. Essentially that just boils down to the means and variances of Gaussians. 
So although there are ways of dealing with variability in dynamic time warping, we're not going to cover them. We're going to go straight into using statistical models, where in some abstract feature space, let's draw a slice of Cepstral feature space, the 4th versus the 7th Cepstral coefficient, so some slice through this Cepstral feature space. We're going to have multiple recorded examples of a particular class, and rather than store those, we're going to summarize them as a distribution, which has a mean and a standard deviation along each dimension, or a variance if you prefer. So here's what we're going to cover in this module, module 8. So we already know about the multivariate Gaussian as a generative model. A multivariate Gaussian generates vectors, and we're going to start calling them observations because they're emitted from this generative model. We know that speech changes its properties over time, so we're going to need a sequence of Gaussians to generate speech. So we need a mechanism to do that sequencing, and that amounts to dealing with duration. And we're going to come up with a very simple mechanism for that, which is a finite state network. And so when we put these Gaussian generative models into the states of a finite state network, we get the Hidden Markov Model. We're going to see when the model generates data, it could take many different state sequences to do that, and that's why we say the state sequence is hidden. It's unknown to us, and we have to deal with that problem. One way of doing that is to draw out a data structure, which is the Hidden Markov Model replicated at every time instant, that's called a lattice. We're also going to see that we can perform dynamic programming directly on a data structure that is the Hidden Markov Model itself, and that's an implementational difference. 
Both of those are dynamic programming for the Hidden Markov Model, and both of those ways of doing things, either the lattice or this computation directly on the hidden Markov model itself, are the Viterbi algorithm. When we finish all of that, we'll finally remember that we don't yet have a way of estimating the parameters of the model from the data. So everything here in module 8, we're going to assume that someone's given us the model, some pre-trained model. So we don't have the algorithm for training the model yet. So at the very end, we'll make a first step towards that, and we'll just remind ourselves how we can train a single Gaussian on some data. We might call that fitting the Gaussian to the data, or estimating the mean and variance of the Gaussian given some data. And we'll see that that is easy, but the hidden state sequence property of the HMM makes it a little harder to do that for a Hidden Markov Model and a sequence of observations. But that's coming in the next module. So we know already that a single Gaussian, a multivariate Gaussian, can generate feature vectors, and we're going to call them observations. That's just to remind ourselves that they are generated by a model. Let's just make sure we fully understand that. Again, I'll draw some part of our feature space. I can't draw in 12-dimensional MFCC space or higher, so I'm always going to draw just two of the dimensions. Any two, it doesn't matter. Two of the MFCC coefficients. How does a Gaussian generate? Well, the Gaussian is a mean, and a mean is just a point in the space. And we have a variance along each of the feature dimensions. We're going to assume there's no covariance, and so there's some variance, a standard deviation in this direction, perhaps some in this direction. So one way of drawing that is to draw a contour line, one standard deviation from the mean. Like that. This model can generate data points. What does that mean? 
It means we can randomly sample from the model, and it will tend to generate samples nearer the mean. And just how spread they are from the mean is governed by the parameter variance. So if I press the button on the model to randomly sample, it generates a data point, or another data point, or another data point, and just occasionally, a data point far from the mean. More often than not, they're near the mean, and so it generates data points that have this Gaussian distribution. So that's the model being a generative model. But we're not doing generation. We're not doing synthesis of speech. We're doing recognition of speech. So what does it mean when we say there's a generative model? Well, it says we are pretending, we're assuming that the feature vectors extracted from speech that we observe coming into our speech recognizer were really generated by a Gaussian. We just don't know which Gaussian, and the job of the classifier is to determine which of the Gaussians is the most probable Gaussian to have generated a feature vector. Now, of course, the feature vectors weren't generated by a Gaussian. They're extracted from speech, and the speech was generated by the human vocal tract. But we're making an assumption. We're assuming that their distribution is Gaussian, and that we can model with a generative model that has a Gaussian distribution. So it's an assumption. Now since these blue data points were extracted from natural speech, and what we're trying to do is decide whether they were generated by this Gaussian, we don't really generate from the Gaussian. We just take the Gaussian's parameters. We take one of these data points, and we compute the probability that this observation came from this Gaussian. And we just read that off the Gaussian curve as the probability density, and we'll take that as proportional to the probability. 
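The random-sampling behaviour described above can be sketched in a few lines of Python. This is purely illustrative: the means and standard deviations are made-up values, not from any trained model, and with diagonal covariance each dimension can be sampled independently.

```python
import random

def sample_diagonal_gaussian(mean, stdev, rng):
    """Draw one observation vector from a Gaussian with diagonal covariance.

    With no covariance between dimensions, each dimension is sampled
    independently from its own 1-D Gaussian.
    """
    return [rng.gauss(m, s) for m, s in zip(mean, stdev)]

# An invented 2-D slice of cepstral feature space (made-up parameters).
mean = [1.0, -0.5]
stdev = [0.3, 0.8]
rng = random.Random(0)

# Most samples land near the mean; just occasionally one falls further out.
samples = [sample_diagonal_gaussian(mean, stdev, rng) for _ in range(1000)]
```

If you scatter-plot the samples, they cluster around the mean with a spread governed by the standard deviation in each direction, exactly the contour picture drawn in the video.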
So the core operation that we're going to do with our Gaussian generative model whilst doing speech recognition is to take a feature vector and compute the probability that it was emitted by a particular Gaussian. And that's what the Gaussian equation tells us. It's the equation of the curve. That's the equation that describes probability density in one dimension. We've got the Gaussian distribution. Along comes some data point, and we just read off the value of probability density. It's that simple. When we do that in multiple dimensions with an assumption of no covariance, we'll just do the one-dimensional case separately for every dimension and multiply all the probability densities together to get the probability of the observation vector. And that's an assumption of independence between the features in the vector. So that mechanism of the multivariate Gaussian as a generative model is going to be right in the heart of our HMM that we're about to invent. It solves the problem of the Euclidean distance measure being too naive. How does it solve it? Because it learns the amount of variability in each dimension of the feature vector and captures that as part of the model. The Euclidean distance measure only knows about distance from the mean. It doesn't have a standard deviation parameter. So we cast our minds back to dynamic time warping and just remember how that works. This is the unknown, and that's the time running through the unknown. This is a template that we've stored. We'll assume that the first frame of each always aligns with the other because speech has been carefully end-pointed. There's no leading silence. So we'll align this frame with this frame, and then dynamic time warping just involves finding a path to here that sums up local distances along the way. And those local distances are Euclidean distances here. So the Gaussian is going to solve the problem of the local distance. 
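That core operation, reading off a density per dimension and multiplying, can be sketched directly from the 1-D Gaussian equation. The numbers below are invented, just to show the mechanics:

```python
import math

def gaussian_pdf(x, mean, var):
    """Probability density of a 1-D Gaussian at point x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def diagonal_gaussian_pdf(obs, means, variances):
    """Density of an observation vector under a diagonal-covariance Gaussian.

    Assuming independence between the feature dimensions, the joint
    density is just the product of the per-dimension 1-D densities.
    """
    density = 1.0
    for x, m, v in zip(obs, means, variances):
        density *= gaussian_pdf(x, m, v)
    return density

# Made-up 3-D example: the density is highest when the observation
# sits exactly at the mean, and falls off as it moves away.
means = [0.0, 1.0, -2.0]
variances = [1.0, 0.5, 2.0]
at_mean = diagonal_gaussian_pdf(means, means, variances)
away = diagonal_gaussian_pdf([1.0, 0.0, 0.0], means, variances)
```

In real systems this product is done in the log domain to avoid underflow, but the idea is the same.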
Let's take a local distance in this cell between this vector and this vector, and it's going to capture variability. Storing a single frame of an exemplar doesn't capture variability. That's the main job of the Gaussian. But there's another problem of templates. It doesn't capture variability in duration either. There's a single example, an arbitrarily chosen example of a recording of a digit, say. Maybe this is a recording of me saying three. The duration of that is arbitrary. It doesn't model the variation in duration we might expect. The template might also be a little bit redundant. It might be that all of these frames sound very similar. They're the vowel at the end of the word three, and we don't need to store three frames for that. We could just store one, or even better, we could store the mean and the variance of those frames. So we're going to get rid of the stored exemplar, and we're going to get to choose the temporal resolution of the thing that's replacing it. The thing that's replacing it is going to be not an example, but a model. And so my exemplar had seven frames, but maybe I don't need seven frames to capture the word three. Maybe it's made of just three different parts, three different sounds. So at the same time as getting rid of the exemplar, I'm going to gain control over the temporal resolution, the granularity of my model. And for now, let's choose three. This is a design choice. We're going to get to choose this number here, and we can choose it on any basis we like, but probably we should choose it to give the best accuracy of recognition by experimentation. So where I used to have frames of an exemplar, I'm now going to instead store the average and the variance of those. And I'm going to do that separately for each of the three parts. So for this part here, I'm going to store a multivariate Gaussian in the same dimensions as the feature space, and it's going to store a mean and a variance. 
And for the purpose of this module, I'm not going to worry about where they come from. I'm just going to assume that we will later devise some mechanism, some algorithm for training this model. We're always going to assume for the rest of this module that the model has been pre-trained, it was given to us. So instead of storing an exemplar, here I'm going to store a Gaussian. I'm just going to draw one in one dimension, because I can't draw in this many dimensions. It's going to be a mean and a standard deviation, or we could store the variance. For the middle part, I'm going to store another one. It's going to be potentially a different mean and a different standard deviation. And for the final one, the third one would be at another mean and another standard deviation. So where there used to be a sequence of feature vectors for each of these three parts, there is now a statistical description of what that part of the word sounds like. So this axis is the model, and the model is in three parts, it models the beginning of the word, the middle of the word, and the end of the word. And we'll just leave it like that, a little abstract. Certainly we're not claiming that these are going to be phonemes, and we've already said that it doesn't have to be three, it's a number of our choosing, and we'll worry about how to choose it another time. So the model so far is a sequence of three Gaussian distributions. I've drawn them in one dimension, but in reality, of course, they're multivariate. And all we need now is to put them in some mechanism that makes sure we go through them in the correct order, that we go from the beginning to the middle to the end, and a mechanism that also tells us how long to spend in each of those parts. The duration of the unknown is going to be variable. Every time we do recognition, that might be a different number. 
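Summarising a group of frames as a mean and a variance, as described above, is just fitting a Gaussian to those frames; the full training algorithm comes later, but the per-part statistics themselves are easy to compute. A minimal sketch, with invented frame values:

```python
def fit_gaussian(frames):
    """Estimate the mean and variance of a Gaussian from some data.

    These are the maximum-likelihood estimates: the sample mean, and
    the average squared deviation from that mean, computed per
    dimension because we assume diagonal covariance.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    return means, variances

# Three invented 2-D frames standing in for the similar-sounding frames
# in one part of the word: we store their statistics, not the frames.
frames = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
means, variances = fit_gaussian(frames)
```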
As we'll see when we develop the model, the length of the observation sequence here is generally going to be longer than the number of parts in the model. And so we've got to decide whether this Gaussian generates just this frame, or the first two, or maybe the first three. And then how do we move up to the next Gaussian to start generating the next frames? And when do we move up to the next Gaussian to start generating the remaining frames? So we need some mechanism to decide when to transition from the beginning of a word to a middle of a word, and from the middle of the word to the end of the word. The simplest mechanism that I know to do that is a finite state network. So let's draw a finite state network. We're going to be able to transition between the states, and these transitions say that we have to go from left to right through the model. We don't go backwards. Speech doesn't reverse. And we now need a mechanism that says that you can, from each of these states, generate more than one observation. Remember, these are the observations. And that's just going to be a self-transition on the state. And inside the states are the multivariate Gaussians that I had before, with their means and variances. So we started with a generative model of a single feature vector, a multivariate Gaussian, and we've put it into a sequencing mechanism, a finite state network, that simply tells us in what order to use our various Gaussians. And we have some transitions between those states that controls that, and we have some self-transitions on the state that allow us to emit more than one observation from the same Gaussian. So those transitions are our model of duration, and we just invented the Hidden Markov Model.
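One possible way to write down the model we just invented, a left-to-right finite state network with a diagonal-covariance Gaussian in each state, is sketched below. This is not how any particular toolkit represents HMMs, and every number is made up:

```python
from dataclasses import dataclass

@dataclass
class State:
    """One emitting state: a diagonal-covariance Gaussian."""
    means: list
    variances: list

@dataclass
class LeftToRightHMM:
    """A left-to-right HMM: Gaussian states plus transition probabilities.

    trans[i][j] is the probability of moving from state i to state j.
    The self-transitions trans[i][i] let one state emit several
    observations in a row, which is the model's simple mechanism
    for duration.
    """
    states: list
    trans: list

# A made-up 3-state model: beginning, middle, and end of a word.
model = LeftToRightHMM(
    states=[State([0.0, 0.0], [1.0, 1.0]),
            State([1.0, -1.0], [0.5, 0.5]),
            State([2.0, 0.5], [1.0, 2.0])],
    trans=[[0.6, 0.4, 0.0],   # stay in the beginning, or move to the middle
           [0.0, 0.7, 0.3],   # no backwards transitions: speech doesn't reverse
           [0.0, 0.0, 1.0]],
)
```

Notice that each row of the transition matrix sums to one: the transitions leaving a state form a probability distribution over where to go next.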
A Hidden Markov Model comprises a set of states. In each state is a multivariate Gaussian, and that Gaussian can emit observations. Each time we arrive in a state of the model, we emit an observation, and then we must leave the state. And we must leave along one of a set of transitions that dictates what sequences of states are allowed. The model we just saw was a very simple model with a topology, a shape called left to right. So the transitions, the arcs between the states govern that. And because there are transitions from a state to itself, we can end up back in the same state and thus emit more than one observation from the same state. That's a very simple model of duration. So here's the model we just invented. This one's now got five emitting states. I've drawn a one-dimensional Gaussian in each state, just to show you that there is a Gaussian there, but of course they will be multivariate. So let's have this model emit, or generate, a sequence of observation vectors. These are our observation vectors. For example, our MFCCs. These lines indicate emission, generation, two words for the same thing. And this picture shows a five-state Hidden Markov Model emitting a sequence of observations in a particular way. The first emitting state emits the first observation in the sequence. And then we go around the self-transition and emit the second. And then we go around the self-transition again and emit the third. And then we leave. We arrive in the second emitting state. And then we generate the next observation in the sequence. We go around the self-transition and emit the next one. We go around the self-transition and emit the next one. And then we leave. And then we emit the next observation in the sequence. So every time we arrive in a state we must emit an observation and then leave. 
So this time we leave along the self-transition, and emit another one from that state, and then we leave. We emit this one, we leave. Back in the same state for one more time. Back in the same state again, another observation. And then we leave. We emit an observation. Go around, emit an observation, and we're done. Hopefully it's already immediately obvious that there might have been another way of going through the model to generate this observation sequence. What I just drew there was one particular state sequence. For this particular state sequence, we can compute the probability of this observation sequence given the model. And it's just going to be the product of all the emission probabilities. So of this Gaussian generating the first observation, multiplied by the probability of this Gaussian generating the next observation, multiplied all the way through the sequence. And we could also put some probabilities on the transitions. So as we go around the transition, there might be a probability on that. Let's write one. There's quite a high probability of going down this transition. So we'll multiply by all the transition probabilities. Transition probabilities leaving any one state have to sum up to one because it's a probability distribution over all the subsequent states. So the probability of this model emitting this observation sequence and using this particular state sequence is just the product of all the emission probabilities and all the transition probabilities. And it's that very simple product because the model assumes that the probability of emitting this observation sequence only depends on the state that you emit it from, and that's all. So the probability of this state emitting this observation and then emitting this observation and then emitting the next observation is just computed independently as the product of those three probabilities. The model doesn't know, it doesn't remember what it did in the previous time step. 
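The product described here, for one particular known state sequence, can be sketched directly. The model below uses 1-D Gaussians and made-up parameters purely for illustration:

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, standing in for an emission probability."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def sequence_probability(observations, state_seq, states, trans):
    """P(O, Q | model) for one particular state sequence Q.

    states: list of (mean, var) pairs, one per state (1-D for simplicity).
    trans[i][j]: probability of the transition from state i to state j.
    The Markov assumption makes this a plain product: each emission
    depends only on the current state, and each transition only on
    the previous state. The model has no memory beyond that.
    """
    prob = 1.0
    prev = None
    for obs, q in zip(observations, state_seq):
        if prev is not None:
            prob *= trans[prev][q]
        mean, var = states[q]
        prob *= gaussian_pdf(obs, mean, var)
        prev = q
    return prob

# A made-up 2-state left-to-right model and a short observation sequence.
states = [(0.0, 1.0), (3.0, 1.0)]
trans = [[0.5, 0.5], [0.0, 1.0]]
p = sequence_probability([0.1, 0.2, 2.9], [0, 0, 1], states, trans)
```

Trying a different state sequence on the same observations gives a different probability, which is exactly why the hidden state sequence is going to matter.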
And that lack of memory, that independence of one observation in the sequence from the next given the HMM state is the Markov assumption. That's where Markov in the word Hidden Markov Model comes from. This model has no memory. Now that's a very strong assumption, and it might not be entirely true about this particular observation sequence, but that's the assumption the model makes. Now this picture is the most common way of drawing a picture of a Hidden Markov Model. It's the one you'll see in textbooks, but it's actually potentially really confusing. That's a model and that's an observation sequence. Quite clearly, this is time in frames, for example, 10 milliseconds apart. That much is clear. This is not time. This particular model happens to have a topology, a shape, in other words, a set of allowable transitions that is left to right, so we're clearly going to move through the model in that order, but this is a perfectly valid model. This is a completely valid Hidden Markov Model. Now it might not be particularly useful for doing speech recognition with, but it's a valid model, and so any algorithms that we develop must work for this model as well as for this model. So this picture is potentially extremely confusing because the model on the top and the sequence of observations on the bottom are not both along the same time axis. Only the observations have a time axis. The model is just a finite state model. We could draw the model on its side, upside down, back to front. It would be the same model. The layout on the page has no meaning. We should draw a picture that's much clearer. The model is a model. It always exists. It exists now, exists in the past, and exists in the future, and it exists every time step of the observation sequence. So let's draw a picture that makes that completely explicit. I'm going to draw a copy of the model at every time step. This is time in frames. So we're stepping through that one frame at a time. 
So I'm just going to draw out a copy of the model. I'm going to duplicate it or unroll it over time. I'll just draw the states. That's the model at time one, at time two, at time three, and so on. I'll draw a copy of this model at every time frame. This is just a statement that the model always exists. It's always there. For this particular model, this is the starting state. And so the first observation in my observation sequence is always going to have to be emitted by that state because that's where we are at time one. And so that says we must be here. And we can't be in any of these other states because we haven't had time to go down the transitions to get there. I'm going to also assume that we must finish in this state. So I'm going to define that as the end state of the model. When the model has generated an observation sequence, it must always end up in that state and no other state. And that means that that state must always generate the last observation in the sequence. That's a statement that we must be here and not here. And so we've got to get from here to here. We've got to get through the model following the allowable transitions as we go through the observation sequence, and the model generates the observation sequence while we do that. This is starting to look quite like dynamic time warping, and that's deliberate. So our generative model of sequences is the Hidden Markov Model, which is made of a finite state machine and those states contain Gaussians. And now we need to deal with a problem of this model. And it's the same problem we had in dynamic time warping, that the alignment between the model and the observation sequence is not known to us. Here's the potentially confusing version of the picture again. We've got the model on the top, we've got the observation sequence along the bottom. 
I showed you earlier one particular state sequence, one particular way of going through the model and generating this observation sequence, but there are many, many more ways of doing that. Just like there are many, many possible alignments between a template and an unknown in dynamic time warping. So we can draw an alignment on this picture. Let's just make one up. And if I number these states, let's just number them this way, 1, 2, 3, 4, and 5. Now I've drawn a state sequence that is 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5. But you can go away and work out all the other state sequences that are possible. They have to start with 1, and they have to end with 5, but there are many, many permutations there. Remember what we're doing when we're using a Hidden Markov Model to do pattern recognition. We're assuming that one of our Hidden Markov Models that we've stored did generate this observation sequence, and our only task is to work out which one can do it with the highest probability, which was the most likely model to have generated it. It will be one of the models, but we don't know what state sequence that model used to generate this observation sequence. And so we say that the state sequence of the Hidden Markov Model is hidden from us. And that's the definition of hidden in Hidden Markov Model. It's that many possible state sequences could have been used to generate the observation sequence, and we don't know which one it was. On the left is what I really want to compute. I want to compute the probability of an observation sequence given a particular model, and I'm going to repeat that for all the models I've got, and pick the model that gives me the highest value, and use that to label the observation sequence O. On the right is what we've already seen how to compute. We can compute the probability of an observation sequence, and taking a particular state sequence Q. So Q is a random variable, and it says state sequence. 
For example, Q might take the value 1, 1, 1, 2, 2, 2, 3, 4, 5. And if we know the state sequence, we can compute the probability of the observation sequence in that state sequence. So we can compute this joint probability here. So how can we get P of O given the model, from this joint probability of O and Q, also conditioned on this particular model? Well, there's an easy way to do that, and we just sum over all the possible values of Q. That's a simple rule of probability. This operation is called marginalising, or summing away Q. Now I just told you that there might be a lot of possible values of Q. Q is the state sequence. And the longer the model gets, and the longer the observation sequence gets, the more values of Q are possible. It could be a very, very large number. So this summation could be a very large summation. It could be computationally expensive to do this summation. But this is the correct thing to do for the Hidden Markov Model. If we really want to compute the probability of an observation sequence given that model, we must sum over all possible state sequences the probability of the observation in that state sequence. So we must sum it away, marginalise it out. Because that computation is potentially very expensive, we very often make an approximation to it. And we say that it's approximately equal to finding the best Q, the best single value of Q. So we just take the maximum over all possible values of Q of this thing here. So pick one state sequence, the one that gives you the highest probability, and assume that that's so much better than all the rest that that is a good enough approximation to the sum. That turns out to be empirically a very reasonable thing to do. This works well in practice: we could do the summation, or we could replace it with this maximisation operation, and our speech recogniser will work just as well. It'll give just as low a word error rate. How do you find the Q that maximises this probability? 
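On a toy model we can afford to enumerate every possible state sequence Q and see both computations side by side: marginalising (the sum over Q) and the approximation (the max over Q). All the numbers here are invented, and this brute-force enumeration is exactly what the Viterbi algorithm will let us avoid:

```python
import itertools
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, used as the emission probability."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def joint_probability(obs, q, states, trans):
    """P(O, Q | model): product of emission and transition probabilities."""
    prob = gaussian_pdf(obs[0], *states[q[0]])
    for t in range(1, len(obs)):
        prob *= trans[q[t - 1]][q[t]] * gaussian_pdf(obs[t], *states[q[t]])
    return prob

# A tiny made-up 2-state left-to-right model.
states = [(0.0, 1.0), (3.0, 1.0)]
trans = [[0.5, 0.5], [0.0, 1.0]]
obs = [0.1, 1.5, 2.9, 3.1]

# Enumerate every state sequence that starts in the first state, ends in
# the last state, and has non-zero probability under the topology.
valid = [q for q in itertools.product([0, 1], repeat=len(obs))
         if q[0] == 0 and q[-1] == 1
         and joint_probability(obs, q, states, trans) > 0]

total = sum(joint_probability(obs, q, states, trans) for q in valid)  # marginalising Q
best = max(joint_probability(obs, q, states, trans) for q in valid)   # the approximation
```

Even on this toy example the single best state sequence accounts for most of the total, which is why replacing the sum with a max works so well in practice.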
We've got to search amongst all the values of Q. But we don't want to enumerate them out because then we might as well just do the summation. We want a very efficient algorithm for exploring all the different values of Q, the state sequence, and minimising the amount of computation needed to find the best state sequence. Well, that sounds awfully familiar. So we defined the Hidden Markov Model, and we said it has this property, state sequence, that's the journey we need to take through the model to emit an observation sequence. And we don't know what state sequence our model took when it generated a particular observation sequence. And so the correct thing to do is to treat the state sequence as a random variable, and to marginalise it away, to sum over it. The summation could be very expensive, so we'll make an approximation, we'll take a shortcut, and we'll just say, just find the most likely single state sequence, and compute the probability of that state sequence generating the observations, and we'll take that as the probability that this model generated this observation sequence. So we need a data structure on which to do that computation, that computation of finding a single most likely state sequence. Whenever I say state sequence, you could also hear alignment between the model and the data. And then you could think, that's just the same as dynamic time warping. We're going to now create a data structure for doing that computation, and it's going to have to generalise to any shape of Hidden Markov Model, any topology, any pattern of transitions of connections between the states. It mustn't be restricted to the case of very simple models that move from left to right, although for speech, almost all our models will be like that. We don't want an algorithm that only works in the restricted case, we want a general algorithm, and we want a data structure on which we can do that algorithm. So that's what we're going to invent now. 
So we'll start again with dynamic time warping. In dynamic time warping, we had an algorithm, called dynamic programming, and to store the computations whilst doing the algorithm, we needed a data structure, and we made a data structure called the grid. That's the grid, and each cell in the grid stores the partial path distance from the beginning to this point that's the best so far. We saw that the core computation in dynamic time warping was to find all the different ways of arriving in a cell on the grid, and to only keep the best one. So the core operation here was a min over distances, to take the minimum. Well, minimum over distances and maximum over probabilities, that's the same thing. Just think of a distance as behaving like a negative log probability: minimising one is the same as maximising the other. And so this suggests the sort of data structure that we might use for a Hidden Markov Model might look something like the grid. The grid is a bit too restrictive, because it requires us to go left to right through this sequence, and left to right through this sequence. That's fine for a template and an unknown, but we're going to replace this with a finite state network that could have any topology. It might not necessarily go left to right. So the data structure we're going to use is called a lattice. So on the right is a lattice for this Hidden Markov Model here with three emitting states. I'm now going to start drawing my Hidden Markov Models in a way that matches the way that HTK implements them. It's going to turn out to be extremely convenient later when we need to join models to other models. So I'm going to add non-emitting start and end states, and I'm going to start numbering that as number one. The first emitting state is number two, and so on through the model. We're going to switch now to HTK style models. This is an implementational detail, but it's going to make something later much easier. 
So I must start in my non-emitting state in the lattice, and I must end in my final non-emitting state. So they're the definitions of where we must start and end. In dynamic time warping, we had a rule, a rule that we had to invent about how you can move through the grid. We said that the rule was this one. Other rules are available. If you go and read the old literature, or rather old textbooks, you'll find that there are other possible patterns of moving through a grid, but this is the one we stuck with, the simplest. The counterpart of that in the Hidden Markov Model is that the model topology (the connections, and the lack of connections, between certain pairs of states) tells us how we can move through the lattice. So this model says that from the non-emitting start state, you must always go to the first emitting state. That's that one there. When you arrive there, you must emit the first observation. That's done. And then time advances one step. We must take a transition at every time step, and then you arrive at a state and generate an observation. The topology of this model says that you could stay in the same state when time moves forward, or you could move to the next state when time moves forward. So you could stay in the same state, or you could move to the next state. So it's impossible to get to this state to emit the first observation, or to this one. So these will never be visited in the lattice. And then we generate this observation. Now we could generate it from this state or this state. So we do both and we store both computations separately. Those are two partial paths through the lattice. We don't know which one is best yet, they're in different states, so we store them both. We have two parallel paths. So we're doing computation in parallel. We're exploring all possible alignments between model and observation sequence. And this says that there's two different paths alive at the moment, and we can't get there.
And we've generated this observation sequence, first one and the second one. And then we move forward in time again. The topology of the model says that from this state, you can stay in the same state or move forward. And if you're still here, then you can move forward or stay in the same state. And so on. So we just use the topology of the model to fill up the lattice. Now we have to end up here. So these paths that leave the model too early are no good. That's equivalent to coming out of the end of the model, but not having generated all of the observations in the sequence. So you haven't accounted for them all. So that's not a valid path through the model. These ones we already said you can't even get there. And we can see that this one is a dead end. There's no way to get to the end in time. And so is this one. And so the lattice is this picture here. And this lattice is the exact counterpart to the grid in dynamic time warping. And we would do computation on it in exactly the same way.
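Filling up the lattice from the topology can be sketched as two passes: a forward pass marking which states are reachable at each time, and a backward pass marking which states can still reach the exit in time. Only cells in both are valid lattice cells. This sketch assumes a strictly left-to-right model with three emitting states, HTK-style numbering (states 1 and 5 non-emitting), and five observations, matching the example being described:

```python
# Allowed transitions of the (assumed) left-to-right HTK-style model:
# state 1 is the non-emitting entry, 5 the non-emitting exit.
succ = {2: {2, 3}, 3: {3, 4}, 4: {4, 5}}
T = 5                                  # number of observations
emitting = [2, 3, 4]

# Forward pass: which states can emit observation t at all?
forward = [{2}]                        # state 2 must emit the first observation
for t in range(1, T):
    forward.append({s for p in forward[-1] for s in succ[p] if s in emitting})

# Backward pass: from which states can we still reach the exit in time?
backward = [set() for _ in range(T)]
backward[-1] = {s for s in emitting if 5 in succ[s]}   # must exit after the last observation
for t in range(T - 2, -1, -1):
    backward[t] = {s for s in emitting if succ[s] & backward[t + 1]}

# Valid lattice cells: reachable from the start AND able to finish in time.
lattice = [forward[t] & backward[t] for t in range(T)]
```

The result reproduces the picture in the video: only state 2 can emit the first observation, the unreachable cells at the start and the dead ends near the finish are excluded, and only state 4 can emit the last observation before leaving the model.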
Whilst the video is playing, click on a line in the transcript to play the video from that point. So we've seen that there's a transition of ideas from dynamic time warping, where the data structure was a simple grid, to this new data structure for the Hidden Markov Model called the lattice, where the allowable transitions between the states in the lattice are dictated by the topology of the model, not by some simple external rule. And so the lattice is a general data structure that would work for any shape, any topology of HMM. And on that lattice, on that data structure, we can perform the algorithm of dynamic programming. And when we apply that to the Hidden Markov Model, it gets the special name of the Viterbi algorithm. So we're done with dynamic time warping. We learned it because it's a nice way to understand dynamic programming. We've had this transition now to the Hidden Markov Model. And a good way to think about the Hidden Markov Model is it's a generalization of dynamic time warping with templates. The template, the stored exemplar, was a bit arbitrary. It was just one recording of someone saying this digit, for example, and it didn't capture variability either acoustically, around the average way of saying it, or indeed in terms of duration. It's just one particular way of saying it. And so we've generalized, we've replaced that with a model. In general, the model will have fewer states than there were frames in the template. It's a generalization. Each of the states contains a multivariate Gaussian that expresses a statistical description of what that part of the word sounds like. So for example, it replaces these three exemplar frames with their mean and variance. And the model has a transition structure, transitions between the states that does the job of going through the model in the right order. So for most models in speech, that will be left to right, and of how long to spend in each part of the model. And that's a model of duration. 
It's a very simple model of duration, but it is a model of duration. So it's very important that any algorithm developed for the Hidden Markov Model doesn't rest on this left to right topology, even if it is the most common. I might occasionally want to use a model that has this transition. For example, maybe I'll have a model of silence that can just go round and round, generating as much silence as I need at beginnings and ends of utterances. And I might put that transition in the model for that purpose. The lattice still works. Let's just draw the lattice for this particular topology that has this transition going from the third emitting state back to the first one, or in HTK language, from the fourth state to the second state, because the non-emitting one gets numbered one. So for this model, let's draw the lattice. Well, the lattice topology is just dictated by the model. So from this state, we can go here or here. There should be arrows on all of these. From this state, same thing. And from this state, we can now go to the non-emitting final state or self-transition or go back to here. We can go here or leave. That would be leaving too early. Or go back to the start. There, there, back to the start. There, there, back to there. There, there, back to there. This lattice is a little bit more complicated, but it's still just a set of states joined by transitions. And any algorithm that we can perform on the simpler lattice, we can perform on this one. And what we're going to see now is that we can do the computations on that lattice. That's a perfectly reasonable way to do it on this data structure. Or we can find actually an alternative implementation which uses the HMM itself as the data structure. So actually there are two ways at least to implement the Viterbi algorithm for HMMs. One is to make the lattice. And to perform dynamic programming on the lattice. You start here, you go to the first emitting state, you emit an observation.
Then you either go here or here and you emit the next observation in sequence. Now each state gets to emit that observation. You will in general emit it with a different probability and update the probability of the partial path. So there'll be two partial paths alive at this point and they'll compete later when they meet. But for now we must maintain both of them. So we're doing a parallel search and then we'll just propagate forward through the model like that. That's one way of doing it. But there's another way of doing it directly on the HMM structure. So I'm going to make a connection for you now between the way of doing it on the lattice to the left. That'll be how most textbooks describe the algorithm. And an algorithm which is going to use the HMM itself on the right as the data structure. It's not going to make the lattice. It's never going to draw this lattice out. It's going to use the HMM as the data structure. And it's going to be equivalent and that algorithm is called token passing. Along comes a sequence of observations and we're going to emit them in sequence. So there's the first one. On the left, we'll have made a transition to this state and we'll emit it and compute the probability of doing that. On the right, we'll have put a token here. We'll have sent the token down an arc. It will arrive here and we'll use that Gaussian to emit that observation. We'll update the token's probability with that probability. And then comes in the next observation in sequence. We'll go forward here and compute that. Go forward here and emit it again from a different state. So there are two different state sequences possible now. This is the state sequence two, two. This is the state sequence two, three. I'm using HTK numbering. The counterpart over here would be that one copy of the token goes around the self transition and ends up back here. Another copy of the token goes to the next state and ends up here. This one emits the observation and so does this one.
And they update their probabilities and we have two tokens alive in the model at once. This token is the same as this point in the lattice. This token is the same as this point in the lattice. And then we just turn the handle on the algorithm, repeating those steps over and over again. So let's examine something in the middle of the algorithm. At some point, we're generating this observation and imagine we were going to generate it from here, from this emitting state. Two paths will have arrived from the previous time step. We'll have done the max operation to keep only the most probable. And then we'll generate this observation from the Gaussian of this state and update that probability there. The counterpart of that will be over on this side. We're arriving in this state. One token will have come from this direction. One token will be a self transitioning token that was already in that state at the previous time step. We'll do the max to keep just the most likely of those two tokens and discard the other one. And then we'll emit this observation and update the token's probability. So hopefully you can see that the thing on the left and the thing on the right are doing the same computation. On the left, we need a data structure, which is a lattice. And the size of the lattice is going to be governed by the length of the observation sequence and the number of states in the model. Now that doesn't seem very big here because our eventual observation sequence has just got five observations and our model only has three emitting states. So our lattice only has 15 states in it. But in general, our models might be much, much bigger than that. And our observation sequences might be considerably longer, tens or hundreds of frames long. And so this lattice could get quite large and use up a lot of memory. So the implementation on the right could use a lot less memory because it only uses the HMM itself as the data structure.
So the size of its data structure is just the size of the HMM. How does the algorithm on the right get to use so little memory? Well, essentially it's just computing this and then this, and then that can be discarded. And then this, that can be discarded. And then this, that can be discarded. And then this, and that can be discarded. So token passing is actually just a very particular form of search on the lattice. It's called time synchronous. It computes in strict order of time, and it only needs as much memory as the HMM itself, which exists anyway. So this one is token passing. The one on the left is the search on a lattice. And we could imagine in this lattice that we might not want to strictly explore it in this order. That's only one way of doing it. We already saw in dynamic time warping there's other ways of doing it. We could explore it in more interesting orders, for example, and that would not be time synchronous. We'd have moved through time before we'd finished going through all the model states. But for the purposes of this course we're only really bothered about time synchronous search. And so token passing is a perfectly good way to understand the Viterbi algorithm for HMMs. So you know the HMM now. It's a finite state network and inside each state is a Gaussian acting as a generative model. It can generate a single observation and then we must leave the state, either going around a self transition or to another state. It doesn't matter. The model topology tells us what's allowed. And every time we arrive in a state we use its Gaussian to emit the next observation in sequence. We've only seen that for a single Hidden Markov Model. But I've been saying all the time that the way that we do classification is by having one Hidden Markov Model for each of the classes that we want to recognise and we'll compute the probability of each of them generating the observation sequence and pick the highest.
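Token passing can be sketched in a few lines. This is a simplified, hypothetical implementation, not HTK's actual code (real tokens also carry traceback information for recovering the winning state sequence): one token per emitting state, a max over arriving tokens at each time step, and the previous time step's tokens discarded immediately, which is exactly where the memory saving comes from. All the model numbers are made up:

```python
import math

def token_passing(obs_logp, trans_logp, T):
    """Time-synchronous Viterbi by token passing (simplified sketch).
    obs_logp[s][t] is the log emission probability of observation t in
    state s; trans_logp maps (from_state, to_state) to a log transition
    probability.  Assumes the first state listed is the entry state."""
    NEG_INF = float("-inf")
    states = list(obs_logp)
    # One token per emitting state; only the entry state starts alive.
    tokens = {s: NEG_INF for s in states}
    entry = states[0]
    tokens[entry] = obs_logp[entry][0]      # entry state emits observation 0
    for t in range(1, T):
        new = {}
        for s in states:
            # Max over all tokens that could arrive here (the Viterbi max),
            # then emit observation t and update the winning token.
            best = max(tokens[p] + trans_logp.get((p, s), NEG_INF) for p in states)
            new[s] = best + obs_logp[s][t]
        tokens = new                        # previous time step is discarded
    return tokens

# Toy 2-state model (made-up numbers); the winning token's log
# probability is the log probability of the best single state sequence.
lp = math.log
obs_logp = {"s1": [lp(0.5), lp(0.3), lp(0.1)], "s2": [lp(0.1), lp(0.4), lp(0.6)]}
trans_logp = {("s1", "s1"): lp(0.6), ("s1", "s2"): lp(0.4), ("s2", "s2"): 0.0}
tokens = token_passing(obs_logp, trans_logp, T=3)
```

Working in log probabilities turns the products along a path into sums, which avoids numerical underflow on long observation sequences; only the tokens for the current time step are ever held in memory.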
So what's coming next is actually a way of connecting all the models together to make a single Hidden Markov Model not necessarily of isolated words but of a sentence made of words. So connected speech. And that extension, because we've used finite state models, will turn out to be really quite easy. All we need is something that tells us how to join models of words together to make models of sentences. And then we can generalise further and say let's model subword units. For example, let's model phonemes. And then we need something else that tells us how to join phonemes together to make words. So the thing that tells us how to join words together to make sentences is called a language model. And as long as it's finite state, as long as we restrict ourselves to models of language that are finite state, there'll be no difficulty in combining it with the HMMs to make a single large finite state model. So this part, connected speech, is going to be easy. The elephant in the room, the thing we haven't talked about throughout is how to actually learn the model from data. Now you might think it would have been sensible to do that at the beginning and then to use the model to do recognition because of course that's what we do when we build the system. But that's the hard way to learn it. It's much easier to understand what the model's doing when it's emitting an observation sequence, when it's computing the probability during recognition. And once we've understood all of that, we can use that knowledge to understand the hardest part of the course overall. And that's how to learn the model from data. We call that training. Now computationally, it's going to be straightforward. It's going to involve very much the sort of computations we just saw there on the lattice. That's an attractive property of the Hidden Markov Model that it's computationally straightforward to train, to learn from data.
But conceptually, you're going to find it perhaps the most challenging part of the course, which is why we left it to last till we're comfortable about HMMs and about how to do recognition with them. If you've got a good memory from the beginning of this module, you'll see that there's something missing from this picture. We haven't done it yet. And that was fitting a Gaussian to data. Well, we've mentioned that several times so far. So let's just firstly have a quick recap and see how that helps us in learning a Hidden Markov Model from data. That's the equation for the Gaussian. Whenever you see this equation, you'll zoom in on this part and then zoom in on this part and see first a Euclidean distance and then a Euclidean distance scaled by variance, a scaled Euclidean distance. And then the rest of the equation just makes it into a nice bell-shaped curve with an area of one. This course isn't about proofs. So we're not going to do a proof of how to estimate the parameters of the Gaussian. We're going to state them and then say that they look quite reasonable. So what are the parameters? Mean and variance. Mu and sigma squared. This is a univariate Gaussian, but that's all we need because we're assuming no covariance everywhere. So how do you estimate the mean and the variance of a Gaussian from some data? Well, first of all, you need some data. So let me write down some data points. Give them some notation. X1, X2, X3. They aren't necessarily in any sequence. They're just data points. There's no idea of ordering. So the Gaussian has no model of sequences; the data points that train it are just some big bag of data points. Let's have N of them, N data points. And these data points are labelled data. They're labelled as having come from this Gaussian. They're labelled with the class that this Gaussian is modelling. The requirement for labelled data to estimate the model parameters is normal in machine learning. And we say that the machine learning is then supervised.
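Written out, the equation being described is the univariate Gaussian density:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```

Zooming in as the video does: \((x-\mu)^2\) is the squared Euclidean distance between a data point and the mean, dividing by \(\sigma^2\) scales that distance by the variance, and the factor \(1/\sqrt{2\pi\sigma^2}\) is just the normaliser that gives the bell-shaped curve an area of one.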
And that's the only sort of machine learning we're covering in this course, where the data are labelled with the class that we're modelling. So I'm going to state without proof that the best possible estimate of the mean of the Gaussian is the mean of the data. So I'm going to write down that my best estimate, so that's what the little hat means, my best estimate of the mean of the Gaussian is the empirical mean, the mean found through experimentation, found from data. And that's just going to be, add up all the data points and divide by how many there were. This estimate of the mean is the one that makes this data the most probable it can be given the Gaussian. So we can say that this estimate is a maximum likelihood estimate of the mean. You can see that intuitively. By setting mu to be the mean of all the x's, we'll make on average this term as small as possible, and that will make the probability as large as possible. I'm also going to state without proof that the best estimate of the variance is going to be the variance of the data. Now the variance of the data is the average squared distance of the data from the mean or the average squared deviation. And we can say that intuitively because the variance is scaling the squared distance between data and mean. So I'm going to write that my empirical estimate of variance, variance hat, is going to be the squared distance between data and mean, summed up over all the data and divided by how many data points there were. It's the average squared distance from the mean. I'm going to state those without proof as being the most reasonable estimates, and they are. We could prove that these are the estimates that make this data have the highest possible probability, maximum likelihood, given this Gaussian. That's fine. These are lovely, simple, easy formulae. They work in one step. Take the data, average it, you've got the mean.
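Those two estimates are simple enough to write down directly. A minimal sketch (the data points are made up, and assumed to be labelled as coming from this one Gaussian):

```python
def fit_gaussian(data):
    """Maximum-likelihood estimates of a univariate Gaussian's parameters:
    the empirical mean, and the average squared deviation from that mean."""
    n = len(data)
    mean = sum(data) / n                               # mu hat
    var = sum((x - mean) ** 2 for x in data) / n       # sigma squared hat
    return mean, var

# One step, no iteration: average the data, then average the squared
# distances from that average.
mu_hat, var_hat = fit_gaussian([2.0, 4.0, 6.0])
```

Note the one-step, closed-form nature of this: no search, no iteration, which is exactly the simplicity that the hidden state sequence of the HMM will take away from us.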
With that estimated mean, take the data again, work out the average squared distance from the mean, and that's your variance. But that required this data to be known to come from this one Gaussian. Unfortunately, that's not the case for the Hidden Markov Model. Imagine I am fitting this Hidden Markov Model to this one observation sequence. I've labelled this sequence as having come from the model, but I am not able to label which frame in the sequence comes from which state of the model because the state sequence is hidden. I don't know that. And I am trying to estimate the mean and variance of a Gaussian here based on the data, and I don't know exactly how the model aligns with the data. That's a problem to be solved in the next module, where we'll see an algorithm for training the HMM given the data. We've seen that if we could find an alignment between the states and the observations, we could just apply those simple formulae for estimating the mean and the variance. But the fact that the state sequence is hidden makes finding this alignment non-trivial. So that's to be solved in the next module.