So we've seen that there's a transition of ideas from dynamic time warping, where the data structure was a simple grid, to this new data structure for the Hidden Markov Model called the lattice, where the allowable transitions between the states in the lattice are dictated by the topology of the model, not by some simple external rule. And so the lattice is a general data structure that would work for any shape, any topology of HMM. And on that lattice, on that data structure, we can perform the algorithm of dynamic programming. And when we apply that to the Hidden Markov Model, it gets the special name of the Viterbi algorithm. So we're done with dynamic time warping. We learned it because it's a nice way to understand dynamic programming. We've had this transition now to the Hidden Markov Model. And a good way to think about the Hidden Markov Model is as a generalization of dynamic time warping with templates. The template, the stored exemplar, was a bit arbitrary. It was just one recording of someone saying this digit, for example, and it didn't capture variability, either acoustically, around the average way of saying it, or indeed in terms of duration. It was just one particular way of saying it. And so we've generalized: we've replaced that with a model. In general, the model will have fewer states than there were frames in the template. It's a generalization. Each of the states contains a multivariate Gaussian that expresses a statistical description of what that part of the word sounds like. So, for example, it replaces these three exemplar frames with their mean and variance. And the model has a transition structure, transitions between the states, that does the job of going through the model in the right order (for most models in speech, that will be left to right) and of how long to spend in each part of the model. And that's a model of duration.
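A rough sketch of that idea in Python (the frames and numbers here are made up for illustration, not taken from the lecture): three exemplar frames are replaced by a per-dimension mean and variance, which then act as one state's diagonal-covariance Gaussian.

```python
import math

# Hypothetical example: three 2-dimensional exemplar frames from a
# template, summarised by one HMM state's mean and variance.
frames = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
n, dim = len(frames), len(frames[0])

mean = [sum(f[d] for f in frames) / n for d in range(dim)]
var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]

def state_loglik(x):
    """Log-likelihood of one observation frame under this state's
    Gaussian, assuming no covariance between dimensions."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

print(state_loglik([1.0, 2.0]))
```

The state no longer stores the three frames themselves, only their statistics, so any new frame can be scored against it, highest of all at the mean.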
It's a very simple model of duration, but it is a model of duration. So it's very important that any algorithm developed for the Hidden Markov Model doesn't rest on this left-to-right topology, even if it is the most common. I might occasionally want to use a model that has this transition. For example, maybe I'll have a model of silence that can just go round and round, generating as much silence as I need at the beginnings and ends of utterances. And I might put that transition in the model for that purpose. The lattice still works. Let's just draw the lattice for this particular topology that has this transition going from the third emitting state back to the first one, or in HTK language, from the fourth state to the second state, because the non-emitting one gets numbered one. So for this model, let's draw the lattice. Well, the lattice topology is just dictated by the model. So from this state, we can go here or here. There should be arrows on all of these. From this state, same thing. And from this state, we can now go to the non-emitting final state, or self-transition, or go back to here. We can go here or leave. That would be leaving too early. Or go back to the start. There, there, back to the start. There, there, back to there. There, there, back to there. This lattice is a little bit more complicated, but it's still just a set of states joined by transitions. And any algorithm that we can perform on the simpler lattice, we can perform on this one. And what we're going to see now is that we can do the computations on that lattice (that's a perfectly reasonable way to do it, on this data structure), or we can find an alternative implementation which uses the HMM itself as the data structure. So actually there are at least two ways to implement the Viterbi algorithm for HMMs. One is to make the lattice and to perform dynamic programming on the lattice. You start here, you go to the first emitting state, you emit an observation.
Then you either go here or here, and you emit the next observation in sequence. Now each state gets to emit that observation. Each will in general emit it with a different probability and update the probability of the partial path. So there'll be two partial paths alive at this point, and they'll compete later when they meet. But for now we must maintain both of them. So we're doing a parallel search, and then we'll just propagate forward through the model like that. That's one way of doing it. But there's another way of doing it directly on the HMM structure. So I'm going to make a connection for you now between the way of doing it on the lattice to the left (that'll be how most textbooks describe the algorithm) and an algorithm which is going to use the HMM itself, on the right, as the data structure. It's not going to make the lattice. It's never going to draw this lattice out. It's going to use the HMM as the data structure. And it's going to be equivalent, and that algorithm is called token passing. Along comes a sequence of observations, and we're going to emit them in sequence. So there's the first one. On the left, we'll have made a transition to this state, and we'll emit it and compute the probability of doing that. On the right, we'll have put a token here. We'll have sent the token down an arc. It will arrive here, and we'll use that Gaussian to emit that observation. We'll update the token's probability with that probability. And then comes the next observation in sequence. We'll go forward here and compute that. Go forward here and emit it again from a different state. So there are two different state sequences possible now. This is the state sequence two, two. This is the state sequence two, three. I'm using HTK numbering. The counterpart over here would be that one copy of the token goes around the self-transition and ends up back here. Another copy of the token goes to the next state and ends up here. This one emits the observation, and so does this one.
And they update their probabilities, and we have two tokens alive in the model at once. This token is the same as this point in the lattice. This token is the same as this point in the lattice. And then we just turn the handle on the algorithm, repeating those steps over and over again. So let's examine something in the middle of the algorithm. At some point, we're generating this observation, and imagine we were going to generate it from here, from this emitting state. Two paths will have arrived from the previous time step. We'll have done the max operation to keep only the most probable. And then we'll generate this observation from the Gaussian of this state and update that probability there. The counterpart of that will be over on this side. We're arriving in this state. One token will have come from this direction. One token will be a self-transitioning token that was already in that state at the previous time step. We'll do the max to keep just the most likely of those two tokens and discard the other one. And then we'll emit this observation and update the token's probability. So hopefully you can see that the thing on the left and the thing on the right are doing the same computation. On the left, we need a data structure, which is a lattice. And the size of the lattice is going to be governed by the length of the observation sequence and the number of states in the model. Now, that doesn't seem very big here, because our eventual observation sequence has just got five observations and our model only has three emitting states, so our lattice only has 15 states in it. But in general, our models might be much, much bigger than that. And our observation sequences might be considerably longer: tens or hundreds of frames long. And so this lattice could get quite large and use up a lot of memory. So the implementation on the right could use a lot less memory, because it only uses the HMM itself as the data structure.
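To make that size argument concrete, here is a hypothetical sketch of how the lattice's nodes and arcs fall straight out of the model's topology; the transition table below is invented, but it includes the loop-back arc from the third emitting state back to the first described earlier.

```python
# Illustrative only: emitting states 0..2 (HTK's states 2..4).
# Each entry lists where a state is allowed to go next, including
# the loop-back from state 2 to state 0.
trans = {
    0: [0, 1],   # self-loop or forward
    1: [1, 2],   # self-loop or forward
    2: [2, 0],   # self-loop, or back to the start (exit arc not shown)
}
T = 5  # five observation frames, as in the example above

# A lattice node is a (time, state) pair; its arcs come only from
# the model's topology, not from any external rule.
nodes = [(t, s) for t in range(T) for s in trans]
arcs = [((t, i), (t + 1, j))
        for t in range(T - 1) for i in trans for j in trans[i]]

print(len(nodes))  # 15 nodes: 5 frames x 3 emitting states
print(len(arcs))
```

The node count grows as frames times states, which is exactly why the lattice can get large for long utterances and big models.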
So the size of its data structure is just the size of the HMM. How does the algorithm on the right get to use so little memory? Well, essentially it's just computing this, and then this, and then that can be discarded. And then this, and that can be discarded. And then this, and that can be discarded. And then this, and that can be discarded. So token passing is actually just a very particular form of search on the lattice. It's called time-synchronous. It computes in strict order of time, and it only needs as much memory as the HMM, which exists anyway. So this one is token passing. The one on the left is the search on a lattice. And we could imagine, in this lattice, that we might not want to strictly explore it in this order. That's only one way of doing it. We already saw in dynamic time warping that there are other ways of doing it. We could explore it in more interesting orders, for example, and that would not be time-synchronous: we'd have moved through time before we'd finished going through all the model states. But for the purposes of this course, we're only really bothered about time-synchronous search. And so token passing is a perfectly good way to understand the Viterbi algorithm for HMMs. So you know the HMM now. It's a finite state network, and inside each state is a Gaussian acting as a generative model. It can generate a single observation, and then we must leave the state, either going around a self-transition or to another state. It doesn't matter: the model topology tells us what's allowed. And every time we arrive in a state, we use its Gaussian to emit the next observation in sequence. We've only seen that for a single Hidden Markov Model. But I've been saying all the time that the way that we do classification is by having one Hidden Markov Model for each of the classes that we want to recognise, and we'll compute the probability of each of them generating the observation sequence and pick the highest.
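A minimal sketch of token passing, assuming a toy discrete-emission HMM rather than Gaussians (all the names and numbers here are invented): one token per state, copies sent down every allowed arc, a max at each state to keep only the best arrival, then an emission update, repeated frame by frame.

```python
import math

def lg(p):
    """Safe log: impossible transitions get log probability -inf."""
    return math.log(p) if p > 0 else float("-inf")

def token_passing(obs, log_init, log_trans, emit_loglik):
    """Time-synchronous Viterbi using the HMM itself as the data
    structure: one token (best partial-path log probability) per state,
    overwritten every frame, so memory is O(states), not O(frames x states)."""
    n = len(log_init)
    # Put a token in each state and emit the first observation.
    tokens = [log_init[j] + emit_loglik(j, obs[0]) for j in range(n)]
    for o in obs[1:]:
        # Copies of each token travel down every arc; the max keeps only
        # the most probable arrival, then the state emits the observation.
        tokens = [max(tokens[i] + log_trans[i][j] for i in range(n))
                  + emit_loglik(j, o)
                  for j in range(n)]
        # The previous time step's tokens are now discarded.
    return max(tokens)  # log probability of the most likely state sequence

# Toy two-state example with discrete emissions (numbers are made up).
log_init = [lg(1.0), lg(0.0)]
log_trans = [[lg(0.7), lg(0.3)],
             [lg(0.0), lg(1.0)]]
emit = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
emit_loglik = lambda j, o: lg(emit[j][o])

print(math.exp(token_passing("aab", log_init, log_trans, emit_loglik)))
```

Replacing `emit_loglik` with a per-state Gaussian log-likelihood gives the HMM of the lecture; the control flow is unchanged.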
So what's coming next is actually a way of connecting all the models together to make a single Hidden Markov Model, not necessarily of isolated words, but of a sentence made of words: so, connected speech. And that extension, because we've used finite state models, will turn out to be really quite easy. All we need is something that tells us how to join models of words together to make models of sentences. And then we can generalise further and say: let's model subword units. For example, let's model phonemes. And then we need something else that tells us how to join phonemes together to make words. So the thing that tells us how to join words together to make sentences is called a language model. And as long as it's finite state, as long as we restrict ourselves to models of language that are finite state, there'll be no difficulty in combining it with the HMMs to make a single large finite state model. So this part, connected speech, is going to be easy. The elephant in the room, the thing we haven't talked about throughout, is how to actually learn the model from data. Now, you might think it would have been sensible to do that at the beginning and then to use the model to do recognition, because of course that's what we do when we build the system. But that's the hard way to learn it. It's much easier to understand what the model's doing when it's emitting an observation sequence, when it's computing the probability while doing recognition. And once we've understood all of that, we can use that knowledge to understand the hardest part of the course overall. And that's how to learn the model from data. We call that training. Now, computationally, it's going to be straightforward. It's going to involve very much the sort of computations we just saw there on the lattice. That's an attractive property of the Hidden Markov Model: it's computationally straightforward to train, to learn from data.
But conceptually, you're going to find it perhaps the most challenging part of the course, which is why we've left it till last, until we're comfortable about HMMs and about how to do recognition with them. If you've got a good memory from the beginning of this module, you'll see that there's something missing from this picture, something we haven't done yet. And that was fitting a Gaussian to data. Well, we've mentioned that several times so far. So let's first have a quick recap and see how that helps us in learning a Hidden Markov Model from data. That's the equation for the Gaussian. Whenever you see this equation, you'll zoom in on this part and then zoom in on this part, and see first a Euclidean distance and then a Euclidean distance scaled by variance: a scaled Euclidean distance. And then the rest of the equation just makes it into a nice bell-shaped curve with an area of one. This course isn't about proofs, so we're not going to do a proof for how to estimate the parameters of the Gaussian. We're going to state them and then say that they look quite reasonable. So what are the parameters? Mean and variance: mu and sigma squared. This is a univariate Gaussian, but that's all we need, because we're assuming no covariance everywhere. So how do you estimate the mean and the variance of a Gaussian from some data? Well, first of all, you need some data. So let me write down some data points and give them some notation: X1, X2, X3. They aren't necessarily in any sequence. They're just data points. There's no idea of ordering. So the Gaussian has no model of sequence: the data points that train it are just some big bag of data points. Let's have N of them, N data points. And these data points are labelled data. They're labelled as having come from this Gaussian. They're labelled with the class that this Gaussian is modelling. The requirement for labelled data to estimate the model parameters is normal in machine learning, and we say that the machine learning is then supervised.
And that's the only sort of machine learning we're covering in this course: the data are labelled with the class that we're modelling. So I'm going to state without proof that the best possible estimate of the mean of the Gaussian is the mean of the data. So I'm going to write down that my best estimate (that's what the little hat means), my best estimate of the mean of the Gaussian, is the empirical mean: the mean found through experimentation, found from data. And that's just going to be: add up all the data points and divide by how many there were. This estimate of the mean is the one that makes this data the most probable it can be, given the Gaussian. So we can say that this estimate is a maximum likelihood estimate of the mean. You can see that intuitively: by setting mu to be the mean of all the x's, we'll make this term, on average, as small as possible, and that will make the probability as large as possible. I'm also going to state without proof that the best estimate of the variance is going to be the variance of the data. Now, the variance of the data is the average squared distance of the data from the mean, or the average squared deviation. And we can say that intuitively, because the variance is scaling the squared distance between data and mean. So I'm going to write that my empirical estimate of variance, variance hat, is going to be the squared distance between data and mean, summed up over all the data and divided by how many data points there were. It's the average squared distance from the mean. I'm going to state those without proof as being the most reasonable estimates, and they are. We could prove that these are the estimates that make this data have the highest possible probability (maximum likelihood) given this Gaussian. That's fine. These are lovely, simple, easy formulae. They work in one step. Take the data, average it, and you've got the mean.
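Those two one-step formulae can be written out directly; the data points below are made up for illustration, labelled as coming from this one Gaussian (just a bag of points, with no ordering).

```python
# Maximum likelihood estimates for a univariate Gaussian.
data = [1.0, 2.0, 3.0, 2.0, 2.0]
N = len(data)

# mu-hat: the empirical mean (add up the data, divide by how many).
mean_hat = sum(data) / N

# sigma-squared-hat: the average squared distance from that mean.
var_hat = sum((x - mean_hat) ** 2 for x in data) / N

print(mean_hat, var_hat)
```

Both estimates really are one-step: no iteration is needed, which is the property we lose once the state sequence is hidden.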
With that estimated mean, take the data again, work out the average squared distance from the mean, and that's your variance. But that required this data to be known to come from this one Gaussian. Unfortunately, that's not the case for the Hidden Markov Model. Imagine I am fitting this Hidden Markov Model to this one observation sequence. I've labelled this sequence as having come from the model, but I am not able to label which frame in the sequence comes from which state of the model, because the state sequence is hidden. I don't know that. And I am trying to estimate the mean and variance of a Gaussian here based on the data, and I don't know exactly how the model aligns with the data. That's the problem to be solved in the next module: an algorithm for training the HMM given the data. We've seen that if we could find an alignment between the states and the observations, we could just apply those simple formulae for estimating the mean and the variance. But the fact that the state sequence is hidden makes finding this alignment non-trivial. So that's to be solved in the next module.
The Viterbi algorithm and token passing
The Viterbi algorithm can be computed on different data structures. The token passing version turns out to be very convenient for ASR.