So we've seen that there's a transition of ideas from dynamic time warping, where the data structure was a simple grid, to this new data structure for the Hidden Markov Model called the lattice, where the allowable transitions between the states in the lattice are dictated by the topology of the model, not by some simple external rule. And so the lattice is a general data structure that would work for any shape, any topology of HMM. And on that lattice, on that data structure, we can perform the algorithm of dynamic programming. And when we apply that to the Hidden Markov Model, it gets the special name of the Viterbi algorithm. So we're done with dynamic time warping. We learned it because it's a nice way to understand dynamic programming. We've had this transition now to the Hidden Markov Model. And a good way to think about the Hidden Markov Model is as a generalization of dynamic time warping with templates. The template, the stored exemplar, was a bit arbitrary. It was just one recording of someone saying this digit, for example, and it didn't capture variability, either acoustically, around the average way of saying it, or indeed in terms of duration. It was just one particular way of saying it. And so we've generalized: we've replaced that with a model. In general, the model will have fewer states than there were frames in the template. It's a generalization. Each of the states contains a multivariate Gaussian that expresses a statistical description of what that part of the word sounds like. So, for example, it replaces these three exemplar frames with their mean and variance. And the model has a transition structure, transitions between the states, that does the job of going through the model in the right order (for most models in speech, that will be left to right) and of how long to spend in each part of the model. And that's a model of duration.
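A rough sketch of that idea in Python (the frames and numbers here are made up for illustration, not taken from the lecture): three exemplar frames are replaced by a per-dimension mean and variance, which then act as one state's diagonal-covariance Gaussian.

```python
import math

# Hypothetical example: three 2-dimensional exemplar frames from a
# template, summarised by one HMM state's mean and variance.
frames = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
n, dim = len(frames), len(frames[0])

mean = [sum(f[d] for f in frames) / n for d in range(dim)]
var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]

def state_loglik(x):
    """Log-likelihood of one observation frame under this state's
    Gaussian, assuming no covariance between dimensions."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

print(state_loglik([1.0, 2.0]))
```

The state no longer stores the three frames themselves, only their statistics, so any new frame can be scored against it, highest of all at the mean.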
It's a very simple model of duration, but it is a model of duration. So it's very important that any algorithm developed for the Hidden Markov Model doesn't rest on this left-to-right topology, even if it is the most common. I might occasionally want to use a model that has this transition. For example, maybe I'll have a model of silence that can just go round and round, generating as much silence as I need at the beginnings and ends of utterances. And I might put that transition in the model for that purpose. The lattice still works. Let's just draw the lattice for this particular topology that has this transition going from the third emitting state back to the first one, or in HTK language, from the fourth state to the second state, because the non-emitting one gets numbered one. So for this model, let's draw the lattice. Well, the lattice topology is just dictated by the model. So from this state, we can go here or here. There should be arrows on all of these. From this state, same thing. And from this state, we can now go to the non-emitting final state, or self-transition, or go back to here. We can go here or leave. That would be leaving too early. Or go back to the start. There, there, back to the start. There, there, back to there. There, there, back to there. This lattice is a little bit more complicated, but it's still just a set of states joined by transitions. And any algorithm that we can perform on the simpler lattice, we can perform on this one. And what we're going to see now is that we can do the computations on that lattice (that's a perfectly reasonable way to do it, on this data structure), or we can find an alternative implementation which uses the HMM itself as the data structure. So actually there are at least two ways to implement the Viterbi algorithm for HMMs. One is to make the lattice and to perform dynamic programming on the lattice. You start here, you go to the first emitting state, you emit an observation.
Then you either go here or here, and you emit the next observation in sequence. Now each state gets to emit that observation. Each will in general emit it with a different probability and update the probability of the partial path. So there'll be two partial paths alive at this point, and they'll compete later when they meet. But for now we must maintain both of them. So we're doing a parallel search, and then we'll just propagate forward through the model like that. That's one way of doing it. But there's another way of doing it directly on the HMM structure. So I'm going to make a connection for you now between the way of doing it on the lattice to the left (that'll be how most textbooks describe the algorithm) and an algorithm which is going to use the HMM itself, on the right, as the data structure. It's not going to make the lattice. It's never going to draw this lattice out. It's going to use the HMM as the data structure. And it's going to be equivalent, and that algorithm is called token passing. Along comes a sequence of observations, and we're going to emit them in sequence. So there's the first one. On the left, we'll have made a transition to this state, and we'll emit it and compute the probability of doing that. On the right, we'll have put a token here. We'll have sent the token down an arc. It will arrive here, and we'll use that Gaussian to emit that observation. We'll update the token's probability with that probability. And then comes the next observation in sequence. We'll go forward here and compute that. Go forward here and emit it again from a different state. So there are two different state sequences possible now. This is the state sequence two, two. This is the state sequence two, three. I'm using HTK numbering. The counterpart over here would be that one copy of the token goes around the self-transition and ends up back here. Another copy of the token goes to the next state and ends up here. This one emits the observation, and so does this one.
And they update their probabilities, and we have two tokens alive in the model at once. This token is the same as this point in the lattice. This token is the same as this point in the lattice. And then we just turn the handle on the algorithm, repeating those steps over and over again. So let's examine something in the middle of the algorithm. At some point, we're generating this observation, and imagine we were going to generate it from here, from this emitting state. Two paths will have arrived from the previous time step. We'll have done the max operation to keep only the most probable. And then we'll generate this observation from the Gaussian of this state and update that probability there. The counterpart of that will be over on this side. We're arriving in this state. One token will have come from this direction. One token will be a self-transitioning token that was already in that state at the previous time step. We'll do the max to keep just the most likely of those two tokens and discard the other one. And then we'll emit this observation and update the token's probability. So hopefully you can see that the thing on the left and the thing on the right are doing the same computation. On the left, we need a data structure, which is a lattice. And the size of the lattice is going to be governed by the length of the observation sequence and the number of states in the model. Now, that doesn't seem very big here, because our eventual observation sequence has just got five observations and our model only has three emitting states, so our lattice only has 15 states in it. But in general, our models might be much, much bigger than that. And our observation sequences might be considerably longer: tens or hundreds of frames long. And so this lattice could get quite large and use up a lot of memory. So the implementation on the right could use a lot less memory, because it only uses the HMM itself as the data structure.
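To make that size argument concrete, here is a hypothetical sketch of how the lattice's nodes and arcs fall straight out of the model's topology; the transition table below is invented, but it includes the loop-back arc from the third emitting state back to the first described earlier.

```python
# Illustrative only: emitting states 0..2 (HTK's states 2..4).
# Each entry lists where a state is allowed to go next, including
# the loop-back from state 2 to state 0.
trans = {
    0: [0, 1],   # self-loop or forward
    1: [1, 2],   # self-loop or forward
    2: [2, 0],   # self-loop, or back to the start (exit arc not shown)
}
T = 5  # five observation frames, as in the example above

# A lattice node is a (time, state) pair; its arcs come only from
# the model's topology, not from any external rule.
nodes = [(t, s) for t in range(T) for s in trans]
arcs = [((t, i), (t + 1, j))
        for t in range(T - 1) for i in trans for j in trans[i]]

print(len(nodes))  # 15 nodes: 5 frames x 3 emitting states
print(len(arcs))
```

The node count grows as frames times states, which is exactly why the lattice can get large for long utterances and big models.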
So the size of its data structure is just the size of the HMM. How does the algorithm on the right get to use so little memory? Well, essentially it's just computing this, and then this, and then that can be discarded. And then this, and that can be discarded. And then this, and that can be discarded. And then this, and that can be discarded. So token passing is actually just a very particular form of search on the lattice. It's called time-synchronous. It computes in strict order of time, and it only needs as much memory as the HMM, which exists anyway. So this one is token passing. The one on the left is the search on a lattice. And we could imagine, in this lattice, that we might not want to strictly explore it in this order. That's only one way of doing it. We already saw in dynamic time warping that there are other ways of doing it. We could explore it in more interesting orders, for example, and that would not be time-synchronous: we'd have moved through time before we'd finished going through all the model states. But for the purposes of this course, we're only really bothered about time-synchronous search. And so token passing is a perfectly good way to understand the Viterbi algorithm for HMMs. So you know the HMM now. It's a finite state network, and inside each state is a Gaussian acting as a generative model. It can generate a single observation, and then we must leave the state, either going around a self-transition or to another state. It doesn't matter: the model topology tells us what's allowed. And every time we arrive in a state, we use its Gaussian to emit the next observation in sequence. We've only seen that for a single Hidden Markov Model. But I've been saying all the time that the way that we do classification is by having one Hidden Markov Model for each of the classes that we want to recognise, and we'll compute the probability of each of them generating the observation sequence and pick the highest.
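A minimal sketch of token passing, assuming a toy discrete-emission HMM rather than Gaussians (all the names and numbers here are invented): one token per state, copies sent down every allowed arc, a max at each state to keep only the best arrival, then an emission update, repeated frame by frame.

```python
import math

def lg(p):
    """Safe log: impossible transitions get log probability -inf."""
    return math.log(p) if p > 0 else float("-inf")

def token_passing(obs, log_init, log_trans, emit_loglik):
    """Time-synchronous Viterbi using the HMM itself as the data
    structure: one token (best partial-path log probability) per state,
    overwritten every frame, so memory is O(states), not O(frames x states)."""
    n = len(log_init)
    # Put a token in each state and emit the first observation.
    tokens = [log_init[j] + emit_loglik(j, obs[0]) for j in range(n)]
    for o in obs[1:]:
        # Copies of each token travel down every arc; the max keeps only
        # the most probable arrival, then the state emits the observation.
        tokens = [max(tokens[i] + log_trans[i][j] for i in range(n))
                  + emit_loglik(j, o)
                  for j in range(n)]
        # The previous time step's tokens are now discarded.
    return max(tokens)  # log probability of the most likely state sequence

# Toy two-state example with discrete emissions (numbers are made up).
log_init = [lg(1.0), lg(0.0)]
log_trans = [[lg(0.7), lg(0.3)],
             [lg(0.0), lg(1.0)]]
emit = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
emit_loglik = lambda j, o: lg(emit[j][o])

print(math.exp(token_passing("aab", log_init, log_trans, emit_loglik)))
```

Replacing `emit_loglik` with a per-state Gaussian log-likelihood gives the HMM of the lecture; the control flow is unchanged.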
So what's coming next is actually a way of connecting all the models together to make a single Hidden Markov Model, not necessarily of isolated words, but of a sentence made of words: so, connected speech. And that extension, because we've used finite state models, will turn out to be really quite easy. All we need is something that tells us how to join models of words together to make models of sentences. And then we can generalise further and say: let's model subword units. For example, let's model phonemes. And then we need something else that tells us how to join phonemes together to make words. So the thing that tells us how to join words together to make sentences is called a language model. And as long as it's finite state, as long as we restrict ourselves to models of language that are finite state, there'll be no difficulty in combining it with the HMMs to make a single large finite state model. So this part, connected speech, is going to be easy. The elephant in the room, the thing we haven't talked about throughout, is how to actually learn the model from data. Now, you might think it would have been sensible to do that at the beginning and then to use the model to do recognition, because of course that's what we do when we build the system. But that's the hard way to learn it. It's much easier to understand what the model's doing when it's emitting an observation sequence, when it's computing the probability while doing recognition. And once we've understood all of that, we can use that knowledge to understand the hardest part of the course overall. And that's how to learn the model from data. We call that training. Now, computationally, it's going to be straightforward. It's going to involve very much the sort of computations we just saw there on the lattice. That's an attractive property of the Hidden Markov Model: it's computationally straightforward to train, to learn from data.
But conceptually, you're going to find it perhaps the most challenging part of the course, which is why we've left it till last, until we're comfortable about HMMs and about how to do recognition with them. If you've got a good memory from the beginning of this module, you'll see that there's something missing from this picture, something we haven't done yet. And that was fitting a Gaussian to data. Well, we've mentioned that several times so far. So let's first have a quick recap and see how that helps us in learning a Hidden Markov Model from data. That's the equation for the Gaussian. Whenever you see this equation, you'll zoom in on this part and then zoom in on this part, and see first a Euclidean distance and then a Euclidean distance scaled by variance: a scaled Euclidean distance. And then the rest of the equation just makes it into a nice bell-shaped curve with an area of one. This course isn't about proofs, so we're not going to do a proof for how to estimate the parameters of the Gaussian. We're going to state them and then say that they look quite reasonable. So what are the parameters? Mean and variance: mu and sigma squared. This is a univariate Gaussian, but that's all we need, because we're assuming no covariance everywhere. So how do you estimate the mean and the variance of a Gaussian from some data? Well, first of all, you need some data. So let me write down some data points and give them some notation: X1, X2, X3. They aren't necessarily in any sequence. They're just data points. There's no idea of ordering. So the Gaussian has no model of sequence: the data points that train it are just some big bag of data points. Let's have N of them, N data points. And these data points are labelled data. They're labelled as having come from this Gaussian. They're labelled with the class that this Gaussian is modelling. The requirement for labelled data to estimate the model parameters is normal in machine learning, and we say that the machine learning is then supervised.
And that's the only sort of machine learning we're covering in this course: the data are labelled with the class that we're modelling. So I'm going to state without proof that the best possible estimate of the mean of the Gaussian is the mean of the data. So I'm going to write down that my best estimate (that's what the little hat means), my best estimate of the mean of the Gaussian, is the empirical mean: the mean found through experimentation, found from data. And that's just going to be: add up all the data points and divide by how many there were. This estimate of the mean is the one that makes this data the most probable it can be, given the Gaussian. So we can say that this estimate is a maximum likelihood estimate of the mean. You can see that intuitively: by setting mu to be the mean of all the x's, we'll make this term, on average, as small as possible, and that will make the probability as large as possible. I'm also going to state without proof that the best estimate of the variance is going to be the variance of the data. Now, the variance of the data is the average squared distance of the data from the mean, or the average squared deviation. And we can say that intuitively, because the variance is scaling the squared distance between data and mean. So I'm going to write that my empirical estimate of variance, variance hat, is going to be the squared distance between data and mean, summed up over all the data and divided by how many data points there were. It's the average squared distance from the mean. I'm going to state those without proof as being the most reasonable estimates, and they are. We could prove that these are the estimates that make this data have the highest possible probability (maximum likelihood) given this Gaussian. That's fine. These are lovely, simple, easy formulae. They work in one step. Take the data, average it, and you've got the mean.
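Those two one-step formulae can be written out directly; the data points below are made up for illustration, labelled as coming from this one Gaussian (just a bag of points, with no ordering).

```python
# Maximum likelihood estimates for a univariate Gaussian.
data = [1.0, 2.0, 3.0, 2.0, 2.0]
N = len(data)

# mu-hat: the empirical mean (add up the data, divide by how many).
mean_hat = sum(data) / N

# sigma-squared-hat: the average squared distance from that mean.
var_hat = sum((x - mean_hat) ** 2 for x in data) / N

print(mean_hat, var_hat)
```

Both estimates really are one-step: no iteration is needed, which is the property we lose once the state sequence is hidden.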
With that estimated mean, take the data again, work out the average squared distance from the mean, and that's your variance. But that required this data to be known to come from this one Gaussian. Unfortunately, that's not the case for the Hidden Markov Model. Imagine I am fitting this Hidden Markov Model to this one observation sequence. I've labelled this sequence as having come from the model, but I am not able to label which frame in the sequence comes from which state of the model, because the state sequence is hidden. I don't know that. And I am trying to estimate the mean and variance of a Gaussian here based on the data, and I don't know exactly how the model aligns with the data. That's the problem to be solved in the next module: an algorithm for training the HMM given the data. We've seen that if we could find an alignment between the states and the observations, we could just apply those simple formulae for estimating the mean and the variance. But the fact that the state sequence is hidden makes finding this alignment non-trivial. So that's to be solved in the next module.
The Viterbi algorithm and token passing
The Viterbi algorithm can be computed on different data structures. The token passing version turns out to be very convenient for ASR.