Conditional independence and the forward algorithm

We use the Markov property of HMMs (i.e. conditional independence assumptions) to make computing probabilities of observation sequences easier

This video just has a plain transcript, not time-aligned to the video.
Let's get on to the harder part of this module, and that is how to train hidden Markov models.
So HMMs have a hidden state sequence.
That's what's fundamentally going to make the algorithms hard.
We've seen that we can express the hidden state sequence by drawing out the lattice, and that's unrolling the HMM over time to show that the HMM's states exist at all times.
That expresses the fact that there are many possible paths for recognition, for the process of doing inference with the model to find the most likely words, given the observations.
We often only care about finding the most likely path, which is an approximation to the sum over all paths.
Once we've decided that approximation is fine, we can use an exact algorithm to find that path.
So don't confuse the fact that finding the one best path is an approximation to summing over all paths with the fact that we find that one best path exactly.
So, knowing all of that, we're now going to derive a method for training a model, in other words finding estimates of its parameters given the data, and that's the Baum-Welch algorithm.
To warm up to the Baum-Welch algorithm, which is conceptually hard, we are only going to go through the concepts of it.
We're not going to work through the detail of the maths: that's in the readings.
To warm up for that, let's just recap this very famous equation that we should now understand.
On the left of this equation is what we actually want to compute for every possible W in the language, so all the things that the person might have said when speaking to our speech recogniser.
We want to compute the probability of each of those, given the sequence of observations that has arrived at our speech recogniser, and then we'll pick the W that maximises this term.
We can use Bayes' rule to rewrite that as the terms on the right.
We can write the probability of the observations given the words: that's the thing called the likelihood.
It's what our HMM computes.
That gets multiplied by the probability of just the words.
There are no observations involved in that term, so that must be something we can compute without any observations.
In other words, we compute it before we see the observations, and that's known as the prior.
It's your prior belief, or your biases, over which word sequences are more or less likely: that's the language model.
And this unfortunate term here is the prior probability of a particular observation sequence, which we have no idea how to compute.
But we don't need it, because for any one run of the speech recognition system it's always a constant.
So P(O) is a constant, and so we typically cross it out, write a proportional sign in here, and don't worry about this term.
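For reference, the famous equation being recapped can be written out like this (W is a word sequence, O is the observation sequence; since P(O) is a constant for any one run, dropping it doesn't change which W wins the argmax):

```latex
\hat{W} \;=\; \operatorname*{argmax}_{W} P(W \mid O)
        \;=\; \operatorname*{argmax}_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        \;=\; \operatorname*{argmax}_{W} \underbrace{P(O \mid W)}_{\text{likelihood}}\;\underbrace{P(W)}_{\text{prior}}
```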
If we multiply together the likelihood and the prior, what we get is the probability after seeing the observations, and the fancy word for "afterwards" is "posterior".
And so that P(W|O), given the observations, is the posterior.
The Bayesian way to read this equation is to say that our prior beliefs about W, before we made any observations, have been revised by receiving some evidence.
Given our prior beliefs, we received evidence, and we've revised our beliefs about W by multiplying the likelihood by our prior: this is our revised belief about W.
P(W), not conditioned on anything, is our advance belief: what we believe before we hear anything, the language model.
And then we have our P(W|O), given some observations: that's the posterior, after we received the evidence.
So we can think of this as a way of revising beliefs: belief revision.
So, the hidden state sequence causes us problems.
Let's define some terms.
We want to compute the probability of a sequence of observations, given the model.
We can compute two things.
The correct thing to do is to consider all possible state sequences.
We often use the random variable Q to refer to the state sequence.
To consider all possible state sequences is to consider all possible values of Q.
We marginalise Q away, and that involves summing over Q.
So we sum over all possible state sequences to give the total probability that the model emitted the sequence of observations.
That probability is sometimes called the forward probability, and the algorithm for computing it is, unsurprisingly, called the forward algorithm.
That's the correct computation to do with an HMM, but it could be expensive, because there are a lot of things to sum up there.
We often approximate the forward probability with just the probability along the most likely state sequence.
So we just take the single most likely value of Q.
We search for that: it's a search problem.
The Viterbi algorithm is the solution to that problem.
The Viterbi algorithm finds the most likely Q, the one most likely Q.
We approximate the sum (the forward probability) with the sum just along that Q, and that's like taking the max.
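In symbols, writing λ for the model, those two quantities are:

```latex
P(O \mid \lambda) \;=\; \sum_{Q} P(O, Q \mid \lambda)
  % forward probability: sum over all state sequences
P(O \mid \lambda) \;\approx\; \max_{Q} P(O, Q \mid \lambda)
  % Viterbi approximation: keep only the single most likely state sequence
```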
Summing over all paths would appear to be the correct thing to do for automatic speech recognition, and it is, by definition of the model.
However, we find in practice that the Viterbi approximation works at least as well, and it's a lot faster.
So the Viterbi algorithm is the default for doing inference, for doing automatic speech recognition, once the model has been trained.
What we're going to talk about now is how to train the model, how to estimate its parameters, and we'll see that we need a more complicated algorithm for that: to estimate the parameters of the model, given an observation sequence that we know came from that model (so the data are labelled), and that's the forward-backward algorithm.
So we're just going to work our way up to that conceptually, and we'll finish by just seeing what it would look like to do the computations for the forward-backward algorithm.
We leave the maths for the readings.
You'll never be asked to reproduce that maths in an exam situation; we just want to get a conceptual understanding of how we would train this model.
All of these algorithms involve something like dynamic programming; the Viterbi algorithm is an explicit form of dynamic programming, but the forward algorithm and the forward-backward algorithm also have ways of being extremely efficient: we can sum over a very large number of possible state sequences in the most efficient way possible.
So the key property of the HMM, which I hope we understand by now, is this: if I show you this model and I show you this observation sequence, exactly how that model generated the observation sequence is known only to the model.
It's not known to us.
We observe the observations; we don't get to see the state sequence.
If you want a mental concept of this model emitting this observation sequence, the way to think about that, as a Bayesian, would be to have the model emitting in all possible ways at once, simultaneously.
So the model simultaneously does that, and that, and that, and so on and so forth: every possible permutation of state sequence that could generate this observation sequence.
It's doing all of them at once, and they have varying probabilities.
Those are just some of the many possible state sequences for this three-emitting-state model and this sequence of 6 observations.
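To make that concrete, here's a small sketch (my own illustration, not from the video) that enumerates every state sequence by which a 3-emitting-state HMM could generate 6 observations, assuming a strictly left-to-right topology in which each state may only self-loop or move on to the next state:

```python
from itertools import product

N_STATES = 3   # emitting states, numbered 1..3
N_OBS = 6      # length of the observation sequence

def is_valid(path):
    """A path is valid if it starts in state 1, ends in state 3,
    and each step either stays in the same state or moves to the next."""
    return (path[0] == 1 and path[-1] == N_STATES
            and all(b - a in (0, 1) for a, b in zip(path, path[1:])))

paths = [p for p in product(range(1, N_STATES + 1), repeat=N_OBS) if is_valid(p)]
for p in paths:
    print(p)
print(len(paths), "possible state sequences")  # 10, for this topology
```

The model is "doing" all 10 of these at once, each with its own probability.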
Another view of the same thing, now for a bigger model and a longer observation sequence, is not to see the finite state machines but to look at the lattice.
And now we don't need to draw the state sequences all on top of each other, because this shows us all of them.
Let's just draw some paths through this lattice by which this HMM could align with the sequence of observations; in other words, by which it could emit the sequence of observations.
Not all parts of this lattice here are reachable, because of the topology of this HMM.
So this part is reachable; and this part, these are useless, because we can't get to the end.
We have to finish here in this particular model, and we'll start here; and these we don't ever get to from the start.
So this area here expresses all the different paths by which this HMM could generate this observation sequence, and all of these join up.
So the full lattice looks something like this.
Sometimes people say "trellis" instead of "lattice".
I think those people must be gardeners.
This lattice is a picture of the HMM emitting the observation sequence whilst taking all possible paths.
The HMM is a very simple model.
It has no memory.
The state is everything, so we say the state is the memory.
There's no external way to store any information.
So anything that the model needs to know is encapsulated by being in a particular state.
What does that mean in practice? Let's consider this model generating this observation sequence, and being in this state when we generate this observation.
Imagine this is part of the way that this model generated this observation sequence.
Knowing that means that how it generated this part of the observation sequence is entirely independent of this part, because we know the current state: the past and the future become independent, given the current state.
So we say they are conditionally independent.
What does that look like on the lattice? We'll zoom in to a particular state.
We're now in this state, emitting this observation in the sequence.
That means that all the computations to get to that point, which involved all of these paths here, are conditionally independent of what follows: they have no bearing on the computation of how to get from this state to the end.
The green computations and the purple computations are entirely independent.
They could be done separately, for example.
The model is Markov: it's memoryless.
Another way of saying that is that all you need to know is what state you're in, not how you got there.
So if I say that we're in this state, then the probability of this state emitting this observation depends only on that state: in other words, on what Gaussians are in there.
And, most importantly, it doesn't depend on how the previous observation was emitted.
It could have also come from that state, but we've forgotten; or it could have come from that state: we've forgotten.
And it doesn't depend on how we're going to generate the next observation, whether we're going to stay in the same state or whether we'll have moved to the next state.
We don't need to know that: the probability of an observation depends only on the current state.
So what we say is that the observations are independent of each other, given the state.
We're not saying that observations are just independent of each other; they're independent given the state, so they are conditionally independent.
That's not quite such a strong assumption.
We're saying that all you need to know is the state.
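Stated formally, the two assumptions being described here are usually written like this:

```latex
P(q_{t+1} \mid q_1, \dots, q_t) \;=\; P(q_{t+1} \mid q_t)
  % Markov property: the next state depends only on the current state
P(o_t \mid q_1, \dots, q_T,\, o_1, \dots, o_{t-1}) \;=\; P(o_t \mid q_t)
  % observations are conditionally independent, given the state
```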
What that means, when we're doing computations on the lattice, is that if we have arrived here and we want to find the probability of emitting this observation, everything we need to know is local: it's contained here.
It's in the Gaussians for this state.
We don't need to remember how we got here.
We don't need to know how we're going to leave.
Computations happen locally, and that simplifies computation enormously.
So those are the key properties of an HMM.
So now, what computations do we need to do to be able to train such a model? Let's just recap the one that we do recognition with: that's the Viterbi algorithm.
Well, we put a token in the start and run token passing, and at some point a token comes out of the end, having generated all the observations in the sequence.
This token, if we wished, could record its state sequence, and we could recover that, and it would tell us the one best state sequence.
That's one computation we need for the HMM.
What does that look like on the lattice? Well, it just finds us one path from start to end; for example, it might find us this one and no others.
And it'll tell us the probability of the model generating the observation sequence whilst taking that one state sequence.
And that's what we use to do recognition.
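As a minimal sketch of that computation in code (illustrative only; the inputs `pi`, `A` and `b` are assumed: the initial state distribution, the transition matrix, and the per-frame observation likelihoods; a real implementation would work in the log domain to avoid underflow):

```python
import numpy as np

def viterbi(pi, A, b):
    """pi: (N,) initial state probs; A: (N, N) transition probs;
    b: (T, N) observation likelihoods, b[t, j] = p(o_t | state j)."""
    T, N = b.shape
    delta = np.zeros((T, N))            # best-path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointer: where that best path came from
    delta[0] = pi * b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: best path into i, then i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[t]
    # trace back the single best state sequence (the token's route, in token-passing terms)
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    path.reverse()
    return path, delta[-1].max()
```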
If we want to exactly compute the total probability of this model emitting this observation sequence, we need to sum over all state sequences.
We need to marginalise out the state sequence, and that means doing the correct computation: doing the sum, and not the max.
And that's called the forward algorithm.
This becomes harder and harder to understand on the finite state picture; it's much easier to understand on the lattice.
So again, let's just draw the part of the lattice that's actually reachable for this particular model topology and this particular observation sequence, and we're now going to sum across all paths through this lattice.
Within that summation, let's just pick a place where we're doing some of that computation.
When we arrive here, with some path coming in this way and a path coming in this way, we combine those with the probability of emitting this observation, sum these probabilities together, write the result there, and then that propagates forward through the lattice.
By the time we get to here, we have the total probability of going from the start to the end and emitting the observation sequence, using all possible state sequences: summing away Q.
You could do recognition with that, and you would use the probability here in the final state as the final probability: the total probability that this model emitted this observation sequence.
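Here's the same forward recursion as a sketch in code (same assumed `pi`, `A`, `b` as in the Viterbi sketch above; notice the only real change is a sum in place of the max):

```python
import numpy as np

def forward(pi, A, b):
    """Total probability of the observation sequence, summed over all state sequences.
    pi: (N,) initial state probs; A: (N, N) transition probs;
    b: (T, N) observation likelihoods, b[t, j] = p(o_t | state j)."""
    T, N = b.shape
    alpha = np.zeros((T, N))   # alpha[t, j]: prob of o_1..o_t, ending in state j
    alpha[0] = pi * b[0]
    for t in range(1, T):
        # sum over all paths arriving in each state, then emit this observation
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    return alpha[-1].sum()     # total (forward) probability at the final frame
```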
So now the hard part: the Baum-Welch algorithm.
In the Baum-Welch algorithm, we're going to have to compute something more specific, and that's the probability of being in a particular state at a particular time.
And we'll need that to know whether that state emitted a particular observation or not.
And if we knew that, we could use the observation to update that state's Gaussians.
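In standard HMM notation, that quantity, the probability of being in state i at time t given the whole observation sequence, is usually written γ, and it's computed from the forward probability α together with an analogous backward probability β (hence "forward-backward"):

```latex
\gamma_t(i) \;=\; P(q_t = i \mid O, \lambda)
            \;=\; \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}
            \;=\; \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j} \alpha_t(j)\,\beta_t(j)}
```

Here α_t(i) is the probability of generating o_1, ..., o_t and being in state i at time t, and β_t(i) is the probability of generating the rest of the sequence, o_{t+1}, ..., o_T, given state i at time t.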
