Training – introduction

The problem we need to solve is that we don't know the alignment between states and observations.

We're going to go straight on now to the hard part: possibly the most conceptually hard part. We're going to do it in entirely non-mathematical terms.
We haven't talked very much about transition probabilities, and we're not going to say a whole lot here about how we train them; I'll just mention how they might be trained. The values of those transition probabilities aren't so important. They're not doing a whole lot of work: in other words, they're not particularly discriminative between one class and another class. The Gaussians are really where it's at. The transition probabilities are a very crude model of duration, and people have tried putting much more sophisticated models of duration in, for no win except a large computational cost. So duration is not a hugely discriminative cue in speech recognition.
It's really the distribution in MFCC space that tells us what's what. So, we know how to estimate the parameters of a Gaussian probability density function. PDF: let's remind ourselves that it's called a density, rather than just a probability function, because it's for a continuous-valued thing. Just think of it as a scatter plot: how dense are those points in each region of space? That's why it's called a density. It doesn't give you an absolute probability, but that doesn't matter, because it's equivalent and it gets us what we need. So it's a probability density function.
We stated, without proof, that the way to estimate the parameters is to maximise the likelihood of the training data given the model. In other words, we turn the knob called "mean" up and down until the training data looks as likely as possible, and simultaneously turn the knob called "variance" up and down until it makes the training data as likely as possible. Those simple estimates, taking the mean of the data and taking the variance of the data, do exactly that. In a theoretical course we would actually prove from first principles that these are the best estimates; here we're just going to state it.
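For reference, here are those two estimates written out. This is the standard maximum-likelihood result for a univariate Gaussian; the notation is mine, not the video's:

```latex
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n
\qquad
\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} \left( x_n - \hat{\mu} \right)^2
```

where \(x_1, \dots, x_N\) are the observations assigned to that Gaussian.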
One important thing to note there is that the Gaussian only needs to see the data that we thought was generated by that Gaussian. We don't need to see data from other Gaussians, from other classes: it's a purely generative paradigm. We only learn from positively-labelled examples; we're not learning, for example, to discriminate against other classes. So the model of the word "eight" is learned just from recordings of the word "eight", and it doesn't need to look at "seven"s and make sure that it's bad at generating "seven". We just hope that's the case, because it's good at generating "eight". That's somewhat simplistic; more advanced systems might go further than that and try to separate the classes, but we're not doing that here.
Now, we want to estimate the Gaussians in the HMM states, and we immediately hit a problem: we've got sequences of observations of variable length, and HMMs with more than one emitting state. We don't know which state generated which observation, and that's the problem to solve in training.
We're going to solve it, firstly, through a ridiculously simple and naive method. Then we're going to use a slightly better, more reasonable method, still an approximation. And then we're going to look at what we really do. Remember, in testing, in decoding, when we compute the probability that a model generated an observation sequence, the correct thing to do, by the definition of the hidden Markov model, is to add up the probabilities of all the different state sequences that could have generated that observation sequence. Sum their probabilities together: that's the probability of the model generating that sequence.
That's rather expensive, because there are a lot of state sequences. At test time, at recognition time, we really care about speed; computational cost really matters. So we make an approximation: we just look for the single most likely state sequence, and that's found by dynamic programming, by the Viterbi algorithm. That's what token passing gets us. The probability of the single most likely sequence is a pretty good approximation to the total probability.
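In symbols (standard HMM notation, not taken from the video): writing \(\lambda\) for the model, \(\mathbf{O}\) for the observation sequence and \(Q\) for a state sequence,

```latex
P(\mathbf{O} \mid \lambda) \;=\; \sum_{Q} P(\mathbf{O}, Q \mid \lambda)
\;\;\approx\;\; \max_{Q} P(\mathbf{O}, Q \mid \lambda)
```

The sum on the left is the exact probability; the max on the right is what Viterbi decoding computes.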
Empirically, we find that that's good enough: it gives us just as good recognition results as doing the right thing, with much faster computation. At training time, we don't care nearly so much about computational cost. Training is done once, offline, before we need to run the system, and so we're going to do the right thing in training: we are going to consider every possible state sequence. We should really do that during recognition too, but there it's too expensive and it doesn't really help performance. At training time, though, it is worth going the extra mile to do the right thing. But we're also going to use the approximation, as one crude form of training that gets us partway towards the right thing.
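To make the sum-versus-max distinction concrete, here is a minimal sketch in Python with NumPy. The toy model, its numbers, and the observation likelihoods are all made up for illustration; only the two recursions matter:

```python
import numpy as np

# A toy left-to-right HMM. All numbers here are made up for illustration;
# nothing below comes from the lecture itself.
A = np.array([[0.6, 0.4, 0.0],   # transition probabilities a_ij
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])   # always start in the first state

# b[t, j] stands in for the Gaussian likelihood of observation t in state j
b = np.array([[0.9, 0.1, 0.1],
              [0.5, 0.6, 0.1],
              [0.1, 0.8, 0.2],
              [0.1, 0.3, 0.9]])

def total_probability(A, pi, b):
    """The 'right thing': sum over all state sequences (the forward algorithm)."""
    alpha = pi * b[0]
    for t in range(1, len(b)):
        alpha = (alpha @ A) * b[t]   # sum over predecessor states
    return alpha.sum()

def best_path_probability(A, pi, b):
    """The approximation: probability of the single most likely state
    sequence (the Viterbi algorithm), found by dynamic programming."""
    delta = pi * b[0]
    for t in range(1, len(b)):
        delta = (delta[:, None] * A).max(axis=0) * b[t]  # max, not sum
    return delta.max()

print(total_probability(A, pi, b))      # >= the Viterbi value below
print(best_path_probability(A, pi, b))  # a close lower bound in practice
```

The two functions are identical except that one sums over predecessor states and the other takes the max; that is the whole difference between "the right thing" and the Viterbi approximation.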
Let's just remind ourselves, then, of our empirical estimate of the mean; that's what the little hat on the mu means. It means it's not really the mean of the Gaussian. Some speech comes into the recogniser, and we're going to pretend, for the purposes of recognition, that the thing that generated it was actually an HMM. That's the paradigm we're working in. It wasn't an HMM, it was a person, but we're going to pretend it was an HMM. Therefore, there's a model out there in the world generating the speech that we're trying to recognise, and it really has a value for the mean and the variance. Those are the true values, but we can only empirically estimate them by looking at the data that this model was generating. That's what the little hat on top means: an empirical estimate of the mean.
The estimate is very simple: sum together all the observations associated with this Gaussian (which ones those are is exactly what we need to solve) and take their mean: just divide by the number of them. And it's the same for the variance. That's just the squared distance, the difference squared (squared so that it's symmetric), again summed over all the data points and divided by the number of them. So it's just the mean squared difference. It's like the width of the Gaussian: on average, how far are the data points from the mean? How broad is this distribution? We stated those without proof, and we're going to apply them.
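A minimal sketch of those two estimates in code, assuming (as such systems typically do, though this section doesn't say so) a diagonal covariance, i.e. one variance per MFCC dimension, with stand-in data:

```python
import numpy as np

# Stand-in data: pretend these 100 MFCC vectors (13 coefficients each)
# are the observations associated with one particular Gaussian.
frames = np.random.randn(100, 13)

N = len(frames)
mu_hat = frames.sum(axis=0) / N                      # empirical mean
var_hat = ((frames - mu_hat) ** 2).sum(axis=0) / N   # empirical variance
```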
Except we don't know which observations to sum up. These observations here, that implies all of them in a sequence, but some of them will have come from one state, some from the next state, and some from the state after that. We have to make that association, so that we know which ones to add up before dividing by the number of them.
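As a purely hypothetical illustration of what "making that association" could look like in the simplest possible way, in the spirit of the naive method mentioned earlier, here is a sketch that just cuts each sequence into equal chunks, one per emitting state, and pools the chunks across utterances:

```python
import numpy as np

def naive_alignment(observations, num_states):
    """Credit frames to states by simply cutting the sequence into
    equal-length chunks: a deliberately crude association."""
    return np.array_split(observations, num_states)

# Stand-in data: three utterances of different lengths, 13 MFCCs per frame.
utterances = [np.random.randn(T, 13) for T in (60, 85, 72)]

num_states = 3
pools = [[] for _ in range(num_states)]   # frames credited to each state
for utt in utterances:
    for j, chunk in enumerate(naive_alignment(utt, num_states)):
        pools[j].append(chunk)

# Each state's Gaussian is then estimated from its own pool of frames,
# using exactly the mean and variance formulas above.
for j, chunks in enumerate(pools):
    data = np.concatenate(chunks)
    mu_hat, var_hat = data.mean(axis=0), data.var(axis=0)
```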
