Hidden Markov Models for ASR

Intro to Hidden Markov Models, comparison to DTW

00:00–00:07 We've been on a journey towards the Hidden Markov Model, and finally we've arrived.
00:07–00:29 We started by talking about pattern matching, and we did that with an algorithm called dynamic time warping. We stored a template, called an exemplar: just a recorded example of some speech we'd like to recognize, perhaps an isolated word. We matched that against some unknown speech that we would like to label.
00:29–00:34 And we measured distance between the stored template and the unknown.
00:34–00:41 We then realized that this lacks any way of modeling the natural variability of speech.
00:41–00:48 The template is just a single stored example; it doesn't express how much we can vary from that example and still be within the same class.
00:48–00:58 So we decided to solve that by replacing the exemplar with a probabilistic model. We looked in our toolbox of probabilistic models and found the Gaussian as the only model that we can use.
00:58–01:01 It has the right mathematical properties and it's very simple.
01:01–01:11 We're going to use the Gaussian as the core of our probabilistic model, and that led us into a little bit of a detour to worry about the sort of features that we could model with a Gaussian.
01:11–01:21 And we spent some time doing feature engineering to come up with features called MFCCs that are suitable for modeling with diagonal covariance Gaussians.
01:21–01:41 So now, armed with the knowledge of dynamic time warping and its shortcomings, and knowing that we've got these excellent features called Mel Frequency Cepstral Coefficients, we're ready to build a probabilistic generative model of sequences of MFCC vectors, and that model is going to be the Hidden Markov Model.
01:41–01:45 We're going to have to understand what's hidden about it and what's Markov about it.
01:45–01:49 So, to orient ourselves in our journey, this is what we know so far.
01:49–01:52 We talked about distances between patterns.
01:52–01:57 We saw that there's an interesting problem there of aligning sequences of varying length.
01:57–02:06 We found a very nice algorithm for doing that, called dynamic programming, and an instantiation of that algorithm, dynamic time warping.
02:06–02:11 We could perform the algorithm of dynamic programming on a data structure called a grid.
02:11–02:15 And that's just a place to store partial distances.
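To make the grid concrete, here is a minimal sketch of dynamic programming on it, with Euclidean local distances. The function names and the simple three-way path constraint are my own illustrative choices, not code from the course:

```python
import math

# The grid D stores partial distances: D[i][j] is the cost of the best
# alignment of template frames 0..i with unknown frames 0..j.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(template, unknown):
    I, J = len(template), len(unknown)
    D = [[math.inf] * J for _ in range(I)]
    # End-pointed speech: the first frames are assumed to align.
    D[0][0] = euclidean(template[0], unknown[0])
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # Best path into this cell from the allowed predecessors.
            best_prev = min(
                D[i - 1][j] if i > 0 else math.inf,
                D[i][j - 1] if j > 0 else math.inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            D[i][j] = best_prev + euclidean(template[i], unknown[j])
    return D[I - 1][J - 1]
```

Each cell holds only the best partial distance so far, which is exactly why the grid is such a compact data structure.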
02:15–02:18 We're now going to invent the Hidden Markov Model.
02:18–02:21 We'll understand what's hidden about it.
02:21–02:27 It's its state sequence, and that's connected to the alignment between templates and unknowns.
02:27–02:31 The state sequence is a way of expressing the alignment.
02:31–02:44 And then we'll develop an algorithm for doing computations with the Hidden Markov Model that deals with this problem of the hidden state sequence. Its data structure is going to be called a lattice, and it's very similar to the grid in dynamic time warping.
02:44–02:46 Think of it as a generalization of the grid.
02:46–02:57 And so we're now going to perform dynamic programming on a different data structure, the lattice. That will be dynamic programming for Hidden Markov Models, and the algorithm gets a special name again.
02:57–03:00 It's called the Viterbi algorithm.
03:00–03:07 That's the connection between dynamic time warping, as we know it so far, and Hidden Markov Models, as we're about to learn them.
03:07–03:10 So we know that the single template in dynamic time warping doesn't capture variability.
03:10–03:17 There were many solutions proposed to that before someone invented the HMM: for example, storing multiple templates to express variability.
03:17–03:22 That's a reasonable way to capture variability: have a variety of natural exemplars.
03:22–03:28 The HMM is going to leverage that idea too, but it's not going to store multiple exemplars.
03:28–03:32 It's going to summarize their properties in statistics.
03:32–03:37 Essentially, that just boils down to the means and variances of Gaussians.
03:37–03:42 So although there are ways of dealing with variability in dynamic time warping, we're not going to cover them.
03:42–03:56 We're going to go straight into using statistical models. In some abstract feature space, let's draw a slice of cepstral feature space, say the 4th versus the 7th cepstral coefficient.
03:56–04:12 We're going to have multiple recorded examples of a particular class, and rather than store those, we're going to summarize them as a distribution, which has a mean and a standard deviation along each dimension, or a variance if you prefer.
04:12–04:15 So here's what we're going to cover in this module, module 8.
04:15–04:19 We already know about the multivariate Gaussian as a generative model.
04:19–04:26 A multivariate Gaussian generates vectors, and we're going to start calling them observations because they're emitted from this generative model.
04:26–04:33 We know that speech changes its properties over time, so we're going to need a sequence of Gaussians to generate speech.
04:33–04:37 So we need a mechanism to do that sequencing, and that amounts to dealing with duration.
04:37–04:42 And we're going to come up with a very simple mechanism for that, which is a finite state network.
04:42–04:49 And so when we put these Gaussian generative models into the states of a finite state network, we get the Hidden Markov Model.
04:49–04:58 We're going to see that when the model generates data, it could take many different state sequences to do that, and that's why we say the state sequence is hidden.
04:58–05:01 It's unknown to us, and we have to deal with that problem.
05:01–05:08 One way of doing that is to draw out a data structure in which the Hidden Markov Model is replicated at every time instant; that's called a lattice.
05:08–05:15 We're also going to see that we can perform dynamic programming directly on a data structure that is the Hidden Markov Model itself, and that's an implementational difference.
05:15–05:26 Both of those are dynamic programming for the Hidden Markov Model, and both ways of doing things, either on the lattice or directly on the Hidden Markov Model itself, are the Viterbi algorithm.
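As a preview, Viterbi dynamic programming over the lattice can be sketched like this. The function signature and the toy log-probability inputs are assumptions of mine, not the course's notation:

```python
import math

def viterbi(log_trans, log_emit):
    """Best log probability over all state sequences.
    log_trans[i][j]: log probability of transitioning from state i to j.
    log_emit[t][j]:  log probability of state j emitting observation t.
    Assumes a left-to-right model that starts in state 0 and must end
    in the last state."""
    T, N = len(log_emit), len(log_emit[0])
    # One column of the lattice per time step, one cell per state.
    best = [[-math.inf] * N for _ in range(T)]
    best[0][0] = log_emit[0][0]
    for t in range(1, T):
        for j in range(N):
            for i in range(N):
                cand = best[t - 1][i] + log_trans[i][j] + log_emit[t][j]
                if cand > best[t][j]:
                    best[t][j] = cand
    return best[T - 1][N - 1]
```

The inner `max` over predecessor states is the same "keep only the best partial path" step as in dynamic time warping; only the data structure and the local score have changed.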
05:26–05:33 When we finish all of that, we'll finally remember that we don't yet have a way of estimating the parameters of the model from data.
05:33–05:38 So for everything here in module 8, we're going to assume that someone's given us the model: some pre-trained model.
05:39–05:41 We don't have the algorithm for training the model yet.
05:41–05:49 So at the very end, we'll make a first step towards that, and we'll just remind ourselves how we can train a single Gaussian on some data.
05:49–05:56 We might call that fitting the Gaussian to the data, or estimating the mean and variance of the Gaussian given some data.
05:56–06:06 And we'll see that that is easy, but the hidden state sequence property of the HMM makes it a little harder to do for a Hidden Markov Model and a sequence of observations.
06:06–06:08 But that's coming in the next module.
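That single-Gaussian training step is simple enough to sketch now. This 1-D version (my own illustration, not the course's code) just computes the sample mean and the maximum-likelihood variance:

```python
def fit_gaussian(data):
    """Estimate the mean and variance of a 1-D Gaussian from data:
    the 'fitting the Gaussian to the data' step."""
    n = len(data)
    mean = sum(data) / n
    # Maximum-likelihood estimate divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance
```

What makes the HMM harder is that we don't know which state's Gaussian each observation belongs to, so we can't directly form these per-state sums.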
06:08–06:15 So we know already that a single Gaussian, a multivariate Gaussian, can generate feature vectors, and we're going to call them observations.
06:15–06:18 That's just to remind ourselves that they are generated by a model.
06:18–06:20 Let's just make sure we fully understand that.
06:20–06:23 Again, I'll draw some part of our feature space.
06:23–06:29 I can't draw in 12-dimensional MFCC space or higher, so I'm always going to draw just two of the dimensions.
06:29–06:32 Any two; it doesn't matter.
06:32–06:35 Two of the MFCC coefficients.
06:35–06:37 How does a Gaussian generate?
06:37–06:42 Well, the Gaussian has a mean, and a mean is just a point in the space.
06:42–06:45 And we have a variance along each of the feature dimensions.
06:45–06:53 We're going to assume there's no covariance, and so there's some variance: a standard deviation in this direction, and perhaps some in this direction.
06:53–06:59 So one way of drawing that is to draw a contour line, one standard deviation from the mean.
06:59–07:00 Like that.
07:00–07:02 This model can generate data points.
07:02–07:04 What does that mean?
07:04–07:09 It means we can randomly sample from the model, and it will tend to generate samples nearer the mean.
07:09–07:14 And just how spread out they are from the mean is governed by the variance parameter.
07:14–07:26 So if I press the button on the model to randomly sample, it generates a data point, or another data point, or another data point, and just occasionally a data point far from the mean.
07:26–07:33 More often than not, they're near the mean, and so it generates data points that have this Gaussian distribution.
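"Pressing the button" can be sketched as random sampling from a diagonal-covariance Gaussian. The means, standard deviations, and seed below are arbitrary toy values of mine:

```python
import random

def sample(mean, stdev, rng):
    # With no covariance, each dimension is an independent 1-D draw.
    return [rng.gauss(m, s) for m, s in zip(mean, stdev)]

rng = random.Random(0)
# 1000 presses of the button: points cluster around the mean [1.0, -2.0],
# with spread governed by the standard deviations.
points = [sample([1.0, -2.0], [0.5, 0.1], rng) for _ in range(1000)]
```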
07:38–07:41 So that's the model being a generative model.
07:41–07:42 But we're not doing generation.
07:42–07:44 We're not doing synthesis of speech.
07:44–07:46 We're doing recognition of speech.
07:46–07:50 So what does it mean when we say there's a generative model?
07:50–08:00 Well, it says we are pretending, we're assuming, that the feature vectors extracted from the speech coming into our speech recognizer were really generated by a Gaussian.
08:00–08:07 We just don't know which Gaussian, and the job of the classifier is to determine which of the Gaussians is the most probable one to have generated a feature vector.
08:07–08:11 Now, of course, the feature vectors weren't generated by a Gaussian.
08:11–08:14 They're extracted from speech, and the speech was generated by the human vocal tract.
08:14–08:16 But we're making an assumption.
08:16–08:22 We're assuming that their distribution is Gaussian, and that we can model it with a generative model that has a Gaussian distribution.
08:22–08:23 So it's an assumption.
08:23–08:31 Now, since these blue data points were extracted from natural speech, and what we're trying to do is decide whether they were generated by this Gaussian, we don't really generate from the Gaussian.
08:31–08:33 We just take the Gaussian's parameters.
08:33–08:39 We take one of these data points, and we compute the probability that this observation came from this Gaussian.
08:39–08:46 And we just read that off the Gaussian curve as the probability density, and we'll take that as proportional to the probability.
08:46–08:57 So the core operation that we're going to do with our Gaussian generative model whilst doing speech recognition is to take a feature vector and compute the probability that it was emitted by a particular Gaussian.
08:57–08:59 And that's what the Gaussian equation tells us.
08:59–09:00 It's the equation of the curve.
09:00–09:04 That's the equation that describes probability density in one dimension.
09:04–09:07 We've got the Gaussian distribution.
09:07–09:13 Along comes some data point, and we just read off the value of probability density.
09:13–09:14 It's that simple.
09:14–09:30 When we do that in multiple dimensions with an assumption of no covariance, we'll just do the one-dimensional case separately for every dimension and multiply all the probability densities together to get the probability of the observation vector.
09:30–09:34 And that's an assumption of independence between the features in the vector.
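That per-dimension multiplication can be sketched directly. One detail that is mine, not the transcript's: implementations usually sum log densities rather than multiplying densities, to avoid numerical underflow.

```python
import math

def log_density(obs, mean, var):
    """Log probability density of one observation vector under a
    diagonal-covariance Gaussian: the product of 1-D densities
    becomes a sum of 1-D log densities."""
    total = 0.0
    for x, m, v in zip(obs, mean, var):
        # 1-D Gaussian log density for this dimension.
        total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return total
```

Reading the density "off the curve" is just evaluating this for a given observation: density is highest at the mean and falls off with squared distance, scaled by the variance.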
09:34–09:42 So that mechanism of the multivariate Gaussian as a generative model is going to be right at the heart of the HMM that we're about to invent.
09:42–09:47 It solves the problem of the Euclidean distance measure being too naive.
09:47–09:49 How does it solve it?
09:49–09:54 Because it learns the amount of variability in each dimension of the feature vector and captures that as part of the model.
09:54–09:58 The Euclidean distance measure only knows about distance from the mean.
09:58–10:00 It doesn't have a standard deviation parameter.
10:00–10:04 So let's cast our minds back to dynamic time warping and just remember how it works.
10:04–10:08 This is the unknown, and that's the time running through the unknown.
10:08–10:10 This is a template that we've stored.
10:10–10:16 We'll assume that the first frames of each always align with each other, because the speech has been carefully end-pointed.
10:16–10:17 There's no leading silence.
10:17–10:26 So we'll align this frame with this frame, and then dynamic time warping just involves finding a path to here that sums up local distances along the way.
10:26–10:29 And those local distances are Euclidean distances here.
10:29–10:33 So the Gaussian is going to solve the problem of the local distance.
10:33–10:40 Let's take a local distance in this cell, between this vector and this vector, and it's going to capture variability.
10:40–10:44 Storing a single frame of an exemplar doesn't capture variability.
10:44–10:46 That's the main job of the Gaussian.
10:46–10:48 But there's another problem with templates.
10:48–10:51 A template doesn't capture variability in duration either.
10:51–10:56 There's a single example, an arbitrarily chosen recording of a digit, say.
10:56–10:59 Maybe this is a recording of me saying "three".
10:59–11:01 The duration of that is arbitrary.
11:01–11:05 It doesn't model the variation in duration we might expect.
11:05–11:07 The template might also be a little bit redundant.
11:07–11:11 It might be that all of these frames sound very similar.
11:11–11:15 They're the vowel at the end of the word "three", and we don't need to store three frames for that.
11:15–11:20 We could just store one, or even better, we could store the mean and the variance of those frames.
11:20–11:28 So we're going to get rid of the stored exemplar, and we're going to get to choose the temporal resolution of the thing that's replacing it.
11:28–11:32 The thing that's replacing it is going to be not an example, but a model.
11:32–11:39 And so my exemplar had seven frames, but maybe I don't need seven frames to capture the word "three".
11:39–11:43 Maybe it's made of just three different parts, three different sounds.
11:43–11:51 So at the same time as getting rid of the exemplar, I'm going to gain control over the temporal resolution, the granularity, of my model.
11:51–11:54 And for now, let's choose three.
11:54–11:56 This is a design choice.
11:56–12:06 We're going to get to choose this number here, and we can choose it on any basis we like, but probably we should choose it to give the best recognition accuracy, by experimentation.
12:06–12:16 So where I used to have frames of an exemplar, I'm now going to instead store their average and variance.
12:16–12:19 And I'm going to do that separately for each of the three parts.
12:19–12:30 So for this part here, I'm going to store a multivariate Gaussian in the same dimensions as the feature space, and it's going to store a mean and a variance.
12:30–12:33 And for the purpose of this module, I'm not going to worry about where they come from.
12:33–12:40 I'm just going to assume that we will later devise some mechanism, some algorithm, for training this model.
12:40–12:45 We're always going to assume, for the rest of this module, that the model has been pre-trained: it was given to us.
12:45–12:49 So instead of storing an exemplar, here I'm going to store a Gaussian.
12:49–12:54 I'm just going to draw one in one dimension, because I can't draw in this many dimensions.
12:54–13:02 It's going to be a mean and a standard deviation, or we could store the variance.
13:02–13:05 For the middle part, I'm going to store another one.
13:05–13:12 It's going to have a potentially different mean and a different standard deviation.
13:12–13:20 And the final one, the third one, will have another mean and another standard deviation.
13:20–13:31 So where there used to be a sequence of feature vectors for each of these three parts, there is now a statistical description of what that part of the word sounds like.
13:31–13:40 So this axis is the model, and the model is in three parts: it models the beginning of the word, the middle of the word, and the end of the word.
13:40–13:43 And we'll just leave it like that, a little abstract.
13:43–13:53 Certainly we're not claiming that these are going to be phonemes, and we've already said that it doesn't have to be three: it's a number of our choosing, and we'll worry about how to choose it another time.
13:53–13:58 So the model so far is a sequence of three Gaussian distributions.
13:58–14:02 I've drawn them in one dimension, but in reality, of course, they're multivariate.
14:1514:18 The duration of the unknown is going to be variable.
14:1814:21 Every time we do recognition, that might be a different number.
14:2114:29 As we'll see when we develop the model, the length of the observation sequence here is generally going to be longer than the number of parts in the model.
14:2914:38 And so we've got to decide whether this Gaussian generates just this frame, or the first two, or maybe the first three.
14:3814:43 And then how do we move up to the next Gaussian to start generating the next frames?
14:4314:47 And when do we move up to the next Gaussian to start generating the remaining frames?
14:4714:55 So we need some mechanism to decide when to transition from the beginning of a word to a middle of a word, and from the middle of the word to the end of the word.
14:5514:59 The simplest mechanism that I know to do that is a finite state network.
14:5915:01 So let's draw a finite state network.
15:0115:12 We're going to be able to transition between the states, and these transitions say that we have to go from left to right through the model.
15:1215:13 We don't go backwards.
15:1315:14 Speech doesn't reverse.
15:1415:20 And we now need a mechanism that says that you can, from each of these states, generate more than one observation.
15:2015:22 Remember, these are the observations.
15:2215:29 And that's just going to be a self-transition on the state.
15:2915:38 And inside the states are the multivariate Gaussians that I had before, with their means and variances.
15:3815:51 So we started with a generative model of a single feature vector, a multivariate Gaussian, and we've put it into a sequencing mechanism, a finite state network, that simply tells us in what order to use our various Gaussians.
15:5116:02 And we have some transitions between those states that controls that, and we have some self-transitions on the state that allow us to emit more than one observation from the same Gaussian.
16:0216:06 So those transitions are our model of duration, and we just invented the Hidden Markov Model.
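The model we've just assembled can be sketched as a generator. The three means, standard deviations, and self-transition probability below are toy values I've chosen for illustration, not anything from the course:

```python
import random

def generate(means, stdevs, self_loop, rng):
    """Left-to-right HMM with one diagonal Gaussian per state.
    Each time step emits one observation from the current state's
    Gaussian; with probability self_loop we take the self-transition
    and stay (emitting more observations from the same Gaussian),
    otherwise we move to the next state."""
    obs, state = [], 0
    while state < len(means):
        obs.append([rng.gauss(m, s) for m, s in zip(means[state], stdevs[state])])
        if rng.random() >= self_loop:
            state += 1   # left-to-right: never backwards
    return obs

rng = random.Random(1)
means  = [[0.0], [5.0], [10.0]]   # beginning / middle / end of the word
stdevs = [[0.1], [0.1], [0.1]]
seq = generate(means, stdevs, self_loop=0.6, rng=rng)
```

Running this repeatedly gives observation sequences of different lengths from the same three-state model, which is exactly the duration variability the self-transitions are there to provide.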
