More Dynamic Time Warping

More on dynamic programming and DTW

This video just has a plain transcript, not time-aligned to the video.
THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS.
Let's just relate that back to the picture we had before when we drew the two sequences out on top of each other.
That was one of the possible sequences, and that would be like moving through the grid in this fashion.
And so the distance-so-far of arriving at this point, assuming that this is an alignment, would be this local distance, plus this local distance, plus this local distance, plus this local distance.
What we'll store in this cell of the grid would be the distance-so-far of coming from the beginning to that cell in the grid.
So let's draw out the full algorithm.
I'm going to generalise now.
I'm going to say that as we move through the two sequences, we have to move monotonically.
In other words, as we advance through the frames in this one, we can't go back and consider a previous frame because speech certainly never does that.
It's monotonic, but we can advance through one frame here, or one frame here, or one in each at the same time.
What does that mean on the grid?
Well, let's always assume that the first two [frames] align.
It means that the allowable moves in the grid go this way, this way, and this way.
That's all we're going to allow.
Let's write the rules of the game.
We have to start here at (1,1).
We have to end up here at (7,5).
I want to find the lowest distance path between those two points.
That's just a sum of local distances.
We've got to visit grid cells along the way, and every time we visit a cell - for example, if we visit this cell - we have to accumulate the local distance between this frame and this frame.
If we 'step' on that cell, it costs that amount of local distance.
We've got to get from here to here, stepping on the cells as we go, accumulating local distances, and try and get there by accumulating the smallest amount of total distance (= global distance) along the way.
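As a concrete sketch of that local distance being accumulated at each cell: the lecture uses Euclidean distance between two feature vectors (one frame from each sequence). This is a minimal version; the function name is my own.

```python
import math

def local_distance(frame_a, frame_b):
    """Euclidean distance between two feature vectors (one frame from
    each sequence). A smaller value means the two frames are more alike."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))

# Two toy 3-dimensional feature vectors: they differ only in the last element.
d = local_distance([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # d == 2.0
```

Every cell the path 'steps on' contributes one such value to the running total.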
I'll draw those two in because we're always going to align first with first and last with last.
Now, I just showed you that we could enumerate every single possible path from the start to the end, compute the global distance, and write it down.
After having done all of them, pick the one with the lowest distance.
That would work.
You'll get the right answer.
That'll be a lot of repeated computation.
Those paths would have an awful lot in common.
Many, many sub-sequences (= sub-paths) will be the same.
And that's wasting computation.
So dynamic programming gets us around that.
It avoids repeating any computation.
Everything is computed just once, and that computation is shared with all the different paths that require that bit of computation.
Let's just do the algorithm from the beginning.
See how that works.
So I'm going to start here.
The distance so far is the local difference (distance) between this and this.
So that's the partial path cost (= distance).
We often see the word 'cost' used in the literature, so cost and distance are the same thing.
A high cost means high distance: it's a bad thing.
We have accumulated some distance.
So the partial distance in this cell is just the local distance because we've only accumulated one.
Now what can I do in my algorithm?
Well, I said the allowable moves were to go this way so I could move here, and I would accumulate a second local distance between this frame and this frame.
I add that to what I'd already had from the previous cell, and that would be my partial distance-so-far.
I write into this cell that distance, which is just gonna be this local distance plus this local distance.
That's it.
Do the same for the other moves.
I'm allowed to go here, and I'll compute the local distance between this pair of frames, add it in, and write it into that cell.
That partial path distance is the cost (or the distance) of getting to that cell from the beginning.
And finally, I'll do this one, get a local distance and write that in there.
So far, so good.
That's really very simple.
Now things get interesting in the next step of the algorithm.
Let's imagine extending this path this way.
What's the partial total distance now?
Well, it's that local distance, plus that local distance, which would already be written to the cell.
So it's whatever was in this cell, plus this local distance.
Write that into that cell.
But there's another way of getting to the same cell.
I could start here with the accumulated distance so far and move here.
I've now got two competing paths arriving in this cell and one of them will be cheaper than the other.
One has a lower distance than the other.
Dynamic Programming tells me I just take the lower of those and write it in the cell.
So if I computed this purple one first, I wrote it in the cell.
Then I computed the green one and I could get there more cheaply, so I just overwrite that value with the lower value.
I would take the minimum of all the values.
So I take the cheapest way of getting to that cell and just write that value into the cell.
And that's the core step of dynamic programming.
How has that saved us computation?
Let's consider starting from this cell and finding the best path to the end.
It's going to involve repeating the operation over and over again.
Well, we do not have to take the purple path and follow it to the end, and take the green path and follow it to the end, because one of them has already won at this cell.
So we only have to pick the winner and propagate that one forward and we're effectively abandoning the other one.
So the core operation in Dynamic Time Warping happens at a cell that we can arrive at from multiple directions.
The global distance-so-far that we write into that cell is just equal to the minimum of these, plus the local distance here and that's it.
We just repeat this operation for every single cell in the grid until we do it for this cell, which, of course, we could arrive at from this direction, this direction, or this direction.
So we take the minimum of those three things, add the last local distance in, and then in that cell is a number - call it D for 'global distance' - and that number is the cheapest way of getting from the beginning to the end, accumulating the smallest amount of local distance on the way.
So that's the core operation, and all you need to do is visit the cells of the grid in an appropriate order, such that all the computation needed has been done by the time you get there.
That's easily achieved by, for example, visiting them row by row in this order - that would work - or visiting them in this order - that would work - or any other order you can think of, such that whenever you get to a cell you've previously visited all the cells it depends on, so that the partial distances are available for you to take the min( ) of, add that local distance, and write the result into this cell.
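The cell-by-cell procedure just described can be sketched in code. This is a minimal version under my own naming, with the grid 0-indexed rather than starting at (1,1), and with a toy 1-D local distance for illustration.

```python
def dtw_global_distance(seq_a, seq_b, local_distance):
    """Fill the grid cell by cell, row by row. D[i][j] holds the lowest
    accumulated distance of any path from cell (0, 0) to cell (i, j),
    using only the three allowable (monotonic) moves."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = local_distance(seq_a[i], seq_b[j])
            if i == 0 and j == 0:
                D[i][j] = d  # start cell: just its own local distance
            else:
                best_prev = min(
                    D[i - 1][j] if i > 0 else INF,              # advance in seq_a only
                    D[i][j - 1] if j > 0 else INF,              # advance in seq_b only
                    D[i - 1][j - 1] if i > 0 and j > 0 else INF # advance in both at once
                )
                # The core DP step: cheapest way in, plus this cell's cost.
                D[i][j] = best_prev + d
    return D[n - 1][m - 1]

# Toy 1-D sequences, with |a - b| as the local distance.
dist = dtw_global_distance([1, 2, 3], [1, 3], lambda a, b: abs(a - b))  # dist == 1
```

Each cell is visited exactly once, and by the time it is visited all three of its possible predecessors have already been filled in.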
The amount of computation in this algorithm is to do with the length of this sequence times the length of this sequence.
So that's 35 operations to be done to compute the total distance.
You've just got to visit each cell once.
Compare that to enumerating the paths one by one.
That would involve this much computation.
And don't worry, I'm not going to write them all out because we'll be here for hours.
It would involve adding this to this...
So that's already 11 computations to do that one path.
And then we forget that we've done that and we do 9 more computations.
Another 9 computations, etc etc for all of the paths until you get to ... another 11 here.
Add all of those together, and you don't have to enumerate them out to see that that's much, much bigger than 35.
Much, much bigger because we're visiting the same cells over and over again.
Dynamic programming avoids that: it visits each cell once.
It does each computation once.
And that's the key idea: sharing computation between all the sub-parts of the problem.
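The saving can be made concrete by counting paths rather than cells. This small sketch (my own code; the 7-by-5 grid size is the one used in the lecture) enumerates how many distinct monotonic paths exist, which is exactly the work that exhaustive enumeration would repeat and dynamic programming shares.

```python
def count_paths(i, j):
    """Count every monotonic path from cell (0, 0) to cell (i, j) using
    the three allowable moves. Each path would require its own chain of
    additions if we enumerated them one by one."""
    if i == 0 and j == 0:
        return 1
    total = 0
    if i > 0:
        total += count_paths(i - 1, j)          # came from below
    if j > 0:
        total += count_paths(i, j - 1)          # came from the left
    if i > 0 and j > 0:
        total += count_paths(i - 1, j - 1)      # came diagonally
    return total

# For the lecture's 7-by-5 grid: 1289 distinct paths, but only 35 cells.
n_paths = count_paths(6, 4)  # n_paths == 1289
```

So enumeration touches the same cells over and over (1289 paths, most of them sharing long sub-paths), while dynamic programming does one computation per cell: 35 in total.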
And that was allowed because knowing that we visit this cell in an alignment makes everything before then independent of what happens going forwards.
In other words, it doesn't matter how you got here.
You could have come from any of these directions.
You don't need to remember that because it doesn't change how you get to the end.
There's no dependency between them: the future is independent of the past.
Given the state or the current cell, we might say, "The past and the future become independent when we know the present."
The present is everything we need to know.
There's no memory in the system.
That concept is a little hard, but we'll see it again with hidden Markov models.
And the key of the hidden Markov model will be to make sure it has no memory so that the past and the future become independent, given the present.
And then we get to do dynamic programming!
So it will turn out that we will invent a model specifically so that we can do dynamic programming with it!
We started with the idea of the distance between two sequences of feature vectors and realised that the only difficult part of that is deciding how to align them with each other, so we know which local distances to add up to make a global distance to announce as the distance between the two sequences.
We would repeat that dynamic programming for each of the templates (each of the exemplars) we have.
One of them will have the lowest global distance, and that will be announced as the label of the incoming word we're recognising.
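Putting the pieces together, recognition by template matching might look like the toy sketch below. The templates, the 1-D 'features', and all the names here are invented for illustration; real systems would use sequences of feature vectors.

```python
def dtw(a, b):
    """Minimal DTW global distance for toy 1-D sequences, with |x - y|
    as the local distance; the grid is filled row by row."""
    INF = float("inf")
    D = [[INF] * len(b) for _ in a]
    for i in range(len(a)):
        for j in range(len(b)):
            prev = 0 if i == j == 0 else min(
                D[i - 1][j] if i else INF,
                D[i][j - 1] if j else INF,
                D[i - 1][j - 1] if i and j else INF)
            D[i][j] = prev + abs(a[i] - b[j])
    return D[-1][-1]

# Hypothetical stored exemplars: one template per word.
templates = {"one": [1, 1, 2, 3], "two": [5, 6, 6, 7], "three": [9, 8, 7, 7]}
incoming = [1, 2, 2, 3]  # the unknown word to recognise

# Run DTW against every template; announce the label of the closest one.
label = min(templates, key=lambda word: dtw(incoming, templates[word]))
# label == "one"
```

One DTW grid is computed per template, and the template with the lowest global distance supplies the recognised label.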
We developed the algorithm.
At the heart of the algorithm is really a very simple operation: taking the minimum.
We had a very simple data structure in which to store the results of the operation: a grid.
We're going to go off and do some feature engineering in the next module and then come back to this idea.
We'll realise that Dynamic Time Warping has many flaws.
That's why it's no longer used for speech recognition.
We're going to fix them all, gonna fix all of that with a hidden Markov model.
We're gonna understand what this word 'hidden' means.
It's the state sequence that's hidden.
That's another hard concept to come.
But we'll find that the data structure that we might use to do dynamic programming with a hidden Markov model is something quite like a grid.
We're going to call it something different, going to call it a lattice (something more general than a grid).
On that data structure we'll also be able to do almost exactly what we just did with Dynamic Time Warping: the same sort of dynamic programming.
It's going to be given a different name because it's applied to a different model: it's going to be called the Viterbi algorithm.
To summarise that then:
Dynamic Time Warping and, in particular, this local distance measure which we didn't say very much about today (this Euclidean distance) is poor.
It doesn't account for variability, so it doesn't have any idea of natural variance in speech.
So we'd better fix that and we're gonna replace it with a probabilistic model which will be learned from multiple exemplars: multiple ones to capture the amount of natural variation.
So we won't learn a model from a single recording of someone saying 'three'.
We'll get them to say 'three' lots of times, or get lots of people to say 'three' lots of times.
We'll form a probability distribution of the feature vectors for that word.
That variability will be captured using Gaussian probability density functions.
The Gaussian has not just a mean but also a variance to capture variability, but also potentially something called covariance, which is how one element of a feature vector covaries with the other elements.
That covariance is going to turn out to be a rather awkward thing.
We'd like to pretend there was no covariance.
We're actually going to do some pretty heavy duty processing to our feature factors to try and make sure there isn't any covariance between the elements of the vector.
In other words, that they become statistically independent of each other, they do not covary.
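Under that no-covariance assumption, the Gaussian density factorises over the elements of the feature vector, so its log density is just a sum of independent 1-D terms. A sketch (the function name is my own):

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a multivariate Gaussian with a diagonal covariance
    matrix. Because we assume no covariance between the elements, the
    joint density is a product of 1-D Gaussians, i.e. a sum in the log
    domain - one mean and one variance per element, no covariance terms."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Toy 3-dimensional feature vector scored against a toy model.
lp = diag_gaussian_logpdf([0.1, -0.2, 0.3], mean=[0.0, 0.0, 0.0], var=[1.0, 1.0, 1.0])
```

With a full covariance matrix there would be a quadratic form and a matrix inverse here; the diagonal assumption reduces the cost to one division per element, which is exactly why we want features whose elements do not covary.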
So we're going to do feature engineering, motivated by our desire to use Gaussian probability density functions, and our very strong desire not to have to model covariance.
We're going to try and remove covariance.
The filterbank features are going to exhibit a lot of covariance, and they're not going to be suitable.
We're going to do some further steps on them.
So we're going to do that feature engineering in the next module.
We'll develop these features, which for very many years were the state-of-the-art in automatic speech recognition, called Mel Frequency Cepstral Coefficients (MFCCs).
We see mel still in there.
The mel filterbank is going to be the first stage in extracting those features, but we're going to go further to remove covariance.
When we've got those features, they will be suitable for modelling with Gaussians that can model the mean and the variance of each element of the feature vector, but assume there's no covariance between the elements.
That will then be the core of our hidden Markov model.
We'll put those Gaussian probability density functions into the states of a finite state machine, and that will be the hidden Markov model.
