Pattern Matching, Alignment, Dynamic Time Warping

Search a grid with Dynamic Time Warping

slownormalfast

This video just has a plain transcript, not time-aligned to the videoTHIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS
So we've talked about storing exemplars, which are just sequences of feature vectors extracted from known recordings with labels, and the idea of computing a distance to the unknown (the thing we're trying to recognise)
we realised that the only hard part of that distance computation is aligning these two variable length sequences with each other so that we know which local distances to add up.
So we're going to have to find some efficient way of considering (exploring) every possible way of aligning the two sequences.
we should do that in the most efficient way possible, because there might be a very large number of such alignments.
There's going to be a beautiful and elegant algorithm to do that called dynamic programming.
Dynamic programming is the general term for this whole family of algorithms that have the same property
As it's applied here, because these are two time sequences, we're going to be warping them in a non-linear away, so this is known as Dynamic Time Warping.
I used the term 'pattern matching' earlier.
Let's just be clear what those words mean.
These are the patterns, and the matching is trying to find the global distance between these two patterns
the global distance will be the one that gives the most favourable alignment of the two frames and then sums up all the local distances.
We want to develop a dynamic programming algorithm, a special form called Dynamic Time Warping.
So how do you develop an algorithm?
That sounds pretty hard!
It's usually easiest to start - when you develop your algorithm - with some step that's actually in the middle of the algorithm, once it's running: the standard iterative step that runs again and again, the one that repeats over and over
and understand that step.
Then, after that, deal with a special cases of starting and ending.
Let's do that then.
We're aligning these two sequences.
There are many, many possible alignments, but we've got part way through the alignment.
We're doing the following.
We're working our way through the template one [frame] at a time, and we're working through the unknown and finding the alignment between the two.
To keep this example simple, I'm going to say my algorithm works through the template one frame at a time.
So each iteration of my algorithm will go one step through the template, and will decide whether to make a step through the unknown, or not.
So the two eventually align.
That's not fully general but we'll generalise that in a moment.
Let's imagine I've got to a state in my algorithm where I have decided to consider comparing (= aligning) the fourth frame in the template with the third framing the unknown.
I'm going to say, "What about the case were these two align?"
One way of doing that, when we have arrived at that point in the alignment, would be to align like this
I could write that out as follows:
the first number is the frame in the template, just counting through in time like this 1 to 7
the second number is the frame in the unknown: counting through the unknown.
So I've done (1,1) (2,2) (3,3) (4,3).
Now let's think about all the other alignments between these two sequences that also align (4,3).
There's this one and there's this one.
Remember, I'm making a simple assumption that I'm just counting 1 to 7 through the template, and I decide how fast it moves through the unknown
So I'm dynamically moving through the unknown, as I just advance through the template.
In this simplifying case, these are the only possible alignments, given those constraints.
See that I have not yet done the computation of what happens next.
All three possible alignments all arrive in this state here, of aligning 4 with 3.
One of these three will be the best alignment.
What's the best alignment?
Well, it's the one that sums up local distances...
for example sums up this one, this one, this one and this one.
That would be the first case
The local distances are just the Euclidean distance between the corresponding frames.
Or I might consider this one.
I sum up those four local distances to find the distance-so-far of getting to here
Or might consider this one from those four local distances and get the distance so far of getting to here.
In general, these three things will have a different distance-so-far of arriving at this point in the alignment/
One of them will be the smallest: that'll be the best one.
And here comes the dynamic programming insight!
Without going any further in the alignment - without doing any of the remaining computation that I need to get to align the two ends of the sequence - I can already choose which of these is best.
Let's imagine that one's the best, and I can abandon these two.
Why?
Because whatever happens after this point here is completely independent of what happened beforehand.
Let's write that out.
If I assume that these two align, then how the frames before that align line and how the frames after it align have no interaction.
They're independent of each other.
The best alignment to the left and the best alignment to the right can be computed independently.
In other words, if I was to do the dumb thing of computing all three of these all the way through to the end - so I kept on going and computed them (and there's of course many possibilities here: there's more than one way of going forward, so lots of computation still to do) - the best way of aligning the remaining frames has got nothing to do with how aligned so far.
In other words, the computation required here is identical to that required here and here, because they're all starting from the same place.
So, I'd be wasting my time!
If I know that this gives the lowest distance so far, summed across those four local distances, there's no point wasting my time doing this computation because it's just the same computation as I'm going to do here anyway.
And since we know that's not winning at this point, they're not gonna win by the end, either.
That's the insight of dynamic programming: that we can save computation because we see that there's repeated computation - repeated sub-parts of the problem
Let's state that again, because this is the key key point of dynamic programming.
The state of the algorithm at which frame 4 and frame 3 align makes everything to the left independent of everything to the right
We can cut the problem into two parts: two halves.
We can compute this part and this part quite separately.
We don't need to repeat all the possible combinations.
That's conceptually quite hard and you probably won't get it the first time.
We're going to go through that, through the full algorithm, in a moment.
Then we're gonna come back and do it all again with hidden Markov models.
What I really want you to do is to understand Dynamic Time Warping as a form of dynamic programming: as an algorithm.
And that's already getting harder, on average, for this course.
Being able to see that dynamic programming can be applied to a particular problem, such as sequence alignment is really quite difficult.
If I presented you with a new problem and said, "Could you apply dynamic programming to this?"
That's really quite difficult.
I would like you to be able to get to that point that you can see it is applicable to two sequences of feature vectors, and later it will be applicable for hidden Markov models.
What is super hard and I don't expect you to be able to do is to actually devise the data structure required for doing dynamic programming.
We're just going to present the solutions to that.
First, dynamic time warping and later hidden Markov models.
We'll never ask you in this course to devise a new data structure and write out a new dynamic programming algorithm for some new problem.
We just really want you to see that dynamic programming works as an algorithm.
What the key insight is: that we could make the past independent of the future by knowing the present.
To able to see that dynamic programming is applicable, but not going the extra step of actually deriving it for a new problem.
We're now going to spell out Dynamic Time Warping as an algorithm.
The algorithm is going to need a data structure in which it stores the information as it proceeds.
That data structure is going to be very simple.
In fact, it's actually going to be something we call a grid.
Here's the problem.
We're exploring all possible alignments between these two sequences: a template (or an exemplar) with seven frames and an unknown with five frames.
And we'd like to consider all possible ways of aligning each of the seven frames with each of the five frames.
So the core operation in that is considering a pairing of frames: an alignment.
For example, here the sixth frame of the template aligns with the third frame of the unknown.
The key element - the little 'building block' of the algorithm - is this alignment between a frame in one sequence and a frame in the other sequence.
So how many possible such pairings are there?
We've got seven things each of which could align with five things.
How many possible pairs?
You could pick one thing from here on one thing from here and pair them up.
How many such pairs that are there?
Well, there's just 7 x 5: each of the seven things could align with each of the five things.
So our data structure is going to need 7 x 5 elements.
The obvious thing then would be to make a grid that's 7 x 5.
That's what we'll do.
I'm going to draw out the template on the vertical axis.
Remember this is the sequence of feature vectors.
That's time, and that's frequency moving up through the filters in the filter bank.
So you could think of those little spectrograms if you prefer.
I'll space them out a bit because I want to draw my grid as big as possible.
So that's the grid.
That's the data structure that the algorithm is going to use.
We've got two sequences of feature vectors, one that has the label and one that we're considering whether to give it the same label, if it's sufficiently similar (if its distance is small enough).
We're trying to compute that distance as a sum of local distances.
We want the smallest possible sum of local distances - the most optimistic one - because this is the only sensible thing to do.
So we've got global distances and local distances going on in this grid.
Let's just spell it out.
What's a local distance?
Well, at some cell which represents aligning this feature vector with this feature vector, there is some local distance we could compute.
We could take these two vectors.
We could compute the Euclidean distance between them and that will be the local distance at that cell in the grid.
We've also got the global distance, which is a sum of local distances for a particular alignment of the two frames.
So let's just draw it out for one particular alignment.
Imagine that we always align the first and the first frame, and the last and the last frame.
An alignment would be a path through the grid.
It might look like this:
We align those two and take their local distance.
Then we move here.
We'll add the local distance between these two and then we might go here.
Here, here? Yeah, and here on
The global distance of this path (of this way of aligning the two sequences) is this local distance plus this one plus this one, this one, and this one.
It was this one plus this one
What we're trying to do is to explore every possible way of getting from here to hear and find the one the adds up local distances to make the smallest possible global distance.
There are many, many paths through this grid.
I'm not gonna draw them all out because we don't have time.
Let's start drawing a few to give you the idea.
There could be really, really stupid ones like this
That's really unlikely: that the first frame of the unknown aligns with all of the frames of the template and then we stay at the end of the template and alignthat last frame with all the remaining frames of the unknown.
Or we could go like this, or we could go like this, or we could go like this, and so on...
You get the point!
All the way to the one that does this.
A very large number of ways.
We're trying to do that in an efficient way.