Dynamic time warping

This is just a taster of what we'll cover in the main lecture, but do watch it so that you start to understand some of the key concepts.

This video just has a plain transcript, not time-aligned to the video.
[THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS]
So we've got a distance measure between one frame and another frame.
We've got an incredibly naive way of deciding which frames to compare with each other, and we can do speech recognition.
We can already do it.
We'll take this feature vector.
This feature vector
we'll compute the Euclidean distance between these things, and then we'll do that for all of the pairs that line up, sum them up and get some overall distance measure
which is the total distance between a sequence and another sequence.
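To make that concrete, here is a minimal sketch of the naive method in Python (my own illustration, not code from the course; the linear index mapping and the random 'MFCC-like' vectors are assumptions):

import numpy as np

def naive_distance(template, unknown):
    # Naive comparison: linearly map each template frame to a frame
    # of the unknown, take the Euclidean distance between each pair,
    # and sum the local distances into one overall distance.
    T, U = len(template), len(unknown)
    total = 0.0
    for t in range(T):
        u = round(t * (U - 1) / (T - 1)) if T > 1 else 0
        total += np.linalg.norm(template[t] - unknown[u])
    return total

# Hypothetical 13-dimensional feature vectors (e.g. MFCC-like)
rng = np.random.default_rng(0)
template = rng.normal(size=(9, 13))   # 9 frames
unknown = rng.normal(size=(7, 13))    # 7 frames
print(naive_distance(template, unknown))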
We made an enormous number of very bad assumptions in doing that!
Almost everything we could think of is wrong with the system, so we're going to spend the rest of the course fixing that, bit by bit.
But: this will do speech recognition.
It will work to some extent, but it won't be very accurate.
So let's start thinking about what's wrong with the way that we did that.
By fixing those problems, we'll eventually end up inventing Hidden Markov Models (HMMs).
First: speech does not stretch linearly.
Stops, plosives, sounds like that hardly stretch at all.
Vowels are very stretchy.
They can have a much wider variance in duration.
So when we compare two things of different durations, we probably want to stretch the 'stretchy' parts more and the less 'stretchy' parts less.
Technically speaking!
We don't really want to do that through some sort of knowledge-based procedure. That wouldn't be possible anyway, because we don't know what the sounds in the unknown are: we're still trying to recognise it.
So we can't say, "Oh, we're trying to recognise this vowel. Let's stretch it a lot."
We don't know it's a vowel yet.
We have to automatically find out how to stretch either the template or the unknown so that they align with each other:
we need something that's automatically doing the stretching.
And that's the dynamic part of the time warping.
It's dynamically changing the stretch factor at different points in the template and the unknown word.
So we're going to have to do something that's not linear:
something that's different at different points in the word.
And this local distance measure?
Well, it's just geometry!
What's that got to do with anything?
In the nine o'clock lecture we saw that it actually does have something to do with something.
And if we take this geometrical measure and start fixing up some of the problems with it, before we know it: we've also invented probability density functions and specifically the Gaussian distribution, which is something very close to the Euclidean distance measure, in some normalised space.
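A tiny numerical check of that connection (my own sketch, not from the lecture): the negative log density of a spherical Gaussian is half the squared Euclidean distance to the mean, plus a constant, once each dimension is normalised to unit variance.

import numpy as np

def neg_log_gaussian(x, mu, sigma=1.0):
    # Negative log density of a spherical Gaussian: a squared Euclidean
    # distance scaled by 1/(2 sigma^2), plus a constant that does not
    # depend on x. Minimising it = minimising Euclidean distance.
    d = len(x)
    sq_dist = np.sum((x - mu) ** 2)
    const = 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return sq_dist / (2 * sigma ** 2) + const

x = np.array([1.0, 2.0])
mu = np.zeros(2)
print(neg_log_gaussian(x, mu))      # behaves like a distance
print(0.5 * np.sum((x - mu) ** 2))  # half the squared Euclidean distance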
So this Euclidean distance is not stupid or crazy.
It's just not going to be ideal yet.
So we're going to deal with this problem first, which is the problem of aligning two sequences of different lengths in some optimal way, automatically.
So there's the statement of the problem, just to remind you.
Hopefully this notation is reasonably transparent.
Inside each of these is a little feature vector, okay?
And this thing here is a sequence of feature vectors,
and that's time in frames. Are we happy with this notation? Not too confusing?
Some things are horizontal, some things are vertical: we've got to draw some pictures.
So there's some alignment between these things.
Imagine these are the same word (because we're optimists!)
We're going to try and align them the best we can
Imagine that's the template for 'one' and this unknown word happens to be a recording of the word 'one'
so we would like to align all the [w] bits with each other.
And all the schwa bits in the middle with each other, and the [n] bits with each other, somehow, like this.
So we want to align the sounds that are most similar in the two words.
I can't think of anything that's better than that.
We wouldn't want to align the wrong sound with some other sound.
We wouldn't want to do anything like time-reversing it.
That would be ridiculous.
So the alignment has to have certain constraints, and the constraints are going to be that similar sounds in the unknown align with the most similar thing we can find in the template, in some sense, with the constraint that we can't time reverse.
So we can't do this.
We can't swap frames around.
That makes no sense at all.
Speech doesn't work like that.
When we speak faster, we don't start swapping sounds around in words.
So we have to come up with some definition of the correct alignment, or the best alignment.
We'll first do that.
And then we're going to find the most efficient algorithm we can, to find that alignment quickly.
So the question is "How can one thing compare to two things?"
So we've got this here, where one thing aligns with two things.
That will become clear, hopefully, when we get on to the time alignment.
Okay, so we're going to define what the best alignment is.
Okay, we could say something like: every frame in the unknown has to align with at least one frame of the template,
and each frame of the template has to align with at least one frame in the unknown,
and therefore, if one's longer than the other, at some point two things in one of them must align with one thing in the other.
We'll see a diagram that will hopefully make this clear.
But if we align the two, all we're saying is that the distance is...
so there's a frame, there's a vector, here
there's a vector here.
We're taking the Euclidean distance between this and this.
Let's call that D_1
we're just going to add it to the Euclidean distance between this and this.
So they're just added up: a sequence of local distances.
Now, since we're optimists, and since we hope that at some point we will align the unknown word with the correct template (hopefully it's in our vocabulary)
we'll align it with the right one and get a small distance
we can define the best alignment as the one that minimises this sum of local distances.
So these are local distances, and we're going to add them all up, to come up with some total distance measure.
We're going to define the best alignment as the one that makes this total distance as small as possible.
What else would you suggest?
To make it as large as possible would be ridiculous!
We'd by definition come up with ridiculous alignments like this one.
So the only sensible thing I can possibly think of that defines the best way to align a template with an unknown is to find the correspondences such that - when you add up all the local differences (distances) - the sum is the smallest possible.
It's as closely-matched as possible
That will make similar sounds in one tend to line up with similar sounds in the other, because those will give small distances, and therefore the sum of distances will be small.
So you just define the best (or the correct) alignment as the one that gives the smallest total distance: that's the sum of local distances between the two things.
Now we've got to find it.
When we've found it, we can just add up the local distances.
So that's what the rest of the lecture is going to be about: how can we efficiently find the best alignment between these two things?
And there are many, many possible alignments.
Let's draw some of them out.
There's one on the slide there.
Let's draw some other ones.
Here's one
Okay, and you could go home and write out all of the other ones you could think of.
And you'll see, there's an extremely large number of ways of aligning something with N frames with M frames, some combinatorial huge number.
And the longer the sequences get, the worse that gets.
If you've got to align something that's 100 frames long with something that's 110 frames long: don't do that as homework, because you'll still be doing it in December!
So there's a very, very large number of ways of aligning these two sequences, and it gets worse the longer they get.
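To get a feel for just how large, here is a quick count (an illustrative sketch; the right/up/diagonal step pattern is one standard choice of alignment moves):

from functools import lru_cache

@lru_cache(maxsize=None)
def num_alignments(n, m):
    # Count the monotonic alignments of an n-frame sequence with an
    # m-frame sequence: paths through the alignment grid that only
    # move forwards in time (no time reversal), ending corner-to-corner.
    if n == 1 or m == 1:
        return 1  # only one path along an edge of the grid
    return (num_alignments(n - 1, m)
            + num_alignments(n, m - 1)
            + num_alignments(n - 1, m - 1))

print(num_alignments(9, 7))       # tens of thousands, even for a tiny example
print(num_alignments(100, 110))   # an astronomically large number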
And so we need to do that efficiently: computationally efficiently.
So we're looking amongst all the different ways of aligning these two sequences for the best one.
We're looking; we're searching.
and this is a key concept in speech recognition, at recognition time.
Searching amongst all the different possibilities for the one, in this case, with the lowest distance.
Eventually we'll be searching for the one with the highest probability when we get to HMMs, but it's exactly the same problem.
And this dynamic programming is going to be an efficient way of searching amongst these many, many possibilities without doing it the dumb way, but with getting exactly the same answer.
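As a preview, here is a compact dynamic programming sketch of that efficient search (my own illustration; the allowed moves and the endpoint constraints are one common choice, not necessarily exactly what we'll use in class):

import numpy as np

def dtw_distance(template, unknown):
    # Find the minimum total distance over all monotonic alignments,
    # with first frame aligned to first frame and last to last.
    T, U = len(template), len(unknown)
    # local[i, j] = Euclidean distance between template frame i
    # and unknown frame j.
    local = np.array([[np.linalg.norm(t - u) for u in unknown]
                      for t in template])
    # D[i, j] = cost of the best alignment of the first i+1 template
    # frames with the first j+1 unknown frames.
    D = np.full((T, U), np.inf)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(U):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(D[i - 1, j])      # unknown frame repeats
            if j > 0:
                candidates.append(D[i, j - 1])      # template frame repeats
            if i > 0 and j > 0:
                candidates.append(D[i - 1, j - 1])  # both move on
            D[i, j] = local[i, j] + min(candidates)
    return D[-1, -1]

This considers exactly the same set of alignments as brute-force enumeration, but the work grows only with T times U, not with the combinatorial number of paths.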
So before we go on, I'm going to define some terms because I'm probably going to be a bit loose with some terminology.
So let's just define what we mean by some of these terms.
You'll see different terms in the literature being used.
If we've got two vectors (two feature vectors) and we compare them, then when we're talking about Euclidean distance we might call this thing a distance, and it has a geometrical interpretation.
It's the distance.
If we plotted the vectors on a diagram, took a ruler, and measured the distance between the two points, we would get the Euclidean distance: so I call it a distance.
Small distances mean things are similar to each other, and the distance of 0 means they're identical to each other.
Imagine we have plotted our vectors on some plot.
Two vectors like this would have a very small distance.
And as the distance goes to 0, they end up being identical.
Very often, in the literature, we might refer to that in some other way.
We might call it a cost: it behaves in the same way as a distance. Big costs mean things are very dissimilar - it's bad: things don't match very well.
We could say that's 'high cost' or it's 'expensive'.
So these two terms, 'distance' and 'cost' are sort of interchangeable.
High cost / high distance: things very dissimilar, things are very far apart.
When we get on to probabilistic modelling, we'll switch our terminology.
We'll talk about things like probabilities and likelihoods.
They behave the other way around.
High probabilities mean things are similar to each other: a data point is likely to have come from a particular distribution,
because it's very close to the mean of that distribution.
So although both sets of terms are fine, bigger is better for probability, and smaller is better for distance and cost.
So we're going to start with distances and costs, and later we're going to change.
So I'll try to stick to using the word 'distance'.
But I might occasionally, accidentally, say 'cost'; they're going to mean the same thing.
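The standard bridge between the two vocabularies (a general convention, not something specific to this lecture) is the negative logarithm, which turns a probability, where bigger is better, into a cost, where smaller is better:

import math

# Negative log probability behaves like a distance or cost:
# probability 1.0 gives cost 0 (a perfect match), and smaller
# probabilities give larger costs.
for p in [1.0, 0.5, 0.1, 0.001]:
    print(f"probability {p:g} -> cost {-math.log(p):.3f}")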
Let's write down the simplest, most stupid algorithm that will work: it will give us the right answer, but be very expensive to compute.
It will take a lot of operations: it will be slow.
and then we'll get some insight into that and see that we could have computed exactly the same thing with no approximations in a much more efficient way.
So let's draw this diagram.
So you've worked out by now that I like pictures, and as long as the notation's not confusing, the picture will help you understand.
But if the notation's confusing, of course, it will be worse than not having a picture.
So let's make sure we're all happy with the notation we're going to use here.
We're going to align two things: a template (I've drawn it along this axis of this square) and the template has got 9 frames.
Let's just spell out what the template really looks like:
it's got feature vectors in sequence, like this, and there are 9 frames
and that's time, but I've drawn it this way on the diagram
Along the other axis I've drawn the unknown word, and that's got 7 frames.
So that looks like this.
And there are 7 frames in that unknown
The question is: among all the different ways of lining these things up with each other, find the one that, when you add up each of these local distances to get your big total distance, is the smallest possible
with the constraint that you need to do things in order.
So you need to go through this one in order and through this one in order, and you can't do any time reversal or anything like that.
And the other constraint - let's put another simple constraint in: the first frame has to align with the first frame, and the last frame has to align with the last frame.
We have to do that, otherwise we'll miss those frames out.
We've got to account for all the frames.
We've got these little constraints.
So that's what this diagram represents.
Each of the points in this diagram - take this one here - represents an alignment between one frame in the template and one particular frame in the unknown:
it's one particular pairing of frames.
It says that this point here represents the alignment of frame 6 of the unknown with frame 3 of the template.
Let's just draw it out longhand.
There are 9 frames (badly drawn) of the template.
Here are 7 frames of the unknown, and this point here says: frame 6 of the unknown
(1, 2, 3, 4, 5, 6), this one, lines up with frame 3 (1, 2, 3) of the template.
It says these two things line up.
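In code, that diagram is just a matrix of local distances, one entry per pairing of a template frame with an unknown frame (a sketch with made-up data; note Python indexes from 0 where the lecture counts from 1):

import numpy as np

rng = np.random.default_rng(1)
template = rng.normal(size=(9, 13))  # 9 frames along one axis
unknown = rng.normal(size=(7, 13))   # 7 frames along the other

# grid[i, j] = Euclidean distance between template frame i+1 and
# unknown frame j+1; each grid point is one possible pairing.
grid = np.linalg.norm(template[:, None, :] - unknown[None, :, :], axis=-1)
print(grid.shape)   # (9, 7)
print(grid[2, 5])   # template frame 3 paired with unknown frame 6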
[WE'LL COMPLETE THE ALGORITHM IN CLASS...]

To be continued in the lecture…!