Linear time warping

The simplest way to deal with variable duration is to stretch the unknown word to have the same duration as the template.

So, with that in mind, here's our feature vector.
This thing here:
we're just going to use the fast Fourier transform to extract it, for now, from 25 milliseconds of speech, and get a vector of some hundreds or thousands of numbers.
Don't worry too much.
We do that every 10 milliseconds.
So in this diagram now, this is time.
We're going to do it for one word,
and we're going to do it for another word.
And then we're going to get to the crux of the matter.
How do we tell how far apart these two things are? Are they the same word? Are they different words? Different recordings, but are they different recordings of the same token... sorry, the same TYPE, or not?
Okay, so this idea here of distance is the crucial concept we're going to get to in today's lecture.
Later on, we'll replace this idea of distance with a statistical idea.
The idea of probability or likelihood.
How likely is it that one word is the same as another word?
But today we're taking a non-statistical, non-probabilistic view.
How far apart are they? Think of it as a difference or distance.
The ideas are going to lead naturally, one into the other.
Let's imagine we've got these two recordings of words.
We've got this recording here.
Let's just play one like that
we're going to measure the difference between these two recordings of words.
One... sorry, this one.
One - that's me saying one,
and I'd like to measure the difference between that and this other recording of me saying this word: one, one.
So we could immediately see there's going to be a bit of a problem here because one of them is longer than the other.
So the first problem is they don't line up properly with each other.
So in general, speech has variable duration.
That duration might not be very informative about which particular word it was.
These are both perfectly reasonable examples of the word one, one is just longer than the other.
So the first thing our pattern matching technique is going to have to deal with is that the exemplar (the stored pattern) and the thing we're trying to recognise are not the same duration.
We have to line them up with each other.
So let's just do that in a really dumb way.
Let's just stretch one of them
So we have a template
We know its label because we recorded it, especially for the purpose of building our system.
It's a template of the word one.
Just remember, each of these little boxes here is really a vector.
It's a snapshot of the spectral envelope.
If you're not comfortable with this idea, just think of it as the formants at that moment in time, and that everything's a vowel.
We've got an unknown word that somebody has spoken into our recogniser, and our job is to say: is it a one or not? That's our job.
Of course, if this recogniser is going to be any use, it doesn't just say 'is it a one or not?': it's got templates for all the words we might expect.
So we had to plan ahead, when we built the system, for what words we might expect.
We've got templates for all of them, and we're comparing against all of them.
We're going to do that one at a time, just remember the distances, and then pick the one with the shortest distance.
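Here's a minimal sketch, in Python, of that decision rule: compare the unknown against every stored template and keep the label of the closest one. The function and variable names are just for illustration, and global_distance stands for the template-to-unknown distance we build up over the rest of this lecture.

```python
def recognise(unknown, templates):
    """Return the label of the stored template closest to the unknown word.

    `templates` is a list of (label, feature_sequence) pairs.
    `global_distance` is a hypothetical name for the distance between a
    template and the unknown, defined later in the lecture."""
    best_label, best_distance = None, float("inf")
    for label, template in templates:
        distance = global_distance(template, unknown)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```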
So we're down to this core problem here: the distance between the template and the unknown thing.
The first thing we're going to worry about is the fact that they're not the same duration.
What that means, if the frame rate is fixed at one frame every 10 milliseconds (so this time interval here is 10 milliseconds, and that axis is time), is that there's a different number of frames in one of them than in the other.
So it's not at all obvious which frame to compare to which frame.
So we deal with that problem first.
So let's do the simplest thing we can think of.
And that's just to stretch one.
And if we stretch it (not in the waveform domain, as we did there) but in the feature domain in the sequence of frames we'll just line up the frames in this way.
So our unknown one here is a bit shorter.
Just stretch it and then we'll just approximately line them up and we'll say that the beginning of this word here, we're going to compare it to the beginning of the word here.
That seems reasonable.
So, to see if one word is the same as another word, we'll compare the beginning and the middle and the end;
in fact, we'll just compare every frame.
And then this frame here, we'll compare it to this frame here.
This frame here will also be compared to this frame here, so we'll do that.
This one we'll compare to this one.
These two.
These two; probably these ones; this one and this one.
So make some approximate alignment between them.
We've just done that by linearly stretching the shorter one to be as long as the longer one - or rather, linearly stretching the unknown so that it fits the template.
So if it's too long, we squash it.
If it's too short, we stretch it.
We'll just make some approximate alignment between the frames.
Let's imagine that's good enough.
Let's imagine speech just linearly stretches.
If I say one and one, all the sounds just stretch the same amount.
Let's pretend that that's okay, so you can see the method here: we're just going to pretend things and keep it really simple,
work out why that didn't work, and then make things a little more complicated in small steps.
So we'll go with linear stretching - linear time warping - as a way of lining up quick repetitions of words with slow repetitions of words.
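Here's a minimal sketch of that linear stretching, done in the feature domain. It assumes the frames are stored as a NumPy array with one row per 10 ms frame; the function name is just for illustration.

```python
import numpy as np

def linear_time_warp(features, target_num_frames):
    """Linearly stretch (or squash) a sequence of feature vectors so it
    has exactly target_num_frames frames: each output frame is copied
    from the proportionally-placed input frame (nearest neighbour)."""
    num_frames = len(features)
    positions = np.linspace(0, num_frames - 1, target_num_frames)
    indices = np.round(positions).astype(int)
    return features[indices]
```

Stretching a short unknown just repeats some of its frames; squashing a long one skips some.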
And remember, it might be the case that the unknown and the template are different.
Then in general, they are going to be different durations.
So we've lined up the two things of different durations.
And now what we're going to do, we're going to measure all these local differences.
This frame with this frame, this one with this one, this one with this one, this one with this one, and this one with this one.
make these correspondences.
So in our real example here, what we're saying is that once we've linearly stretched things, we're just going to compare this 25 milliseconds of speech - about this much - with the bit that just lines up with it here.
This bit with this bit, this bit with this bit, and so on.
We're going to assume that that's the [w].
And that's also the [w].
And after stretching, they just end in the same place.
That's the vowel.
And that's the nasal.
Likewise, here we're going to compare short bits of speech here with short bits of speech here.
Rather than stretching in the time domain, which is an expensive operation, we extract features and then just stretch the frames of the features
so they line up with each other.
It seems reasonable.
I mean, it would not be reasonable to compare this bit of the word with this bit of the word: that doesn't make any sense at all.
It does make sense to compare the beginning with the beginning, the middle with the middle and the end with the end.
And if they're all similar, then we're going to say the two words are similar and therefore this is likely to be two repetitions of the same type.
The same word
So we're going to do our pattern matching, then:
we're going to find the global distance between the unknown thing and the template as the sum of all the local distances.
Are the beginnings the same? Are the middles the same? Are the ends the same?
We break the big problem down into all these little small problems.
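As a minimal sketch, and reusing the linear_time_warp function sketched above plus a local_distance function between two frames (defined in a moment), the global distance is just a sum over aligned pairs of frames. The names are for illustration only.

```python
def global_distance(template, unknown):
    """Global distance between a template and an unknown word:
    linearly warp the unknown to the template's length, then add up
    the local, frame-by-frame distances."""
    warped = linear_time_warp(unknown, len(template))
    return sum(local_distance(t_frame, u_frame)
               for t_frame, u_frame in zip(template, warped))
```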
So now the question is about this thing here: remember, this is a vector of numbers, and this one here is a vector of numbers.
These are of the same length because they are the same type of thing - they're both the fast Fourier transform of 25 ms of speech, or the set of formants, or whatever else you want to think of them as.
How do we say how far apart these two things are? What's the distance between two vectors?
An easy way to think about vectors is in geometry.
So let's pretend these vectors have just got two elements.
I'm just gonna pretend they've got two, because my writing surface here has got two dimensions.
I can't draw in three dimensions.
So we just think of them as, you know, X and Y.
it will generalise: anything we can do in two dimensions we can do in any number of dimensions.
We just find it more and more difficult to draw them.
So we just pretend that the features we've extracted have two dimensions (first formant and second formant, for example).
And so a really simple distance is just to think in geometry.
So here's X
Any vector has got some X value and some Y value;
this vector is just a point in this space.
Are we happy with this idea? Two numbers just being a point in space?
So we go along the X amount and up the Y amount: so there's one point, and there's the point we're trying to compare it to.
So that's this one.
And we've drawn them as two points in the vector space.
We're just going to compute the geometrical distance between them.
We could do it on paper: we could plot them on paper and get the ruler out.
Measure it.
How many millimetres apart are they? That's going to be our distance.
That's easy to automate.
There's a simple equation for that, because it's just this nice right-angled triangle.
We know that the sum of the squares on the two sides is equal to the square on the hypotenuse, right - happy with this level of geometry?
So this distance between them comes from the sum of the squares of these differences.
We take the square root and get some distance.
So there's a distance between two vectors
And this equation happily generalises to 3 dimensions or 39 dimensions or 1000 dimensions.
It doesn't matter.
Just expand the equation: instead of X and Y, I've got X, Y, Z... any number of coefficients.
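Here's that Euclidean distance as a minimal sketch; the same line of code works whether the vectors have 2 elements or 39 or 1000. The function name is again just for illustration.

```python
import numpy as np

def local_distance(frame_a, frame_b):
    """Euclidean distance between two feature vectors of equal length:
    the square root of the sum of squared differences - Pythagoras,
    generalised to any number of dimensions."""
    difference = np.asarray(frame_a) - np.asarray(frame_b)
    return np.sqrt(np.sum(difference ** 2))
```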
Okay, we've actually built a speech recognition system.
It's not going to be very good, but it's going to work.
Let's just recap how it's going to go.
We're going to record reference words.
We're going to divide them into frames of 25 ms duration, and the frame shift is going to be 10 milliseconds.
That means they're going to overlap, and the reason for that is that in the signal processing - as you've hopefully realised by watching the videos - when we extract a frame of speech like this, the first thing we do, before any signal processing such as the Fourier transform, is apply a tapered window to fade it in and fade it out, to avoid edge effects.
If you don't know why, then that means you haven't watched the video!
And that means that we're gonna lose the information right at the edges, as it's going to be just faded away.
We're gonna have lots of overlapping windows of speech; they're going to be spaced 10 milliseconds apart and each have a duration of 25 milliseconds.
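Here's a minimal sketch of that framing and feature extraction, assuming the recording is a 1-D NumPy array and a sample rate of 16 kHz (the sample rate is an assumption, not something stated in the lecture). It uses a Hamming window as the tapered window and the FFT magnitude spectrum as the feature vector.

```python
import numpy as np

def extract_features(waveform, sample_rate=16000,
                     frame_ms=25, shift_ms=10):
    """Cut the waveform into overlapping 25 ms frames every 10 ms,
    apply a tapered (Hamming) window to fade each frame in and out,
    and take the FFT magnitude spectrum as the feature vector."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # samples between frame starts
    window = np.hamming(frame_len)                   # the tapered window

    features = []
    for start in range(0, len(waveform) - frame_len + 1, shift):
        frame = waveform[start:start + frame_len] * window
        features.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum
    return np.array(features)                        # one row per frame
```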
We'll do that for the templates: we'll get a sequence of vectors, and that will be a template.
We'll store it with a label attached.
We'll do that for all of the words we're expecting to recognise - so maybe the 10 digits
We'll then have our unknown word, and we'll do exactly the same procedure:
we'll divide it into frames, we'll extract features, and stack them up.
We'll then realise that, in general, they're not of the same duration.
So we're then going to apply linear stretching.
We do this linear stretching to make them the same duration,
and we're then going to add up all the local distances.
All these local distances between pairs of frames - and that local distance is just going to be this thing, D.
And D is just going to be this very simple geometrical distance, and that is a complete speech recognition system.
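Putting the recap together, and reusing the hypothetical functions sketched earlier (extract_features, recognise with its global_distance, linear_time_warp and local_distance), the whole system is just a few lines; labelled_recordings and unknown_recording are placeholder names for the recorded data.

```python
# Build the template store once, from recordings we made ourselves,
# so every template comes with a known label (e.g. the ten digits).
templates = [(label, extract_features(waveform))
             for label, waveform in labelled_recordings]

# Recognise an unknown word: extract features in exactly the same way,
# then pick the template with the smallest global (summed local) distance.
unknown = extract_features(unknown_recording)
predicted_label = recognise(unknown, templates)
```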
