Exemplars and Distances

We start to look at the concepts of distance and alignment between sequences of speech data.

So in this module, we're not going to develop any fancy probabilistic model for doing speech recognition.
We're going to do something very, very simple.
We're going to do pattern matching, and we're going to do it this old-fashioned way for a very good reason: it helps us understand the dynamic programming algorithm, which we're going to need again when we get to hidden Markov models.
So if I take a sequence of feature vectors for, say, a recorded spoken word that I know the label for, I'm going to call that an exemplar: a stored example with a label.
It looks like this.
I've got somebody to record the word 'three'.
I've got the waveform and from that I've extracted the feature vectors, and that is what I'm going to call an exemplar.
If I want to do speech recognition for just the digits, I'll store one exemplar for each digit that I want to recognise.
So 1, 2, 3,... and so on.
I'll store them, and all I will store are their sequences of feature vectors.
So that's an exemplar of the word three, and we can throw away the waveform.
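To make the idea concrete, here's a minimal Python sketch of storing exemplars. This is my illustration, not code from the lecture; `extract_features` is a hypothetical stand-in for a real front end (such as a filterbank), and here it just fabricates features of a plausible shape.

```python
import numpy as np

# Stand-in for a real front end (e.g. a filterbank): in practice this
# would turn a waveform into a (num_frames, num_dims) array of feature
# vectors. Here we fabricate random features of a plausible shape.
def extract_features(waveform, num_dims=20):
    num_frames = max(1, len(waveform) // 160)  # e.g. one frame per 10 ms at 16 kHz
    return np.random.randn(num_frames, num_dims)

# One stored exemplar per word in the vocabulary: a label paired with a
# sequence of feature vectors. The waveform is discarded after extraction.
vocabulary = ["one", "two", "three"]
exemplars = {word: extract_features(np.zeros(16000)) for word in vocabulary}
```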
I'm going to restrict everything to whole words in isolation.
So we're going to build an isolated word recogniser.
That sounds a bit restrictive, but it won't be.
It's actually easy to generalise that.
But we won't do that generalisation until we get to hidden Markov models.
For the purposes of Dynamic Time Warping we'll completely restrict ourselves to isolated whole words.
And let's just assume we're doing digits, to keep things simple.
So I've stored an exemplar of every word that is in my vocabulary, and I'm going to have an incoming speech signal from which I extract a sequence of feature vectors.
I would like to put a label on that: I would like to do automatic speech recognition on it.
I'm going to do that by measuring the distance between that incoming unlabeled sample and all of my stored exemplars.
So when you think about how you measure the distance between two sequences of feature vectors, here's my exemplar.
So an exemplar is a label on a sequence of feature vectors that's stored inside my speech recogniser,
and I'm going to measure the distance between that and some unknown incoming speech.
So, of course, we do the same feature extraction for that, throw away its waveform.
And now I want to measure the distance between these two things.
The distance is going to be the dissimilarity, so big distance means they're very unlike each other.
They're unlikely to be repetitions of the same word.
A small distance (a small dissimilarity = high similarity) means they're more likely to be repetitions of the same word.
I'm going to use this distance to decide, for some incoming unlabeled sample of speech, which of all my exemplars is closest: which has the smallest distance.
I'm going to put that label onto this unknown incoming speech.
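That decision rule is easy to sketch in code, assuming some `sequence_distance` function that returns the global distance between two sequences of feature vectors (we're about to build one up from local distances):

```python
def recognise(unknown, exemplars, sequence_distance):
    """Return the label of the exemplar closest to the unknown sequence.

    unknown           -- (T, D) array of feature vectors for the incoming speech
    exemplars         -- dict mapping word labels to stored feature-vector arrays
    sequence_distance -- function giving the global distance between two sequences
    """
    # Smallest distance = most similar = most likely the same word.
    return min(exemplars, key=lambda label: sequence_distance(unknown, exemplars[label]))
```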
So how could you measure the distance between these two sequences?
The simplest way I can think of is to define it as a sum of local distances.
I will measure how similar the start of this thing (the exemplar is on the top; sometimes we call that a 'template') is to the start of the unknown.
I'll measure how far apart the middles are, and how far apart the ends are.
So I'm going to measure local distances, and I'm going to define the total (= global) distance between the sequences as the sum of all of those local differences (or distances).
You'll hear me using the terms 'distance' and 'difference'; they mean the same thing.
The global distance between this exemplar and this unknown is just equal to the sum of these things.
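In code, given some alignment (a list of pairs saying which exemplar frame is compared with which unknown frame), the global distance is just a sum. This is a sketch of the idea, not the lecture's own code; `local_distance` is whatever per-frame distance we choose.

```python
def global_distance(exemplar, unknown, alignment, local_distance):
    """Global distance = sum of local distances along a given alignment.

    alignment -- list of (i, j) pairs: frame i of the exemplar is
                 compared with frame j of the unknown.
    """
    return sum(local_distance(exemplar[i], unknown[j]) for i, j in alignment)
```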
Now how did I know that those particular local distances are the right ones to add up?
Well, I don't yet know that!
What we need to do is form some correspondence (alignment) between the sequence of feature vectors for the exemplar and the sequence for the unknown.
There are many possible alignments.
So the hard problem isn't one of computing distance, because I'm just going to keep a very simple local distance.
For now, I'm going to use the Euclidean distance between the two vectors.
A simple geometric distance.
That's easy.
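As a sketch (using NumPy, which is my choice here, not something from the lecture):

```python
import numpy as np

def local_distance(x, y):
    """Euclidean (straight-line geometric) distance between two feature vectors."""
    return np.linalg.norm(x - y)
```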
The hard problem is exploring all of these possible alignments and deciding which one is the one to use to compute the global distance as the sum of local distances.
So which local distances are the right ones to add up?
We proposed earlier that we'll simply make this alignment as linear as possible.
So we just stretch it in a uniform way - perhaps like that - and always use that.
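Here's a minimal sketch of that uniform stretching: each frame of the unknown is paired with the proportionally placed frame of the exemplar. Feeding this alignment into the `global_distance` sketch above, with the Euclidean `local_distance`, gives a linearly-warped comparison between the two sequences.

```python
def linear_alignment(len_exemplar, len_unknown):
    """Naive linear time warping: pair each unknown frame with the
    proportionally corresponding exemplar frame.
    Assumes both sequences have more than one frame."""
    scale = (len_exemplar - 1) / (len_unknown - 1)
    return [(round(j * scale), j) for j in range(len_unknown)]
```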
That's clearly too naive.
Speech isn't like that.
We want to do something better than that.
We want to dynamically stretch our sequences, and what we're going to do is make sure that the most similar parts of the exemplar are compared to the most similar parts of the unknown.
Compare 'like with like' - it's the only obvious thing to do.
The hard problem is finding the alignment from amongst the many, many possible alignments.
Let's just check how my diagrams relate to the way that they're drawn in Holmes and Holmes so that you can relate this to the reading.
This is a picture from Holmes and Holmes.
This is a bit like a spectrogram, but it's actually a sequence of feature vectors (the two look very similar).
In this diagram, this is time in frames, and this is the index of the feature vector, which is the same as which filter it is in the filterbank; that's on a frequency scale, ascending in frequency.
Each column of this picture is one of my feature vectors, so this picture from Holmes and Holmes corresponds exactly to my picture.
That's how Holmes and Holmes draw it.
That's how I'm drawing it.
I'm drawing the vectors out explicitly, so we'll use this diagram from now on.
