Module 6 – speech recognition – pattern matching

The most basic way to recognise speech is by comparing the speech to be recognised with stored reference examples.

Total video to watch in this section: 38 minutes

The properties of speech are constantly changing over time, so we need to analyse it in short sections, called frames.

This relates to earlier material on short-term analysis and introduces the idea of extracting salient features from each frame.

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
So why is a waveform not a good representation for pattern recognition?
Well, we already know that from what we've looked at in the lab: speech production has a source of sound.
It might be periodic.
It might be aperiodic: it might be noise.
It might be the vocal folds.
That source goes through a filter.
And we've already, hopefully, realised that most of the information about the segment is not in the value of the fundamental frequency.
So in many languages the pitch doesn't carry segmental information; in some it does carry some information, but in many it doesn't.
It's mostly in the shape of the vocal tract, so the overall spectral envelope is what we're going to do pattern recognition from,
and the speech waveform is a mixture of that spectral envelope and the fine structure from the source.
We remember what the spectrum of a speech sound looks like.
This is frequency.
That's magnitude.
It has some sort of overall shape, and then it has all this fine structure due to the harmonics.
We can generalise that idea and model unvoiced sounds the same way: they have an overall shape but, rather than harmonic structure, they just have noise structure.
And it's this envelope, this shape, that we're going to do pattern recognition on. If the shape of the spectral envelope of one sound is like the shape of the spectral envelope of another sound, we'll say these sounds belong to the same category.
That's the same phoneme.
We'll do pattern recognition on that basis.
So we're going to extract the spectral envelope.
Now, there are various ways of doing that.
If we were thinking in terms of synthesis, we'd think: right, we've got this fabulous source-filter model, so we'll fit the source-filter model to the signal.
We'll forget what the source is doing and will just look at the filter coefficients, as in speech synthesis.
Those filter coefficients for a particular form of filter might be called linear predictive coefficients.
They're just filter coefficients, and those would indeed be a reasonable candidate for our representation of speech.
We're not going to do that.
We're going to do something a little more direct, and that's gonna be actually simpler.
Computationally cheaper.
That's going to lead us up, eventually, to these other features called Mel Frequency Cepstral Coefficients.
So we already know speech sounds have an envelope.
This spectral envelope evolves over time, so it's never static.
It constantly changes, at some rate, because our tongue and articulators are moving.
But in order to do Fourier analysis, we need to make the assumption that over some short period of time, the statistical properties of speech such as the spectral envelope are constant.
And that leads us to something called frame based analysis.
We've already touched on this, so let's just formalise it: over some short period of time,
perhaps this amount of time, we'll just assume that the vocal tract shape is static.
It doesn't change at all.
So the waveform is essentially perfectly periodic: it just repeats for some short period of time.
A little bit later on, the properties will have slightly changed.
So for some short period of time (we'll call that a frame), we can extract that piece of waveform.
We can perform some analysis on it to extract some features, and those features characterise this bit of speech here, and we'll get this thing here.
This is a vector, so I want you to be comfortable with the notation I'm going to use on the slide.
It's all gonna be pictures, but they represent mathematical things.
This is a vector, and the vector is just some numbers.
Let's write a vector here.
It might be something like this.
Just some vector of numbers.
The numbers that are the properties of this fragment of speech.
I'm going to write these vectors like this:
this picture here. You're going to see lots of those.
The question is, then: what do we extract? What do we stack up within this vector? How many numbers do we need? How much speech do we analyse? How far do we move on between the frames?
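[To make this concrete, here is a minimal sketch of frame-based analysis in Python with NumPy. The 25 ms frame and 10 ms shift are the values used later in this module; the function name and the synthetic waveform are just illustrative choices, not part of the lecture.]

    import numpy as np

    def frames_from_waveform(waveform, sample_rate, frame_ms=25, shift_ms=10):
        """Slice a waveform into overlapping frames: each frame covers
        frame_ms milliseconds, and successive frames start shift_ms
        milliseconds apart. (Illustrative helper, not from the lecture.)"""
        frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
        shift_len = int(sample_rate * shift_ms / 1000)   # samples between frame starts
        n_frames = 1 + (len(waveform) - frame_len) // shift_len
        return np.stack([waveform[i * shift_len : i * shift_len + frame_len]
                         for i in range(n_frames)])

    # Example: one second of a synthetic 120 Hz tone at a 16 kHz sample rate.
    waveform = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)
    print(frames_from_waveform(waveform, 16000).shape)   # (98, 400)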

We will make a first attempt at parameterising each frame, but we'll need to revisit this after learning more about the probabilistic model that will be used.

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
So we have to parameterise
the speech signal.
So in all of automatic speech recognition, we immediately get rid of waveforms and replace them with features.
The features are going to be vectors per frame
a sequence of vectors.
So let's draw some pictures that represent that so we don't have to use too much heavy duty notation.
There's the diagram for the previous slide.
We've got a frame of speech
over some short period of time we assume that the statistical properties are constant, static, for the sort of duration that we can get away with.
That will be something like 25 milliseconds,
and for that little fragment of speech we'll pull out the waveform.
We'll do some analysis.
We don't know what this analysis is yet
this analysis will yield a vector of numbers.
This vector here.
We don't know quite how many numbers to put in this vector just yet.
And then we'll do that at fixed intervals throughout the word.
So we'll move on to the next frame.
We'll get the next vector, and we'll do that all the way through the waveform.
So we'll do that for fragments of speech that are 25 milliseconds in duration, and then we'll space them about 10 milliseconds apart (100 times per second).
we'll take a little snapshot of spectral envelope
and store it in a vector.
We get the sequence of these vectors so that we can draw nice pictures of these things. Are we good with these vectors?
These things here: we're going to have a sequence of vectors.
So each vector corresponds to one frame of speech.
We're going to stack those things into some other data structure, and that's what this thing here is.
So this thing here inside each of these cells is a vector, and the vector is just a set of numbers that characterise the spectral envelope for the speech.
This is just to have some nice compact graphical notation so we can draw pictures of words and show how they're going to be compared to each other.
Okay, we're comfortable with this way of notating things.
So what is going to go into this feature vector? What do we need to know about the speech signal to do pattern recognition?
So as we've already seen in classification and regression trees, machine learning's kind of magic, but it doesn't do everything for us, and specifically, it doesn't tell us what features to extract in the first place.
It might be able to select the most important ones from a big bag of candidate features.
That's what a decision tree will claim to do.
But we, as the engineers, have to think of the features first.
So there's some feature engineering to do
machine learning doesn't solve this problem for us.
We need to use our knowledge of the problem.
So if this wasn't speech recognition,
if it was image recognition or gene sequencing or natural language processing, we'd use completely different features, but we might use the same form of machine learning.
So the feature engineering is where our knowledge of the problem goes.
The machine is going to tell us the parameters of the model.
We've got to choose which model and what features to operate on.
So we need to use our knowledge of speech to decide on these features.
So another feature of machine learning is that it's usually best if the features are compact: only the features that are really useful for recognising the patterns we want to recognise, without loads of useless, noisy features.
Even decision trees, which claim to be able to sort useful features from useless ones, have limited power in that regard,
so if we can engineer the features to be as useful as possible first, we would expect better performance.
We need to capture, obviously, the important information.
That's the spectral envelope: that's sufficient for capturing the speech segment type.
If it was a vowel, getting the first two, maybe the first three, formants would be a pretty good start for identifying it.
We want them to be somehow invariant to things that we don't care about.
So, for example, having two recordings of the same vowel said at different fundamental frequencies.
We'd like the extracted features to be essentially the same in both cases, for English.
For other languages we might want some features that capture the pitch because that might be segmental information.
But for many languages it's not
We'd like a representation that doesn't vary when F0 varies, and the spectral envelope is a good candidate for that: it's essentially independent of F0.
So we want to get rid of the fundamental frequency most of the time. Even in languages where pitch is a phonological feature (such as Chinese languages and lots of other languages around the world, including many Asian languages), we might extract it separately and use it as a separate feature; we would still want the spectral envelope.
We want to get rid of fundamental frequency and, if possible, we'd like to get rid of things like speaker identity, because we'd like a recording of me saying 'one' and a recording of you saying 'one' to be matched against each other, so we can do speaker-independent recognition.
[The idea of extracting the spectral envelope using a filterbank will be fully developed in the next module. For now, just pretend the feature vector is the FFT coefficients themselves, or formants, or whatever you are comfortable with.]
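[Taking the note above at face value, here is a hedged sketch in which the feature vector for each frame really is just the FFT magnitude spectrum of the tapered frame. The random array is only a stand-in for real speech frames, shaped like the output of the framing sketch earlier.]

    import numpy as np

    # Stand-in input: 98 frames of 400 samples each (real code would use
    # actual speech frames from the earlier framing sketch).
    frames = np.random.randn(98, 400)

    def fft_features(frames):
        """One feature vector per frame: the magnitude spectrum of the
        tapered (Hamming-windowed) frame, via the real-input FFT."""
        window = np.hamming(frames.shape[1])   # taper to avoid edge effects
        return np.abs(np.fft.rfft(frames * window, axis=1))

    features = fft_features(frames)
    print(features.shape)   # (98, 201): one 201-number vector per frame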

A key step in parameterising speech is to move from the time domain to a domain in which distances make more sense, and so where we can perform pattern matching.

There is no video here. Revise the following concepts from earlier in the course:

  1. The spectrum, obtained by Fourier transform, is one possibility for a feature vector
  2. A better option would be filterbank features, a bit like the cochlea produces
  3. Yet another option could be the filter coefficients of a source filter model

For now, you can imagine that any of the above are placed in the feature vector for each frame of speech. We’re going to come up with something better later on, but any of the above will be OK for the time being.
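[As a rough illustration of option 2, the sketch below summarises a magnitude spectrum as a handful of band energies. This is only a crude stand-in: a real filterbank, developed properly in the next module, uses mel-spaced, overlapping triangular filters, not the equal-width bands assumed here.]

    import numpy as np

    def band_energies(magnitude_spectrum, n_bands=20):
        """Crudely summarise a magnitude spectrum as n_bands average
        energies in equal-width frequency bands (a stand-in for a
        proper mel-spaced filterbank)."""
        bands = np.array_split(magnitude_spectrum, n_bands)
        return np.array([band.mean() for band in bands])

    # e.g. reduce a 201-point spectrum to a compact 20-number feature vector
    spectrum = np.abs(np.fft.rfft(np.random.randn(400)))
    print(band_energies(spectrum).shape)   # (20,)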

Our first automatic speech recogniser stores an example ("template") of each word. Speech to be recognised is compared against each template.

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
we're going to work our way up
We're going to build these whole word templates
we're going to start with analysing speech in frames
we're going to extract something from individual frames
we're going to build up a sequence of feature vectors for each word that we'd like to recognise and compare it to some stored ones.
So this is the scenario, then, in training the system; there's no statistical model yet.
In training the system, we think of all the words we'd like to recognise, and we record one example of each, and we save it with its label so we know what it is.
Let's pretend they're the digits.
0 1 2 3 ... to 9
record each of them once, store it in a file, put a label next to it.
So we remember what they are called.
These references are often known as templates.
They're just going to be single exemplars.
And then at recognition time, we have a recording of a word
we know where it starts and where it ends, and we're going to match it against each of the references in turn;
we're going to measure the distance to that reference, and then we're going to look at all the distances.
Pick the smallest one and announce that label as the label for the unknown word.
So it's an extremely simple form of pattern matching: it's not statistical.
It's just based on exemplars.
So this, finding the closest match between an unknown thing and various known things, is the key process.
So we're going to need a measure of distance between one recorded word and another recorded word, and it's these features that we're going to use to measure this distance, this difference.
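[A hedged sketch of that recognition-time loop, assuming the templates are stored in a dictionary from label to feature-vector sequence, and that some sequence-distance function (defined in the sketches that follow) is available; all names here are illustrative, not from the lecture.]

    def recognise(unknown, templates, distance):
        """Return the label of the stored template closest to the unknown word.

        templates: dict mapping a label (e.g. "one") to that word's stored
        sequence of feature vectors; distance: any function that returns a
        distance between two such sequences."""
        return min(templates, key=lambda label: distance(unknown, templates[label]))

    # usage: recognise(unknown_features, {"one": ..., "two": ...}, some_distance)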

The simplest way to deal with variable duration is to stretch the unknown word to have the same duration as the template.

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
So, with that in mind: as our feature vector,
this thing here,
we're going to just use the fast Fourier transform to extract it for now, from 25 milliseconds of speech, and get a vector of some hundreds or thousands of numbers.
Don't worry too much.
We do that every 10 milliseconds.
So in this diagram now, this is time.
We're going to do it for one word;
we're going to do it for another word.
And then we're going to get to the crux of the matter.
How do we tell how far apart these two things are? Are they the same word? Are they different words? Are they different recordings, but recordings of the same token... sorry, the same TYPE, or not?
Okay, so this idea here of distance is the crucial concept we're going to get to in today's lecture.
Later on, we'll replace this idea of distance with a statistical idea.
The idea of probability or likelihood.
How likely is it that one word is the same as another word?
But today we're taking a non-statistical, non-probabilistic view.
How far apart are they? Think of it as a difference or distance;
the ideas are going to lead naturally, one into the other.
Let's imagine we've got these two recordings of words.
We've got this recording here.
Let's just play one like that
we're going to measure the difference between these two recordings of words.
'One'. Sorry, this one.
'One': that's me saying 'one',
and I'd like to measure the difference between that and this other recording of me saying this word: 'one'.
So we could immediately see there's going to be a bit of a problem here because one of them is longer than the other.
So the first problem is they don't line up properly with each other.
So in general, speech has variable duration.
That duration might not be very informative about which particular word it was.
These are both perfectly reasonable examples of the word 'one'; one is just longer than the other.
So the first thing our pattern matching technique is going to have to deal with is that the exemplar (the stored pattern) and the thing we're trying to recognise are not the same duration.
We have to line them up with each other.
So let's just do that in a really dumb way.
Let's just stretch one of them
So we have a template
We know its label because we recorded it, especially for the purpose of building our system.
It's a template of the word one.
Just remember, each of these little boxes here is really a vector.
It's a snapshot of the spectral envelope.
If you're not comfortable with this idea, just think of it as the formants at that moment in time, and that everything's a vowel.
We've got an unknown word that somebody has spoken into our recogniser, and our job is to say: is it a 'one' or not? That's our job.
Of course, if this recogniser is going to be any use, it doesn't just say 'Is it a one or not?' It's got templates for all the words we might expect.
So we had to plan ahead, when we built the system, for what words we might expect.
We've got templates for all of them and we're comparing against all of them.
We're going to do that one at a time, and just remember the distances, and then pick the one with the shortest distance.
So we're down to this core problem here of the distance between the template and the unknown thing, and the first thing we're going to worry about is the fact that they're not the same duration. What that means is that, if the frame rate is fixed at every 10 milliseconds,
so this time interval here is 10 milliseconds,
and that's time,
there's a different number of frames in one of them than in the other one.
So it's not at all obvious which frame to compare to which frame.
So we deal with that problem first.
So let's do the simplest thing we can think of.
And that's just to stretch one of them.
We stretch it, not in the waveform domain (as we did there) but in the feature domain, in the sequence of frames, and we'll just line up the frames in this way.
So our unknown one here is a bit shorter.
Just stretch it and then we'll just approximately line them up and we'll say that the beginning of this word here, we're going to compare it to the beginning of the word here.
That seems reasonable.
So, to see if one word is the same as another word, we'll compare the beginning and the middle and the end;
in fact, we'll just compare every frame.
And then this frame here, we'll compare it to this frame here.
This frame here will also be compared to this frame here, so we'll do that.
This one we'll compare to this one.
These two.
These two; probably these ones; this one and this one.
So we make some approximate alignment between them.
We've just done that by linearly stretching the shorter one to be as long as the longer one, or linearly stretching the unknown so that it fits the template.
So if it's too long, we squash it.
If it's too short, we stretch it.
We'll just make some approximate alignment between the frames.
Let's imagine that's good enough.
Let's imagine speech just linearly stretches.
If I say one and one, all the sounds just stretch the same amount.
Let's pretend that that's okay. You can see the method here: we're just going to pretend things and keep it really simple,
work out why that didn't work, and then make things a little more complicated in small steps.
So we'll go with linear stretching, linear time warping, as a way of lining up quick repetitions of words with slow repetitions of words.
And remember, it might be the case that the unknown and the template are different.
Then in general, they are going to be different durations.
So we've lined up the two things of different durations.
And now what we're going to do, we're going to measure all these local differences.
This frame with this frame, this one with this one, this one with this one, this one with this one, this one and this one;
make these correspondences.
So in our real example here, what we're saying is that, once we've linearly stretched things, we're just going to compare this 25 milliseconds of speech (about this much) with the bit that just lines up with it here.
This bit with this bit, this bit with this bit, and so on.
We're going to assume that that's the [w].
And that's also the [w].
And after stretching, they just end in the same place.
That's the vowel.
And that's the nasal.
Likewise here we're going to compare short bits of speech here with short bits of speech here
Rather than stretching in the time domain, which is an expensive operation, we extract features and then just stretch the frames of the features,
so they line up with each other.
It seems reasonable.
I mean, it would not be reasonable to compare this bit of the word with this bit of the word: that doesn't make any sense at all.
It does make sense to compare the beginning with the beginning, the middle with the middle, and the end with the end.
And if they're all similar, then we're going to say the two words are similar and therefore this is likely to be two repetitions of the same type.
The same word
so we're going to do our pattern matching then
we're going to find the global distance between the unknown thing and the template as the sum of all the local differences.
Are the beginnings the same? Are the middles the same? Are the ends the same?
We break the big problem down into all these small problems.
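[Here is a minimal sketch of that idea: linearly stretch the unknown onto the template's time axis, then sum the local distances between paired frames. The index formula and function name are my own; the local distance is the Euclidean distance discussed next.]

    import numpy as np

    def linear_stretch_distance(unknown, template):
        """Global distance between two sequences of feature vectors, after
        linearly stretching the unknown onto the template's time axis.

        Frame i of the template is paired with the proportionally
        corresponding frame of the unknown, and the global distance is
        the sum of the local (Euclidean) distances over those pairs."""
        n, m = len(template), len(unknown)
        total = 0.0
        for i in range(n):
            # linear alignment of frame indices (guard against n == 1)
            j = round(i * (m - 1) / max(n - 1, 1))
            total += np.linalg.norm(template[i] - unknown[j])   # local distance
        return total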
So now the question is this thing here, remember this is a vector of numbers, and this one here is a vector of numbers.
These are of the same length because they are the same type of thing - they're both the fast Fourier transform of 25 ms of speech, or the set of formants or whatever else you want to think of them as
How do we say how far apart these two things are? What's the distance between two vectors?
An easy way to think about vectors is in geometry.
So let's pretend these vectors have just got two elements.
I'm just gonna pretend they're two, because my writing surface here has got two dimensions.
I can't draw in three dimensions.
So we just think of them, you know, X and Y
It will generalise: anything we can do in two dimensions we can do in any number of dimensions.
We just find it more and more difficult to draw them.
So we'll pretend that those features we've extracted have two dimensions (first formant and second formant, for example).
And so a really simple distance is just to think in geometry.
So here's X
In any vector, it's got some X value and some Y value
this vector is just a point in this space.
Are we happy with this idea? Two numbers just being a point in space?
So we go the X amount, the Y amount; so there's one point, and there's the point we're trying to compare it to.
So that's this one.
And we've drawn them as two points in the vector space.
We're just going to compute the geometrical distance between them.
We could do it on paper: we could plot them on paper and get the ruler out.
Measure it.
How many millimetres apart are they? That's going to be our distance.
That's easy to automate.
There's a simple equation for that, because it's just this nice right-angle triangle.
We know that the sum of the squares on the two sides is equal to the square on the hypotenuse, right - happy with this level of geometry?
So the distance between them: we take the sum of the squares of these differences,
then take the square root, and get some distance.
So there's a distance between two vectors
and this equation happily generalises to 3 dimensions or 39 dimensions or 1000 dimensions.
It doesn't matter.
Just expand the equation: instead of X and Y,
we've got X, Y, Z... any number of coefficients.
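[In symbols: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ...), over however many dimensions the vectors have. A one-function sketch of that, with a classic 3-4-5 triangle as a check:]

    import numpy as np

    def euclidean_distance(x, y):
        """Geometrical distance between two feature vectors; the same
        formula works in 2, 3, 39, or any number of dimensions."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sqrt(np.sum((x - y) ** 2))

    print(euclidean_distance([3, 0], [0, 4]))   # 5.0: the classic 3-4-5 triangle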
Okay, we've actually built a speech recognition system.
It's not going to be very good, but it's going to work.
Let's just recap how it's going to go.
We're going to record reference words.
We're going to divide them into frames of 25 ms duration, and the shift is going to be 10 milliseconds.
Now, that means they're going to overlap, and the reason for that is that in the signal processing (as you've hopefully realised by watching the videos), when we extract a frame of speech like this, the first thing we do, before any signal processing such as the Fourier transform, is apply a tapered window to fade it in and fade it out, to avoid edge effects.
If you don't know why, that means you haven't watched the video!
And that means we're going to lose the information right at the edges, as it's going to be faded away.
We're going to have lots of overlapping windows of speech; they're going to be spaced 10 milliseconds apart and each have a duration of 25 milliseconds.
We'll do that for the templates: we'll get a sequence of vectors, and that will be a template.
We'll store it with a label attached.
We'll do that for all of the words we're expecting to recognise - so maybe the 10 digits
we'll then have our unknown word and we'll do exactly the same procedure
we'll divide it into frames, we'll extract features and stack them up.
We'll then realise that, in general, they're not of the same duration.
So we're going to then apply linear stretching,
this linear stretching, to make them the same duration.
we're then going to add up all the local distances.
All these local distances between pairs of frames, and that local distance is just going to be this thing D.
And D is just this very simple geometrical distance. And that is a complete speech recognition system.
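[Putting the recap together, here is a hedged sketch of that complete, deliberately naive recogniser, reusing the helper functions from the earlier sketches; those names are my own inventions, not standard APIs.]

    import numpy as np

    # Assumes frames_from_waveform, fft_features and linear_stretch_distance
    # from the earlier sketches are in scope.

    def parameterise(waveform, sample_rate):
        """Waveform -> sequence of feature vectors: 25 ms Hamming-windowed
        frames every 10 ms, each reduced to an FFT magnitude spectrum."""
        return fft_features(frames_from_waveform(waveform, sample_rate))

    def recognise_word(unknown_waveform, template_waveforms, sample_rate=16000):
        """Label an unknown recording as the vocabulary word whose template
        gives the smallest linearly-stretched total distance."""
        templates = {label: parameterise(w, sample_rate)
                     for label, w in template_waveforms.items()}
        unknown = parameterise(unknown_waveform, sample_rate)
        return min(templates,
                   key=lambda label: linear_stretch_distance(unknown, templates[label]))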

This is just a taster of what we'll cover in the main lecture, but do watch it so that you start to understand some of the key concepts.

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
So we've got a distance measure between one frame and another frame.
We've got an incredibly naive way of deciding which frames to compare with each other, and we can do speech recognition.
We can already do it.
We'll take this feature vector.
This feature vector
we'll compute the Euclidean distance between these things, and then we'll do that for all of the pairs that line up, sum them up and get some overall distance measure
which is the total distance between a sequence and another sequence.
We made an enormous number of very bad assumptions in doing that!
Almost everything we could think of is wrong with the system, so we're going to spend the rest of the course fixing that, bit by bit.
But: this will do speech recognition.
It will work to some extent, but it won't be very accurate.
So let's start thinking about what's wrong with the way that we did that
By fixing them we'll eventually end up inventing Hidden Markov Models (HMMs).
First: speech does not stretch linearly.
Stops, plosives, sounds like that hardly stretch at all.
Vowels are very stretchy.
There could be a much wider variance in duration.
So when we compare two things of a different duration, we probably want to stretch the 'stretchy' parts more and the less 'stretchy' parts less.
Technically speaking!
We don't really want to do that through some sort of knowledge-based procedure, which wouldn't be possible anyway, because we don't know what the sounds are in the unknown: we're trying to recognise the thing.
So we can't say, "Oh, we're trying to recognise this vowel. Let's stretch it a lot."
We don't know it's a vowel yet.
We have to automatically find out how to stretch either the template or the unknown, so they align with each other:
to have something that's automatically doing the stretching.
And that's the dynamic part of the time warping.
It's dynamically changing the stretch factor at different points in the template and the unknown word.
So we're going to have to do something that's not linear.
That's different at different points in the word
And this local distance measure?
Well, it's just geometry!
What's that got to do with anything?
In the nine o'clock lecture we saw that it actually does have something to do with something.
And if we take this geometrical measure and start fixing up some of the problems with it, before we know it: we've also invented probability density functions and specifically the Gaussian distribution, which is something very close to the Euclidean distance measure, in some normalised space.
So this Euclidean distance is not stupid or crazy.
It's not going to be ideal yet
So we're going to deal with this problem first, which is the problem of aligning two sequences of different lengths in some optimal way, automatically.
So there's the statement of the problem
Just to remind you (hopefully this notation is reasonably transparent):
inside each of these is a little feature vector, okay?
And this thing here is a sequence of feature vectors,
and that's time in frames - are we happy with this notation - not too confusing?
things are horizontal, things are vertical - we've got to draw some pictures.
So there's some alignment between these things.
Imagine these are the same word (because we're optimists!)
We're going to try and align them the best we can
Imagine that's the template for 'one' and this unknown word happens to be a recording of the word 'one'
so we would like to align all the [w] bits with each other.
And all the schwa bits in the middle with each other, and the [n] bits with each other, somehow, like this.
So we want to align the sounds that are most similar in the two words.
I can't think of anything that's better than that.
We wouldn't want to align the wrong sound with some other sound.
We wouldn't want to do anything like time-reversing it.
That would be ridiculous.
So the alignment has to have certain constraints, and the constraints are going to be that similar sounds in the unknown align with the most similar thing we can find in the template, in some sense, with the constraint that we can't time reverse.
So we can't do this.
We can't swap frames around.
That makes no sense at all.
Speech doesn't work like that.
When we speak faster, we don't start swapping sounds around in words.
So we have to come up with some definition of the correct alignment or the best alignment
We'll first do that.
And then we're going to find the most efficient algorithm we can, to find that alignment quickly.
So the question is "How can one thing compare to two things?"
So we've got this here where one thing aligns with two things
that will become clear hopefully when we get on to the time alignment.
Okay, so we're going to define what the best alignment is.
Okay, we could say something like: every frame in the unknown has to align with at least one frame of the template
and each frame of the template has to align with at least one frame in the unknown
and therefore, if one's longer than the other, at some point two things in one of them must align with one thing in the other.
We'll see a diagram that will hopefully make this clear.
But if we align the two, all we're saying is that the distance is...
so there's a frame, there's a vector, here
there's a vector here.
We're taking the Euclidean distance between this and this.
Let's call that D_1
we're just going to add it to the Euclidean distance between this and this.
So they're just added up in our sequence of local distances.
Now, since we're optimists, and since we hope that at some point we will align the unknown word with the correct template (hopefully it's in our vocabulary),
we'll align it with the right one and get a small distance
we can define the best alignment as the one that minimises this sum of local distances.
So these are local distances, and we're going to add them all up, to come up with some total distance measure.
We're going to define the best alignment as the one that makes this total distance as small as possible.
What else would you suggest?
To make it as large as possible would be ridiculous!
We'd by definition come up with ridiculous alignments like this one.
So the only sensible thing I can possibly think of that defines the best way to align a template with an unknown is to find the correspondences such that - when you add up all the local differences (distances) - the sum is the smallest possible.
It's as closely-matched as possible
that will make similar sounds in one tend to line up with similar sounds in the other because those will give small distances and therefore the sum of distances will be small.
So you just define the best (or the correct) alignment as the one that gives the smallest total distance: that's the sum of local distances between the two things.
Now we've got to find it.
When we've found it, we can just add up the local distances.
So that's what the rest of the lecture is going to be about: how can we efficiently find the best alignment between these two things?
And there are many, many possible alignments.
Let's draw some of them out.
There's one on the slide there.
Let's draw some other ones.
Here's one
okay and you could go home and write out all of the other ones you could think of.
And you'll see there's an extremely large number of ways of aligning something with N frames to something with M frames: some combinatorially huge number.
And the longer the sequences get, the worse that gets.
If you've got to align something that's 100 frames long with something 110 frames long: don't do that as homework, because you'll still be doing it in December!
So there's a very, very large number of ways of aligning these two sequences, and it gets worse the longer they get.
And so we need to do that efficiently: computationally efficiently.
So we're looking amongst all the different ways of aligning these two sequences for the best one.
We're looking; we're searching.
and this is a key concept in speech recognition, at recognition time.
Searching amongst all the different possibilities for the one, in this case, with the lowest distance.
Eventually we'll be searching for the one with the highest probability when we get to HMMs, but it's exactly the same problem.
And this dynamic programming is going to be an efficient way of searching amongst these many, many possibilities without doing it the dumb way, but with getting exactly the same answer.
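[We'll complete the algorithm properly in class, but as a preview, here is a hedged sketch of a standard dynamic-programming recurrence for this search. The particular set of allowed moves (diagonal, or stretching either sequence by one frame) is one common choice; the in-class version may differ in detail.]

    import numpy as np

    def dtw_distance(template, unknown):
        """Smallest total distance over all monotonic alignments of two
        feature-vector sequences, with the first frames aligned and the
        last frames aligned, computed by dynamic programming."""
        n, m = len(template), len(unknown)
        D = np.full((n, m), np.inf)   # best total distance ending at pair (i, j)
        for i in range(n):
            for j in range(m):
                local = np.linalg.norm(template[i] - unknown[j])
                if i == 0 and j == 0:
                    D[i, j] = local
                else:
                    D[i, j] = local + min(
                        D[i - 1, j] if i > 0 else np.inf,   # unknown frame aligns with two template frames
                        D[i, j - 1] if j > 0 else np.inf,   # template frame aligns with two unknown frames
                        D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
                    )
        return D[n - 1, m - 1]

[The table D has only N x M cells, so this takes time proportional to N x M, rather than the combinatorially huge number of possible alignments, while giving exactly the same answer as exhaustive search.]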
So before we go on, I'm going to define some terms because I'm probably going to be a bit loose with some terminology.
So let's just define what we mean by some of these terms.
You'll see different terms in the literature being used.
If we've got two vectors (two feature vectors) and we compare them
When we're talking about Euclidean distance, we might call this thing a distance, and it has a geometrical interpretation.
It's the distance.
If we plotted the vectors on the diagram, took a ruler and measured the distance between the two points, that's the distance: we will get the Euclidean distance, so I call it distance.
Small distances mean things are similar to each other, and the distance of 0 means they're identical to each other.
We have plotted our things on some vector plot.
When we had two vectors like this, they'd have a very small distance.
And as it goes to 0 they end up being identical
Very often, in the literature, we might refer to that in some other way.
We might call it a cost: it behaves in the same way as distance, big distances mean things are very dissimilar - it's bad: things don't match very well.
We could say that's 'high cost' or it's 'expensive'.
So these two terms, 'distance' and 'cost' are sort of interchangeable.
High cost / high distance: things very dissimilar, things are very far apart.
When we get on to probabilistic modelling, we'll switch our terminology.
We'll talk about things like probabilities and likelihoods.
They behave the other way around.
High probabilities mean things are similar to each other, means a data point is likely to have come from a particular distribution.
It's very close to the mean of that distribution.
So although both of these terms are fine, bigger is better in probability, and smaller is better in distance and cost.
So we're going to start with these ones, and later we're going to change.
So I'll try to stick to using the word 'distance'.
I might occasionally accidentally say 'cost', but they're going to mean the same thing.
Let's write down the simplest, most stupid algorithm that will work: it will give us the right answer but be very expensive to compute.
It will take a lot of operations: it will be slow
and then we'll get some insight into that and see that we could have computed exactly the same thing with no approximations in a much more efficient way.
So let's draw this diagram.
You've worked out by now that I like pictures, and as long as the notation's not confusing, the picture will help you understand.
But if the notation's confusing, of course, it will be worse than not having a picture.
So let's make sure we're all happy with the notation we're going to use here.
We're going to align two things: a template (I've drawn it along this axis of this square) and the template has got 9 frames.
Let's just spell out what the template really looks like
it's got feature vectors in sequence like this, and there are 9 frames
and that's time, but I've drawn it this way on the diagram
Along the other axis I've drawn the unknown word, and that's got 7 frames.
So that looks like this.
And there are 7 frames in that unknown
The question is: find all the different ways of lining these things up with each other, and from amongst all of them, find the one that, when you add up each of these local distances and get your big distance, is the smallest possible,
with the constraint that you need to do things in order.
So you need to go through this one in order and through this one in order, and you can't do any time reversal or anything like that.
And the other constraint - let's put another simple constraint in - the first frame has to align with the first frame, and the last frame has to align with the last frame.
We have to do that, otherwise we'll miss those frames out.
We gotta account for all the frames.
We've got these little constraints.
So that's what this diagram represents.
Each of the points in this diagram (take this one here) represents an alignment between one frame in the template and one particular frame in the unknown;
it's one particular pairing of frames.
It says that this point here represents the alignment of frame 6 of the unknown with frame 3 of the template.
Let's just draw it out longhand.
There's 9 frames (badly drawn) of the template.
Here's 7 frames of the unknown, and this point here says: frame 6 of the unknown
(1, 2, 3, 4, 5, 6) lines up with frame 3 (1, 2, 3) of the template.
It says these two things line up.
[WE'LL COMPLETE THE ALGORITHM IN CLASS...]

To be continued in the lecture…!

Here is a link to a recording of the class.

Find the recording in the General channel on Teams, or via this link.

Reading

Holmes & Holmes – Chapter 8 – Template matching and dynamic time warping

Read up to the end of 8.5 carefully. Try to read 8.6 as part of Module 7, but rest assured we will go over the concept of dynamic programming again in Module 9. We recommend you should skim 8.7 and 8.8 because the same general concepts carry forward into Hidden Markov Models (again, we'll come back to this in Module 9). You don't need to read 8.9 onwards. Methods like DTW are rarely used now in state of the art systems, but are a good way to start understanding some core ideas.

Jurafsky & Martin – Chapter 9 introduction

The difficulty of ASR depends on factors including vocabulary size, within- and across-speaker variability (including speaking style), and channel and environmental noise.

Jurafsky & Martin – Section 9.1 – Speech Recognition Architecture

Most modern methods of ASR can be described as a combination of two models: the acoustic model, and the language model. They are combined simply by multiplying probabilities.

This is a SKILLS tutorial about mathematics.

Work through the foundation material on mathematics and probability. If you know everything there already, that’s great – you could skip this tutorial.

The focus of this tutorial is to get all students up to the minimum level required for the remainder of the Speech Processing course. It will focus on the basics, so don’t worry if you haven’t “done” maths for a while: this tutorial is for you!