Whole word templates

Our first automatic speech recogniser stores an example ("template") of each word. Speech to be recognised is compared against each template.


This video just has a plain transcript, not time-aligned to the video.
THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS
We're going to work our way up.
We're going to build these whole word templates.
We're going to start with analysing speech in frames.
We're going to extract something from individual frames.
We're going to build up a sequence of feature vectors for each word that we'd like to recognise and compare it to some stored ones.
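To make "analysing speech in frames" concrete, here is a minimal Python sketch of turning a waveform into a sequence of feature vectors. Everything named here is an assumption for illustration only: the function names are hypothetical, the frame length and shift (400 and 160 samples, roughly 25 ms and 10 ms at a 16 kHz sampling rate) are common choices rather than anything stated in the video, and log energy is just a stand-in for whatever is actually extracted from each frame.

```python
import math

def split_into_frames(samples, frame_length, frame_shift):
    """Cut a waveform (a list of floats) into overlapping, fixed-length frames."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames

def log_energy(frame):
    """Stand-in feature: a one-dimensional vector holding the frame's log energy."""
    energy = sum(s * s for s in frame)
    return [math.log(energy + 1e-10)]  # small constant avoids log(0) on silence

def extract_features(samples, frame_length=400, frame_shift=160):
    """Return one feature vector per frame, in time order."""
    frames = split_into_frames(samples, frame_length, frame_shift)
    return [log_energy(frame) for frame in frames]
```

The point is only the shape of the result: one word becomes an ordered list of feature vectors, and a longer recording of the same word gives a longer list.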
So this is the scenario then: there's no statistical model yet.
In training the system, we think of all the words we'd like to recognise, and we record one example of each and save it with its label so we know what it is.
Let's pretend they're the digits.
0 1 2 3 ... to 9
Record each of them once, store it in a file, put a label next to it,
so we remember what they are called.
These references are often known as templates.
They're just going to be single exemplars.
And then at recognition time, we have a recording of a word.
We know where it starts and where it ends. We're going to match it against each of the references in turn.
We're going to measure the distance to each reference, and then we're going to look at all the distances,
pick the smallest one, and announce that label as the label for the unknown word.
So it's an extremely simple form of pattern matching. It's not statistical.
It's just based on exemplars.
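The recognition procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `recognise`, `placeholder_distance` and `frame_distance` are hypothetical, and the distance between two words is a deliberately crude stand-in (sample both sequences at the same relative positions and average the Euclidean distances), not the measure the video goes on to use.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def placeholder_distance(seq_a, seq_b, num_points=20):
    """Stand-in distance between two feature sequences of different lengths.

    Assumption: compare the sequences at the same relative positions
    (a linear stretch in time) and average the frame distances.
    """
    total = 0.0
    for k in range(num_points):
        pos = k / (num_points - 1)            # relative position from 0 to 1
        i = round(pos * (len(seq_a) - 1))     # nearest frame in seq_a
        j = round(pos * (len(seq_b) - 1))     # nearest frame in seq_b
        total += frame_distance(seq_a[i], seq_b[j])
    return total / num_points

def recognise(unknown, templates, distance=placeholder_distance):
    """Return the label of the stored template closest to the unknown word."""
    distances = {label: distance(unknown, feats)
                 for label, feats in templates.items()}
    return min(distances, key=distances.get)

# One stored exemplar per word, each a sequence of (toy) feature vectors.
templates = {
    "one": [[0.0], [1.0], [2.0], [1.0]],
    "two": [[9.0], [9.0], [8.0], [8.0], [10.0]],
}
unknown = [[0.1], [0.9], [1.1], [2.1], [1.8], [0.9]]
print(recognise(unknown, templates))  # prints: one
```

Passing the distance in as a parameter keeps the pick-the-smallest logic separate from the choice of distance measure, so a better measure can be swapped in without touching `recognise`.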
So finding the closest match between an unknown thing and various known things is the key process.
So we're going to need a measure of distance between one recorded word and another recorded word, and it's these features that we're going to use to measure this distance, this difference.