Feature vectors, sequences, and sequences of feature vectors

Representing speech as a sequence of feature vectors

This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.
So what do we mean by a feature?
Well, one possible set of features for speech recognition would be the raw waveform samples themselves, but we have already dismissed that as a representation that's not particularly useful.
It includes phase and other things that we're not interested in.
An alternative would be what we're seeing here, which is to discard phase and just look at the magnitude spectrum.
That's already better than the waveform because it's invariant to phase.
Remember what we're seeing on this picture.
This is the output of the Discrete Fourier Transform.
There are bins, here, up to the Nyquist frequency.
That set of numbers could be a set of features for doing automatic speech recognition: the raw DFT features, but just the magnitudes and not the phases.
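To make that concrete, here is a minimal sketch in Python with NumPy of how the magnitude spectrum of one analysis frame might be computed; the 16 kHz sample rate, the 25 ms frame length, and the synthetic 440 Hz tone standing in for a real speech waveform are all illustrative assumptions.

```python
import numpy as np

sample_rate = 16000                              # assumed sample rate (samples per second)
frame_length = 400                               # 25 ms of samples at 16 kHz

# A synthetic 440 Hz tone stands in for a real speech waveform here.
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440 * t)

frame = waveform[0:frame_length]                 # cut out one analysis frame
windowed = frame * np.hamming(frame_length)      # apply a tapered window
spectrum = np.fft.rfft(windowed)                 # DFT bins from 0 Hz up to the Nyquist frequency
magnitude = np.abs(spectrum)                     # keep the magnitudes, discard the phases
```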
But we also know that this includes information that we're not interested in (in general) for doing speech recognition.
We can see here that we can resolve all the harmonics of F0.
In general, F0 is not a useful feature for speech recognition.
What we would like would be something that's this envelope.
So an even more useful feature than a DFT bin would be the output of one of this bank of filters.
So I've just drawn the filters.
Each filter rejects everything outside its pass band, then passes its band with some response shape: here it's triangular.
What is the output of this filter?
If we put the speech signal through this filter, it would pass all of these frequencies and reject everything else.
What we're interested in is summing all of the energy in that band, summarising that as a single number for a particular analysis frame.
It is the amount of energy in this frequency range and in that analysis frame.
That's what we mean by the output of a filter.
So each filter's output will be a useful feature for doing automatic speech recognition - even more useful than the DFT bins themselves.
So we've got a bank of filters, and each filter, for each analysis frame, summarises the amount of energy in its pass band and writes it out as a number.
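Continuing the sketch above, the output of one such filter could be computed like this; the 300–500 Hz band edges are made up purely for illustration.

```python
bin_freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)   # frequency of each DFT bin

def triangular_filter(freqs, lower, centre, upper):
    """Triangular response: zero at the band edges, rising to one at the centre frequency."""
    rising = (freqs - lower) / (centre - lower)
    falling = (upper - freqs) / (upper - centre)
    return np.maximum(0.0, np.minimum(rising, falling))

response = triangular_filter(bin_freqs, lower=300.0, centre=400.0, upper=500.0)

# One number: the energy in this band, for this analysis frame
# (some systems sum the squared magnitudes instead).
filter_output = np.sum(response * magnitude)
```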
What we're going to do now is just store those numbers into a vector.
So here's a speech signal: it's in the time domain.
It's always going to be easier to think about everything in the frequency domain.
So let's draw the magnitude spectrum.
What's this? this is the magnitude spectrum for some analysis frame, so we've taken some section of speech signal, cut it out, we've probably applied a tapered window, and we've taken the DFT to get the magnitude spectrum.
So that's frequency.
We're going to extract from this magnitude spectrum a set of features.
The features are going to be the amount of energy falling inside each of the pass bands of a bank of overlapping triangular filters:
some number of filters that we have to choose, but it's typically going to be something like 20 to 30.
We're going to store each of those in a vector.
The first one is going to produce some output.
We'll store that in the first element of the vector, and then the next filter will summarise its energy into some output and we store it in the second element of the vector, and so on...
Each vector stores the features for a single analysis frame: this analysis frame here.
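Continuing the same sketch, the feature vector for this one analysis frame could be assembled like this; the short list of filter bands is hypothetical, just to show the idea.

```python
# Hypothetical (lower, centre, upper) frequencies in Hz; a real system would use
# something like 20 to 30 overlapping bands covering frequencies up to Nyquist.
filter_bands = [(0, 250, 500), (250, 500, 750), (500, 750, 1000), (750, 1000, 1250)]

feature_vector = np.array([
    np.sum(triangular_filter(bin_freqs, lower, centre, upper) * magnitude)
    for (lower, centre, upper) in filter_bands
])
# feature_vector[0] is the first filter's output, feature_vector[1] the second, and so on.
```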
Now, speech changes its properties over time, and so we analyse it with a sequence of such analysis frames.
So let's just talk a little more generally about this idea of sequences, because it's going to be the core problem we have to solve eventually.
Sequences are everywhere in language.
A waveform is a sequence of samples.
We can analyse that sequence of samples with a sequence of overlapping analysis frames
There is some correspondence between a frame and the samples that fall inside it.
Back in speech synthesis, we could think about a sentence being a sequence of words, a spoken word being a sequence of phones, or a written word being a sequence of letters.
Now we've gone from a waveform being a sequence of samples to being a sequence of overlapping analysis frames
and each frame giving rise to a feature vector that's extracted from it.
So a waveform has become a sequence of feature vectors.
Now, what do all these sequences have in common?
Well, they're all of variable length.
There's some alignment between sequences at one level and sequences at another.
For example, phones aligned with words.
But it's not trivial because the number of phones per word is variable.
Or we might also want to align sequences of the same type.
If we're building a spelling correction system, we might need to align two letter sequences: the sequence of letters somebody typed and the sequences of letters of all the words in our dictionary and compute something called the Minimum Edit Distance (or the Levenshtein Distance) to find the word that's closest to what they typed and correct it.
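As a sketch, here is the standard dynamic-programming formulation of the Minimum Edit Distance, with substitutions, insertions, and deletions all costing 1; a real spelling corrector might weight these differently.

```python
def minimum_edit_distance(source, target):
    """Levenshtein distance between two sequences (of letters, words, ...)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                     # deleting everything from the source
    for j in range(cols):
        d[0][j] = j                     # inserting everything into an empty source
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = d[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[-1][-1]

minimum_edit_distance("recieve", "receive")   # 2: aligning two letter sequences
```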
We might have the output of a speech recognition system, which gets some words right and some words wrong.
We'd like to compute its accuracy, or its word error rate (WER)
That's something we'll need to do later in the course.
And that will involve aligning two word sequences, which are potentially of different lengths and have a nontrivial alignment.
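A sketch of that computation, reusing the minimum_edit_distance function above on sequences of words rather than letters; the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between the two word sequences, divided by the length of the reference."""
    reference_words = reference.split()
    errors = minimum_edit_distance(reference_words, hypothesis.split())
    return errors / len(reference_words)

word_error_rate("the cat sat on the mat", "the cat sat on a mat")   # 1 error / 6 words ≈ 0.17
```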
What we're going to do here and now is align two speech signals, represented by sequences of feature vectors.
We're going to measure how similar they are.
We're going to use that to do pattern matching and therefore to label unlabeled sequences with the label of the closest stored pattern that we already have.
In spoken language processing, most of the time, the alignments between these sequences are monotonic.
That is, as we advance through, for example, a sequence of words, we also advance through the sequence of phones that correspond to that sequence of words in a left-to-right fashion.
We don't backtrack; there's no reordering.
But the alignment isn't linear.
We don't just move proportionally through the two sequences.
It's nonlinear; it's dynamic.
For example, the number of phones in a word varies word by word, and this dynamic alignment is also going to be true when we're aligning two sequences of feature vectors.
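To make that concrete, here is a minimal dynamic-programming sketch of such a monotonic, nonlinear alignment between two sequences of feature vectors; the Euclidean local distance and the particular set of allowed moves are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def alignment_distance(seq_a, seq_b):
    """Total distance of the best monotonic, nonlinear alignment between two
    sequences of feature vectors, found by dynamic programming."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            local = np.linalg.norm(seq_a[i] - seq_b[j])   # distance between two feature vectors
            if i == 0 and j == 0:
                cost[i, j] = local
                continue
            best_previous = min(
                cost[i - 1, j] if i > 0 else np.inf,                 # advance through seq_a only
                cost[i, j - 1] if j > 0 else np.inf,                 # advance through seq_b only
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance through both
            )
            cost[i, j] = local + best_previous   # never move backwards: the alignment is monotonic
    return cost[-1, -1]
```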
So let's get back on track after talking rather abstractly about sequences.
What we really care about here are sequences of feature vectors, so let's see how that works.
We've already seen that from a single analysis frame we extracted a magnitude spectrum.
We put a bank of triangular filters on it, write the outputs of those filters into a feature vector, and store it.
That gives us one feature vector for one analysis frame,
and then we'll just move the analysis frame forward and repeat, move it forward and repeat, in the usual way.
I'm going to start drawing my vectors vertically, because I'm going to have a sequence of them and it will be clearer,
so we'll have a sequence of analysis frames giving rise to a sequence of feature vectors.
For this analysis frame, we go to the magnitude spectrum, do some filterbank analysis, and arrive at a feature vector.
The next one, next one, and so on...
This is the first step in almost every automatic speech recognition system that's ever been built:
to get away from the waveform and extract from it salient features, which are at a frame rate.
So they occur every 10 ms, for example, instead of 16,000 times per second as in the time domain.
So: a much lower rate in time, and far fewer of them.
And they have a relatively low dimension.
So we've done the first and most important step in every automatic speech recognition system.
Get rid of the inconvenient waveform!
Distill it down into a sequence of feature vectors.
And that is the representation we're going to work with from now on.
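To tie the pieces together, here is a sketch of that whole pipeline, reusing the triangular_filter function from the earlier sketch; the 25 ms frames, 10 ms shift, 24 filters, and the linear spacing of the filter bands are all illustrative choices (a real front end would typically space the filters on a mel scale).

```python
def filterbank_features(waveform, sample_rate=16000, frame_length=400, frame_shift=160,
                        num_filters=24):
    """Overlapping analysis frames (25 ms long, every 10 ms at 16 kHz), a magnitude
    spectrum per frame, and a vector of triangular filter outputs per frame."""
    bin_freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    edges = np.linspace(0.0, sample_rate / 2.0, num_filters + 2)   # overlapping band edges
    filters = np.stack([
        triangular_filter(bin_freqs, edges[k], edges[k + 1], edges[k + 2])
        for k in range(num_filters)
    ])
    window = np.hamming(frame_length)
    features = []
    for start in range(0, len(waveform) - frame_length + 1, frame_shift):
        magnitude = np.abs(np.fft.rfft(waveform[start:start + frame_length] * window))
        features.append(filters @ magnitude)        # one feature vector per analysis frame
    return np.array(features)                       # shape: (number of frames, num_filters)

filterbank_features(waveform).shape   # e.g. (98, 24): far fewer, lower-dimensional vectors
```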
