Cochlea, Mel-Scale, Filterbanks

From human speech perception to considerations for features for automatic speech recognition

This is the start of the speech recognition part of the course, and we're on a journey towards a model called the Hidden Markov model (HMM).
We're gonna start with something much simpler than a model, something called pattern matching.
That's in this module.
We'll then move on to thinking about the features that we extract from speech a bit more deeply.
We will do some feature engineering and then finally we'll get to the hidden Markov model, which is a probabilistic generative model.
Along the way, we're going to learn an algorithm called Dynamic Programming, and we're going to see that now in the form of Dynamic Time Warping.
We'll see that when we move from pattern matching to a probabilistic model, we might have specific requirements for the features that we extract from speech.
There's some interaction between the model we use and the features that we choose.
That's why we need to do feature engineering.
Then we'll see Dynamic Programming again for the hidden Markov model in the form of the Viterbi algorithm.
You already know that the waveform is not the right representation for pattern recognition.
For example, it includes phase information, which we know isn't particularly useful for discriminating, for example, one phone from another.
You have a reasonable idea of this concept of a feature vector, but we're going to develop that a little bit more in this video.
We're going to throw away the waveform and replace it with a sequence of feature vectors.
We're going to use the sequence of feature vectors for pattern matching, using a very simple method with whole word templates.
But even this very, very simple and outdated method that we're going to talk about now already has to tackle the fundamental problem of dealing with sequences of different lengths.
So you know, the waveform is not a good representation because it combines source and filter, and it includes phase.
We only want the filter, so we want something like the spectral envelope.
We know that speech waveforms change over time and that we can use short-term analysis to deal with that.
We're going to extract a feature vector from each frame in a short-term analysis of the speech waveform.
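Purely as a reminder of what that short-term analysis looks like in practice, here is a minimal sketch of slicing a waveform into overlapping frames; the 25 ms frame length and 10 ms shift are typical illustrative values, not something prescribed here.

```python
import numpy as np

def frame_signal(waveform, sample_rate, frame_length_s=0.025, frame_shift_s=0.010):
    """Slice a waveform (assumed at least one frame long) into overlapping frames."""
    frame_length = int(frame_length_s * sample_rate)   # samples per frame
    frame_shift = int(frame_shift_s * sample_rate)     # samples between frame start times
    num_frames = 1 + (len(waveform) - frame_length) // frame_shift
    frames = np.stack([
        waveform[i * frame_shift : i * frame_shift + frame_length]
        for i in range(num_frames)
    ])
    # A tapered window (here, Hamming) is applied to each frame before analysis.
    return frames * np.hamming(frame_length)
```

One feature vector will then be extracted from each row of that array of frames.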
To deal with this problem of sequences of different lengths, we're going to need an algorithm.
The core of the algorithm is going to be about finding an alignment between two sequences.
In the previous videos, we suggested that linear time warping would be one possible solution.
Clearly, that's too naive.
Speech doesn't behave like that.
And so now we're going to develop the real algorithm, which is going to be a dynamic or non-linear time warping to align two sequences of differing lengths.
So here's what we're going to learn.
It divides into two parts.
In the first part, we're going to extract features from frames of speech, and the destination of that first part is a sequence of feature vectors to replace the waveform, and that will be our representation of speech for doing pattern matching.
In the second part, we'll take that sequence of feature vectors, and we'll match one sequence that we know the label for against another sequence that we're trying to put a label on.
In other words, we're trying to do speech recognition.
We'll find the distance between those two sequences of feature vectors.
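We haven't yet said what "distance" means for a single pair of feature vectors. Here is a minimal sketch, assuming a plain Euclidean distance between one frame from each sequence (that choice of distance measure is an assumption for illustration, not the only option):

```python
import numpy as np

def local_distance(x, y):
    """Euclidean distance between two feature vectors (one frame from each sequence)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# The distance between two whole sequences is then built up from many of these
# local, frame-to-frame distances, once we know how to align the two sequences.
```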
Let's start on the first part.
In the beginning of this course, when we talked about signal processing and how speech was produced, we took some inspiration from speech production, and that led us to a source filter model, which we could use to explain speech signals and could be the basis for speech synthesis and all sorts of other things.
That's still relevant.
But also relevant would be some understanding of speech perception.
So let's start with that, because we're going to take some inspiration from human speech perception to do feature extraction for automatic speech recognition.
The cochlea is part of our hearing system.
On this diagram, it's this part here.
The cochlea acts like a bank of filters that respond to different bands of frequencies in the incoming audio.
Of course, the sound coming in isn't always speech, but we're only interested in speech here.
So sound comes into the ear and it ends up in the cochlea.
In our bodies, the cochlea is coiled into a spiral.
But that is just to save space, because we need to leave more room in our head for our brain!
And so to understand that, we don't need to worry about it being a spiral.
We can just draw it out as a straight structure like this.
So sound comes into the ear as a sound wave, and it propagates down the cochlea.
Different places along the cochlea (this is place) respond to different frequencies in the incoming sound wave.
They're effectively little resonating filters, each of them responding to a narrow band of frequencies.
At one end of the cochlea, those little resonating filters respond to high frequencies.
At the other end of the cochlea, they respond to low frequencies.
So we can understand the cochlea, which is the device in our bodies that converts sound pressure waves into nerve signals to be sent off for interpretation by the brain, as a bank of filters.
The cochlea spreads frequency along its length: along place.
One of the most important features of this bank of filters in the cochlea is that it's not spaced linearly on a Hertz scale;
it's spaced quite non-linearly.
There are lots of mathematical functions that are used to approximate that non-linearity; the most common one in speech technology is called the mel scale.
The mel scale looks like this.
As we increase linearly in Hertz, the mel scale is compressive.
It curves this way.
So, for example, the difference in Hertz here at these high frequencies is a relatively small difference in mels,
but down here, the same difference in Hertz is a much larger difference in mels.
It is this compressive nature of this curve that's important.
What that means is that human hearing is less sensitive to frequency differences up here in the high frequency range than in the low frequency range.
We're going to use that when we extract features from speech signals for speech recognition.
We're going to extract more (= denser) features in this sort of frequency range, which happens also to be where most of the information in speech is,
and fewer (= coarser) features up in the higher frequency range.
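For concreteness, here is one widely used formulation of the mel scale (several variants exist; the constants below are from the common 2595 log10 form, which is an assumption rather than the only definition):

```python
import numpy as np

def hz_to_mel(f_hz):
    """One common approximation of the mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """The inverse mapping, from mels back to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The compressive behaviour: the same 1000 Hz difference spans far fewer mels
# at high frequencies than at low frequencies.
print(hz_to_mel(8000) - hz_to_mel(7000))   # about 138 mels
print(hz_to_mel(1500) - hz_to_mel(500))    # about 683 mels
```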
So the cochlea is like a bank of filters, and they're spaced non-linearly in Hertz or linearly on some perceptual scale, such as the mel scale.
We can use that, along with our knowledge of the spectral envelope, to start extracting some useful features for automatic speech recognition.
So let's see how a mel-scaled filter bank could be used to extract the spectral envelope from a speech signal.
So here, when I say auditory system, I really just mean the peripheral part, the cochlea: the bit that's converting sound waves into nerve impulses.
We're not talking about any processing in the brain.
Let's remember what a band pass filter is.
It's something that has a lower frequency limit and an upper frequency limit, and it extracts those frequencies from the input signal.
It's easiest to draw in the frequency domain, where it rejects everything outside of its band.
It accepts everything inside its band.
That will be an idealised, perfect rectangular bandpass filter drawn in the frequency domain.
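To make that idealised filter concrete, here is a minimal sketch of applying a rectangular band pass filter in the frequency domain: keep the DFT bins inside the band and zero everything outside it (the band edges are parameters you would choose; nothing here is specific to the cochlea):

```python
import numpy as np

def ideal_bandpass(signal, sample_rate, f_low, f_high):
    """Idealised rectangular band pass filter, applied in the frequency domain."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Reject everything outside the band; accept everything inside it.
    spectrum[(freqs < f_low) | (freqs > f_high)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))
```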
The cochlea is like a set of such filters: a bank of filters,
and they are spaced along a mel scale, so they get further and further apart in Hertz at higher frequencies, and their widths also get wider.
So a simplified idea of what the cochlea does would be that there are band pass filters like that.
That's one of them, and then another one a little wider,
and then another one, wider still, and so on.
I've just simplified the cochlea as only having four bands.
Of course, it's got many more than that,
but we can see that the centre frequencies of these filters get further and further apart as we go up in frequency,
and the bandwidths of these filters also get wider.
These are band pass filters: they pass a band of frequencies.
Now these idealised rectangular bandpass filters aren't actually possible physically: the cochlea isn't like that.
The cochlea has a set of overlapping filters that have a more realistic shape.
A slightly better approximation than a bank of rectangular filters would be a bank of triangular filters that overlap.
So we might have filters that look like this: getting wider and wider and further and further apart.
And the spacing here is on a mel scale.
Again, I've just drawn four filters; we're going to use more than that to simulate what the cochlea does.
So we've got a simplified model of the cochlea as a sequence of triangular filters: a bank of triangular filters.
We're going to use that now to extract features from speech waveforms.
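To make that concrete, here is a rough sketch of how such a bank of overlapping triangular filters, spaced on the mel scale, might be constructed and applied to one frame of speech. The number of filters, FFT length, and the exact construction are illustrative assumptions, not values fixed by this course:

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_triangular_filterbank(num_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular band pass filters whose edges are equally spaced on the mel scale."""
    if f_high is None:
        f_high = sample_rate / 2.0
    # Filter edge frequencies: equally spaced in mels, then converted back to Hertz.
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    bin_indices = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    filterbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(num_filters):
        left, centre, right = bin_indices[i], bin_indices[i + 1], bin_indices[i + 2]
        for k in range(left, centre):                   # rising slope of the triangle
            filterbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                  # falling slope of the triangle
            filterbank[i, k] = (right - k) / max(right - centre, 1)
    return filterbank

def log_mel_filterbank_features(frame, sample_rate, num_filters=20, n_fft=512):
    """One feature vector describing the spectral envelope of a single frame."""
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft))      # magnitude spectrum: phase is discarded
    fbank = mel_triangular_filterbank(num_filters, n_fft, sample_rate)
    energies = fbank @ spectrum                         # one energy per triangular filter
    return np.log(energies + 1e-10)                     # log compression, avoiding log(0)
```

The details vary between implementations; the point to notice is the non-linear spacing of the filter centres and the widening of the filters as frequency increases, echoing the cochlea.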
