Module 7 – speech recognition – feature engineering

To get the best out of machine learning, we can prepare features that reflect our knowledge of the problem, and suit our chosen model.

Total video to watch in this section: 16 minutes

We are going to use Gaussian pdfs, and that places some requirements on the properties of the features that we will model.

This video just has a plain transcript, not time-aligned to the video.
Now we're going to do what we said we would do at the beginning.
We're going to engineer features that don't have this property, that are OK with this property, or indeed, if we're really lucky, have this property.
But in general, this case is good enough, because here Sigma is a vector, and that's just linear in the number of dimensions, and mu is always a vector of the same number of dimensions.
So this case – the diagonal covariance – is good enough.
So we're going to try and engineer features that have this property.
Now, what have our features been so far? Very naive features.
Well, when we go and get a bit of speech – a 25 millisecond frame of speech – we transform it into a vector of features.
We've been pretending that those features are just the magnitude spectrum, and the magnitude spectrum might have thousands of points in it.
So here's a spectrum: that axis is frequency, and that's the magnitude.
Let's just assume it's smooth.
Forget about harmonics for a moment and imagine the spectrum looks like this. Then our feature vector is just the values that describe this curve, and those are written into a vector: that one goes in there, that one goes in there.
So our feature vector is the magnitude spectrum.
In the spectrum, the energy at this particular frequency tends to go up and down at the same time as the energy just below it and just above it: they're highly correlated.
They're not completely independent, because this is a smooth, continuous curve: a spectral envelope.
So they highly co-vary.
The magnitude spectrum exhibits this bad property of covariance.
So we're going to get rid of that.
At the same time, we're going to do a few other nice things to it that will make it even smaller and even less correlated, and therefore a better fit to the Gaussian probability density function.
Covariance is a problem because it increases the number of parameters in our Gaussian as the square of the dimension of the feature vector.
That's a very bad property: anything that goes up with the square of something doesn't scale very well.
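To make that scaling concrete (this is just the standard parameter count for a single D-dimensional Gaussian, not anything specific to a particular toolkit): a full-covariance Gaussian has

\[ D + \frac{D(D+1)}{2} \]

parameters (the mean vector plus the symmetric covariance matrix), whereas a diagonal-covariance Gaussian has only \( D + D = 2D \) (the mean vector plus one variance per dimension). For example, with D = 12 that is 12 + 78 = 90 parameters versus just 24.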
And the more parameters there are, the more data we will need.
Data is always sparse; there is always too little.
We're always pushing our models to the limit of the data.
Right, so let's plug that in and replace the local distance measure with this probability measure.
That's good.
It's better because it accounts for variance.
However, it's got this bad property that we must have independent dimensions in the feature vector.
No correlation.
So we're going to get rid of that correlation, because these FFT coefficients – the magnitudes of the FFT coefficients – are no good.
They co-vary, because the energy in one region of the spectrum tends to all go up and down together.
Neighbouring frequency bins go up and down together.
So we're going to get rid of that covariance.
We're going to perform some transformations on the vector to decorrelate its dimensions.
So we're onto another core concept, and this is perhaps a little bit tricky.
So let's try and motivate it carefully as we go along.
The thing we're trying to do, most of all, is to come up with a feature vector that has two nice properties.
One: it has as few dimensions as possible but still captures all the important information.
If we can use 12 dimensions instead of 1000 dimensions, that's great.
We'll need less data to estimate the model.
The model will be small.
It'll be faster.
It's a win-win situation, so we'll make the vectors as small as possible.
And second, and just as important, is that the elements of the vector do not co-vary, in a statistical sense, across lots of different frames of data.
As one goes up and down, the others don't systematically go up and down with it in sympathy.
They go up and down independently, so all the dimensions are independent.
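As a quick illustration of that second requirement – a minimal sketch, assuming numpy is available; the framing itself is left out and the variable name frames is hypothetical – we can measure how strongly adjacent magnitude-spectrum bins co-vary across many frames:

import numpy as np

def adjacent_bin_correlation(frames):
    # frames: array of shape (num_frames, frame_length), e.g. 25 ms windowed frames
    mag = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrum of every frame
    corr = np.corrcoef(mag.T)                   # correlation between bins, measured across frames
    return np.mean(np.diag(corr, k=1))          # average correlation of each bin with its neighbour

For speech, you should find this average is close to 1 – neighbouring bins are highly correlated, which is exactly the covariance the rest of this module is engineering away.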

Those requirements can be imposed by a little clever engineering.

This video just has a plain transcript, not time-aligned to the video.
We're going to motivate it, actually, by thinking about human perception.
So as we do this shrinking of the dimensionality, this decorrelation, we're going to fold in some other nice tricks – fold in some knowledge of human perception – to try and throw away information that we know is not important.
If we've got 1000 bins in our FFT magnitude spectrum and we're trying to squash that down to, let's say, 12 numbers, what do we throw away? How do we decide what to throw away? Think about the human perceptual system.
In particular, in the ear, the cochlea does all sorts of non-linear transformations on the signal.
One particular thing it does concerns intensity.
The perceptual correlate of intensity is loudness.
Loudness is not linearly proportional to the amplitude of the signal.
If we take these waveforms here and make them half the size, they're not just half the loudness.
There's a very non-linear relationship between the energy in the waveform and the loudness, so that we can hear very quiet things and very loud things: there's a compressive function in the hearing system.
It's something like taking the logarithm: a non-linear, compressive function.
Things have to have many times as much energy just to sound twice as loud.
So let's try and build that in.
That's probably important.
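That compressive relationship is why sound levels are normally quoted on a logarithmic (decibel) scale. As a rough sketch – not a precise psychoacoustic model – the level of a signal with energy E relative to a reference energy E_ref is

\[ \text{level in dB} = 10 \log_{10} \frac{E}{E_{\mathrm{ref}}} \]

and perceived loudness tracks something like this logarithmic quantity rather than E itself.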
We also have a limited range of frequencies that we can hear – maybe this applies to you; I'm not sure it any longer applies to me.
We have a lower limit and an upper limit on what we can hear.
Below the lower limit, we'll just hear individual pulses: we'll start hearing things in the time domain.
Above the upper limit, we just won't hear anything at all: the little hair cells in the cochlea will have died off.
(If you don't know what the hair cells or the cochlea are, you're going to do some readings.)
So we can limit the frequency range which we analyse, and in fact 16 kilohertz is going to be too high: there's not much information above eight kilohertz, so we'll just limit our analysis to eight kilohertz and represent nothing above that.
The lower limit will just be quite small: 20 Hz, or close to zero.
The next important thing is that our ability to discriminate between frequencies rapidly deteriorates at higher frequencies.
If I play you two pure tones and ask you whether they're the same or different, you can do a very good job at low frequencies: they can be very close, and you can still hear the difference.
But at high frequencies, they have to be much further apart – tens, possibly hundreds of Hertz apart – before you can say that they're different.
So there's this non-linearity in the frequency scale, and this non-linearity in the amplitude scale.
They're both non-linear so that we can concentrate on the important energies: in the right frequency range and the right loudness range.
We're going to build these things in as we derive our low-dimensional feature vector; we'll use them as motivation.
We're not going to build a model of human hearing.
We're not going to literally model the cochlea with a fancy mathematical model.
We're just going to do something that exhibits similar properties, so we'll limit the frequency range to eight kilohertz.
So what does the sampling rate of the waveform need to be, if we want to analyse things up to eight kilohertz? 16 kilohertz – think Nyquist frequency.
What does that mean? It means that whenever we record speech for the purpose of doing automatic speech recognition, it's highly likely we're going to convert it to a sampling rate of 16 kilohertz before we do anything else.
So when Google do their subtitling of YouTube videos, the audio track that people upload might be at 44.1 kilohertz – a standard rate to go along with the video. They'll downsample it to 16 kilohertz, do the feature extraction, and do the speech recognition.
They won't do it directly from the original soundtrack.
We're going to build in this frequency discrimination that gets worse as frequency increases, and we're going to build in this amplitude compression: the fact that really loud things get squashed to be less different in loudness.
OK, so how do we do it? Well, we limit the sampling rate:
16 kilohertz, as on the slide.
If you read ahead: we're going to warp the frequency scale.
We're not going to use Hertz anymore; we're going to use something that's a bit like the frequency scale in the cochlea.
And we're going to non-linearly compress the amplitude, for example by taking the logarithm.

The filterbank is the first step in feature engineering: it warps the frequency scale and removes F0.

This video just has a plain transcript, not time-aligned to the video.
How are we going to warp the frequency scale? We're going to put the speech through something whose outputs are non-linearly spaced along the frequency axis.
So if you think about what the cochlea does – that's the little snail-like spiral thing inside the ear – it just lays out frequencies along a physical axis.
That's called a tonotopic, or place-frequency, representation.
At the higher frequencies, there are fewer and fewer hair cells per bit of bandwidth.
There's less and less resolution.
We're going to simulate that in a very crude way.
This is the spectral domain.
That axis is frequency, in Hertz, so it's a linear frequency scale.
We're going to put in some filters: band-pass filters.
Cast your mind back, perhaps to the second lab session, where you put speech through some band-pass filters and just listened to the energy in narrow bands.
We're going to put in band-pass filters that gather the energy from little regions of the frequency scale.
So the first filter rejects everything except this narrow range down here: it sums up the energy and pushes it out as a single number.
This is for one frame of speech – one 25 millisecond frame of speech.
And as we go up the scale, we're going to make those filters wider and wider.
The next one higher up the scale might be this one here, which is going to gather energy across a wider range of frequencies.
At the high end of the scale, the spacing gets wider and wider as we go up the frequency scale.
We're going to space them on some perceptual scale: something that's inspired by measurements people have made of human hearing.
One such scale is called the mel scale.
There are other scales you could use; the other popular one is called the Bark scale.
They're very similar.
They all have the same property: the filters just get wider further up the frequency scale.
It's non-linear.
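For reference, one commonly used formula for the mel scale (there are several variants in the literature) maps a frequency f in Hertz to

\[ m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \]

so filters spaced evenly in mels sit close together at low frequencies and further and further apart, in Hertz, at high frequencies.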
So that warps the frequency scale: that's that done.
It also does something else that's extremely useful.
These filters are going to be wider than the spacing between the harmonics.
This is what speech looks like: it's got an overall envelope, and it's got these harmonics.
These filters are going to be wider than F0 – several times F0.
So they're going to gather together, for example, this range of frequencies, and so their outputs will have smoothed away all evidence of F0.
So we'll be capturing the spectral envelope.
So this filterbank does multiple jobs.
It warps the frequency scale.
And it's a very cheap, efficient way of getting the spectral envelope out and removing all the evidence of F0, which we don't want for speech recognition. It does that simply by the filters being wide enough to smear it away.
Think of it as just blurring away F0: averaging across ranges of frequencies that are wide enough that we don't have the resolution to see F0.
So if we plot these filterbank outputs, what we'll see is a crude version of the spectral envelope, on a non-linear frequency scale, and it won't go up and down fast enough to be able to capture F0 – that's been removed.
So we've got the spectral envelope on a warped frequency scale, and the scale we might use would be something like this mel scale.
What we've got to so far is that, instead of simply using the FFT magnitude spectrum as our feature vector – we can't do that; it's not quite right yet – we use the outputs of this filterbank.
So we just put those into a vector; in this diagram, it's just going to be a four-dimensional vector.
These four numbers are what we're now going to do pattern recognition on.
They're already better than the magnitude spectrum because, A, they don't have any evidence of F0, so they're independent of the underlying pitch of the speech, and, B, they're on a non-linear, warped frequency scale.
And so they put more coefficients – more importance – at the lower frequencies, which is where there's more information in the speech signal, and they have a very crude representation of the higher frequencies.
So for things like fricatives, we can capture the difference between [s] and [sh], but nothing much more than that at these high frequencies.
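Here is a minimal sketch of that filterbank analysis, assuming librosa is available; the specific choices (23 filters, 25 ms windows, 10 ms frame shift, 8 kHz upper limit) are illustrative, not prescribed by the course:

import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr)  # stand-in for one second of real speech at 16 kHz

# 25 ms analysis windows (400 samples) with a 10 ms frame shift (160 samples)
S = np.abs(librosa.stft(y, n_fft=512, win_length=400, hop_length=160))

# Triangular band-pass filters spaced on the mel scale, up to 8 kHz
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=23, fmin=0, fmax=8000)

# Each filter sums the energy in its band: one number per filter, per frame
filterbank_outputs = mel_fb @ (S ** 2)
print(filterbank_outputs.shape)  # (23, number_of_frames)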

After the filterbank, the final step is to take the logarithm of each filter's output and then apply a decorrelating transform.

Currently there is no video clip for this step. Use the readings to begin your understanding. We’ll look at these two important steps in detail in the lecture.
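As a starting point before the readings, here is a minimal sketch of those two steps, assuming scipy and numpy (the filterbank outputs are stand-in data so the sketch runs on its own, and the small constant just avoids taking the log of zero):

import numpy as np
from scipy.fft import dct

# Stand-in for real filterbank outputs: shape (n_filters, n_frames), all values >= 0
filterbank_outputs = np.abs(np.random.randn(23, 100))

# Step 1: compress the amplitude range with a logarithm
log_fbank = np.log(filterbank_outputs + 1e-10)

# Step 2: decorrelate with a Discrete Cosine Transform along the filter axis,
# keeping only the first few coefficients (here 13) as the MFCCs
mfccs = dct(log_fbank, type=2, norm='ortho', axis=0)[:13, :]
print(mfccs.shape)  # (13, 100)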

A sketch of the complete process.

This video just has a plain transcript, not time-aligned to the video.
We talked about different ways of recognising patterns, and we talked about the simplest way we could think of, which is to match the things we need to label against things we've already labelled.
Those things that are already labelled we call templates, or reference patterns.
And so we got this idea of comparing two things.
Those things are sequences, and we went through a number of steps of thinking about how you might compare two sequences.
We came up with this idea of dynamic time warping, which is that we need to stretch them by different amounts to find a nice alignment between them.
And then we realised that the features that we were using might not be very good, and so we developed better features, and that's how far we've got.
This rather messy but colourful diagram here takes the speech signal at the top – notice that we are labelling our axes; there was something in the feedback about that.
We're taking this waveform and, for each frame – we're taking 25 millisecond frames – we're applying a tapered window, so we get rid of edge effects and don't have discontinuities.
We get windowed frames, we take their spectrum, and we saw that phase is perceptually very unimportant.
So we'll get rid of the phase and just take the magnitude spectrum.
That's a good set of features, except that it's got evidence of F0 in it:
these harmonics, this line structure.
So we're going to smooth that away.
We're just going to blur it, just like looking through a blurry lens, and the way we do that is with this filterbank, whose filters are wider than the spacing between any two harmonics of F0.
The width of the filters is several times bigger than the fundamental frequency.
While we're doing that, we can play another couple of nice tricks: we can space the filters further and further apart up the frequency range, to simulate what happens in human hearing.
We might do that on a scale such as the mel scale.
The outputs of these filters will be a nice, simple, low-dimensional representation of the spectral envelope, warped onto a mel scale.
Then we'll use some amplitude range compression – another thing the human hearing system does – by taking the log, and we'll get this log magnitude spectrum on a mel scale.
These are the outputs of the filterbank.
This set of features would actually be a really good set of features to do speech recognition.
But if we're going to fit Gaussian distributions to it, we need to get rid of the correlation.
This is represented by the outputs of the filters:
discrete points here, maybe 20 or 25 of them, and adjacent filterbank outputs are highly correlated.
So if we fit a Gaussian to them, we're going to have extra parameters
that model that covariance.
We don't want to do that, because the number of covariances scales like the square of the dimension of the features.
And that's very bad scaling.
So we decorrelate these features.
If we were not using Gaussians –
if we were using some other model, for example pushing these things through a neural network to do classification – we could just stop here.
These are good features.
They just have the inconvenient statistical property of correlation, but we don't want extra parameters in our model, so we decorrelate them through the magic of the Discrete Cosine Transform (which might get called other things).
It's very much like a Fourier transform: it just captures the shape in a set of coefficients.
Think of them as a sort of shape coefficients, and those shape coefficients get packed into a vector to give these magic features called Mel Frequency Cepstral Coefficients.
These are the features we're going to do speech recognition with, when we're fitting Gaussians to these features.
Each item in here represents something about the shape of the spectrum and is essentially not correlated with its neighbours.
So we can fit separate Gaussians to each of the elements in this vector, or a multivariate Gaussian with no covariance parameters.
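In practice, all of the steps in that diagram are usually wrapped up in a single library call. Here is a minimal sketch using librosa – the toolkit choice and the parameter values are mine, for illustration, not something the course prescribes:

import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr)  # stand-in for one second of real speech at 16 kHz

# framing, windowing, magnitude spectrum, mel filterbank, log, and DCT in one call
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=512, win_length=400, hop_length=160,
                             n_mels=23, fmax=8000)
print(mfccs.shape)  # (13, number_of_frames)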

Here is a link to a recording of the class, along with written notes for the main questions asked in-class.

Find the recording in the General channel on Teams, or via this link.

Here is a summary of the main questions asked in the live class. You may need to use the pop-out link to see the full text.

How exactly do MFCCs separate source and filter?

The cepstrum “unpacks” / “spreads out” / “separates” the source and filter along the quefrency axis, making it easy to draw a line between them. Truncation of the cepstrum results in only retaining the filter. Although MFCCs are not the true cepstrum (because of the mel scale and the filterbank), they have this same property.
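To spell that out with the standard cepstrum definitions (the notation here is mine): because the source is passed through the vocal tract filter, the magnitude spectrum is a product, and taking the logarithm turns that product into a sum,

\[ \log |X(\omega)| = \log |E(\omega)| + \log |H(\omega)| \]

where E is the source and H is the filter. The cepstrum is a further Fourier-style transform of this log spectrum, so the slowly varying envelope (filter) lands at low quefrency and the rapidly varying harmonic structure (source) at high quefrency; truncation then keeps only the filter part.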

If the series expansion reduces covariance between MFCCs, then does the DFT also result in reduced covariance in the spectrum compared to the waveform?

I cannot find any literature or empirical evidence to answer this question. What we can say is that the Fourier series expansion of the DFT “unpacks” the information in waveform samples and lays it out along the frequency axis so that we can – at least visually – attribute aspects to the source (harmonics) and filter (envelope). The cepstral cosine series expansion “unpacks” the spectrum and lays it out along the quefrency axis.

Why not use an explicit source-filter model (with LPC filter) instead of MFCCs?

A very sensible proposition – we could, and people did in the past. Before MFCCs, this was the dominant approach. Solving for the co-efficients of the difference equation (given a frame of waveform) involves solving a set of simultaneous equations and this can be error-prone. Making a hard decision that might be inaccurate at such an early stage is a bad strategy. Also, the difference equation co-efficients are numerically unstable (but there are equivalents that are stable, such as LSFs – well out of scope for Speech Processing).
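For reference, the difference equation in question is the standard all-pole (LPC) model, in which each sample is predicted as a weighted sum of the previous p samples plus an excitation term:

\[ s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + e[n] \]

Solving for the co-efficients a_k from a frame of waveform is the error-prone step referred to above.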

Why not just use the waveform samples as features?

Remember phase! The same sound/phoneme can have radically different-looking waveforms. So, how about the DFT magnitude spectrum? Better (no phase) but it still contains F0 and is also high-dimensional. Maybe OK for neural models, definitely not for Gaussians. The filterbank is even better, but needs decorrelating – hence MFCCs.

Can we interpret the MFCCs (or the true cepstrum)?

Not easily – all we can say is that the filter is represented in the lower-quefrency range and the source can be found (for voiced speech) as a small peak at higher quefrency. That’s a possible method for F0 estimation, although not the one most commonly used (out of scope for Speech Processing – see Speech Synthesis in Semester 2).

Why do we add deltas to the MFCC features?

Full answer is coming when we talk about HMMs, but it’s to mitigate yet another independence assumption we are about to make: that each feature vector in the observation sequence is independent of the rest. (In the Module 7 live class I did not say “conditionally independent given the HMM state” but will be saying that in Module 8.)
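For reference, a commonly used regression formula for the deltas (for example, the one in HTK) computes them from a window of ±N frames of the static coefficients c_t:

\[ \Delta c_t = \frac{\sum_{n=1}^{N} n \,(c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^{2}} \]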

Won’t the deltas be highly correlated with the statics?

Please ask this again in Module 8!

 

Reading

Jurafsky & Martin – Section 9.3 – Feature Extraction: MFCCs

Mel-frequency Cepstral Co-efficients are a widely-used feature with HMM acoustic models. They are a classic example of feature engineering: manipulating the extracted features to suit the properties and limitations of the statistical model.

Taylor – Section 12.3 – The cepstrum

By using the logarithm to convert a multiplication into a sum, the cepstrum separates the source and filter components of speech.

This is a SKILLS tutorial about shell scripting.

The tutorial will teach you all the shell programming techniques you need to complete the second assignment, all the way up to fully automating your experiments.

Make sure you are comfortable in the shell – revisit the Linux course from Module 0 Tutorial B if necessary.

Browse the forums on shell scripting including this useful all-in-one mini-tutorial.

Prepare for the tutorial session

Practice writing shell scripts that use the techniques from the all-in-one mini-tutorial. Use the forums to get help and solve your problems, and bring unresolved problems to the tutorial. On the forums, post your script as text, not a screenshot and, where applicable, the full error message you get.

Complete all the milestones to date.