The complete process

A sketch of the complete process.


This video just has a plain transcript, not time-aligned to the video.
THIS IS AN UNCORRECTED AUTOMATIC TRANSCRIPT. IT MAY BE CORRECTED LATER IF TIME PERMITS.
We talked about different ways of recognising patterns, and we talked about the simplest way we could think of, which is to match the things we need to label against things we've already labelled.
And those things that were already labelled,
we call templates or reference patterns.
And so we got this idea of comparing two things.
Those things are sequences. We went through a number of steps of thinking about how you might compare two sequences.
We came up with this idea of dynamic time warping, which is that we need to stretch them by different amounts to find a nice alignment between them.
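As a reminder of what that alignment involves, here is a minimal sketch of dynamic time warping between a template and an unknown sequence. The function name `dtw_distance`, the Euclidean local distance, and the three allowed moves (horizontal, vertical, diagonal) are assumptions made for illustration, not details given in the lecture.

```python
import numpy as np

def dtw_distance(template, unknown):
    """Cost of the best alignment between two sequences of feature vectors.

    Both inputs are 2-D arrays of shape (frames, dimensions).
    """
    a = np.asarray(template, dtype=float)
    b = np.asarray(unknown, dtype=float)
    n, m = len(a), len(b)

    # local[i, j]: Euclidean distance between frame i of a and frame j of b
    # (the choice of Euclidean distance is an assumption).
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    # cost[i, j]: cheapest way to align the first i frames of a
    # with the first j frames of b.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(cost[i - 1, j],      # stretch the unknown
                                cost[i, j - 1],      # stretch the template
                                cost[i - 1, j - 1])  # advance both
            cost[i, j] = local[i - 1, j - 1] + best_previous
    return cost[n, m]
```

To label an unknown sequence, one would compute this distance against every template and pick the label of the closest one.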
And then we realised that the features that we were using might not be very good, and so we developed better features, and that's how far we've got.
This rather messy but colourful diagram here is taking the speech signal at the top. Notice that we are labelling our axes.
There was something in the feedback about that. We're taking this waveform frame by frame, so we're taking 25 millisecond frames.
We're applying a tapered window, so we get rid of edge effects and don't have discontinuities.
We get windowed frames, we take their spectrum, and we saw that phase is perceptually very unimportant.
So we'll get rid of the phase and just take the magnitude spectrum.
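The first few steps can be sketched in code as follows. Only the 25 millisecond frame length and the idea of a tapered window come from the lecture; the 16 kHz sample rate, the 10 millisecond frame shift, the choice of a Hamming window, and the name `magnitude_spectra` are assumptions for illustration.

```python
import numpy as np

def magnitude_spectra(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Cut a waveform into tapered frames and return their magnitude spectra.

    The signal must be at least one frame long.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # assumed 10 ms frame shift
    n_frames = 1 + (len(signal) - frame_len) // shift

    # Tapered window (Hamming is an assumed choice) to avoid edge discontinuities.
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])

    # Complex spectrum of each windowed frame; taking the absolute value
    # discards the phase and keeps only the magnitude.
    return np.abs(np.fft.rfft(frames * window, axis=1))
```

Each row of the result is one frame, and each column is one frequency bin between 0 Hz and half the sample rate.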
That's a good set of features, except that it's got evidence of F0 in it.
These harmonics, this line structure.
So we're going to smooth that away.
We're just going to blur it, just like looking through a blurry lens, and the way we do that is with this filterbank. These filters are wider than the spacing between any two harmonics of F0.
So the width of the filters is several times bigger than the fundamental frequency.
And while we're doing that, we can play another couple of nice tricks. We can space them further and further apart up the frequency range to simulate what happens in human hearing.
We might do that on a scale such as the Mel scale.
And the outputs of these filters will be a nice, simple, low-dimensional representation of the spectral envelope, warped onto a Mel scale, with some amplitude range compression.
That's another thing that the human hearing system does. By taking the log, we'll get this log magnitude spectrum on a Mel scale.
These are the outputs of this filterbank.
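A minimal sketch of such a filterbank follows. The triangular filter shape, the particular Mel formula (one commonly used version), the 23 filters, and the 400-sample frames at 16 kHz are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    # One commonly used Mel formula (an assumed choice).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=400, sample_rate=16000):
    """Triangular filters equally spaced on the Mel scale, one row per filter."""
    # n_filters + 2 edge frequencies, evenly spaced in Mel from 0 Hz to Nyquist,
    # so the filters get wider and further apart in Hz as frequency rises.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2),
                                     n_filters + 2))
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(bin_hz)))
    for k in range(n_filters):
        lo, centre, hi = edges_hz[k], edges_hz[k + 1], edges_hz[k + 2]
        rising = (bin_hz - lo) / (centre - lo)
        falling = (hi - bin_hz) / (hi - centre)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_features(magnitude_spectra, bank, floor=1e-10):
    """Filterbank outputs followed by log compression, one row per frame."""
    outputs = magnitude_spectra @ bank.T
    return np.log(np.maximum(outputs, floor))  # floor avoids log(0)
```

Because each filter sums over many neighbouring frequency bins, the harmonic line structure is blurred away and only the overall envelope remains.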
This set of features would actually be a really good set of features to do speech recognition with.
But if we're going to fit Gaussian distributions to it, we need to get rid of the correlation.
So this is represented by the outputs of filters.
So there are discrete points here, maybe 20 to 25 of them, and adjacent filterbank outputs are highly correlated.
So if we fit a Gaussian to them, we're going to have to have extra parameters
that model that covariance.
We don't want to do that, because the number of covariances scales like the square of the dimension of the features.
And that's a very bad scaling.
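To put numbers on that scaling, here is a small sketch that counts the parameters of a single Gaussian with and without covariance terms. The function name is hypothetical; the counts follow from a D-dimensional Gaussian having D means and a symmetric D-by-D covariance matrix.

```python
def gaussian_parameter_counts(dim):
    """Number of parameters in one Gaussian over dim-dimensional features."""
    means = dim
    # A symmetric covariance matrix has dim variances on the diagonal
    # plus dim * (dim - 1) / 2 distinct covariances off it.
    full_covariance = dim * (dim + 1) // 2
    diagonal_only = dim
    return {"full": means + full_covariance,
            "diagonal": means + diagonal_only}

for dim in (25, 50):
    print(dim, gaussian_parameter_counts(dim))
```

Doubling the dimension from 25 to 50 doubles the diagonal-only count, but takes the full-covariance count from 350 to 1325.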
So we decorrelate these features.
If we were not using Gaussians,
if we were using some other model, for example, if we were pushing these things through a neural network to do classification, we could just stop here.
So these are good features.
They just have an inconvenient statistical property of correlation, and we don't want to have parameters for that in our model, so we decorrelate them through the magic of the discrete cosine transform, which might get called other things.
It's very much like a Fourier transform, in that it just captures the shape in a set of coefficients.
Think of them as a sort of shape coefficients, and the shape coefficients get stacked into a vector of magic features called Mel-frequency cepstral coefficients.
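A minimal sketch of that last step, written with an explicit cosine basis so that it needs nothing beyond NumPy. The unnormalised DCT-II form, the choice of keeping 13 coefficients (including the zeroth), and the function names are assumptions for illustration.

```python
import numpy as np

def dct_basis(n_inputs, n_coeffs):
    """Rows are cosine shapes of increasing frequency (unnormalised DCT-II)."""
    n = np.arange(n_inputs)
    k = np.arange(n_coeffs)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_inputs))

def mfcc_from_log_mel(log_mel, n_coeffs=13):
    """Project log filterbank outputs onto the first n_coeffs cosine shapes.

    log_mel has one row per frame and one column per filter.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    basis = dct_basis(log_mel.shape[-1], n_coeffs)
    # Each coefficient says how much of one cosine shape is present
    # in the frame's log Mel spectrum.
    return log_mel @ basis.T
```

Coefficient 0 measures the overall level, and higher coefficients measure progressively finer detail in the shape of the spectral envelope.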
And these are the features we're going to do speech recognition with, when we're fitting Gaussians to these features.
Each item in here represents something about the shape of the spectrum and is essentially not correlated with its neighbours.
So we can fit separate Gaussians to each of the elements in this vector, or a multivariate Gaussian with no covariance parameters.
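Those two options amount to the same thing, which a short sketch can show. Here a Gaussian with no covariance parameters is fitted by taking a mean and a variance per element, and its log likelihood is a sum of one-dimensional Gaussian terms. The function names are hypothetical and the maximum-likelihood variance estimate is an assumed choice.

```python
import numpy as np

def fit_diagonal_gaussian(features):
    """Mean and variance of each element, estimated independently.

    features has one row per frame and one column per coefficient.
    """
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), features.var(axis=0)

def log_likelihood(x, mean, var):
    """Log density under a multivariate Gaussian with no covariance terms.

    This is just the sum of the log densities of separate
    one-dimensional Gaussians, one per element of the vector.
    """
    x = np.asarray(x, dtype=float)
    per_element = np.log(2 * np.pi * var) + (x - mean) ** 2 / var
    return -0.5 * np.sum(per_element, axis=-1)
```

With D-dimensional features this model has only 2D parameters: D means and D variances.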