We're going to motivate it, actually, by thinking about human perception. As we do this shrinking of the dimensionality, this decorrelation, we're going to fold in some other nice tricks; we'll fold in some knowledge of human perception to try and throw away information that we know is not important.

So if we've got 1000 bins in our FFT magnitude spectrum and we're trying to squash that down to, let's say, 12 numbers, what do we throw away? How do we decide what to throw away? Think about the human perceptual system.

In particular, in the ear, the cochlea does all sorts of non-linear transformations on the signal. One particular thing it does concerns intensity. The perceptual correlate of intensity is loudness, and loudness is not linearly proportional to the amplitude of the signal. If we take these waveforms here and make them half the size, they're not just half the loudness: there's a very non-linear relationship between the energy in the waveform and the loudness. That's so that we can hear very quiet things and very loud things; there's a compressive function in the hearing system. It's something like taking the logarithm: a non-linear, compressive function. Things have to have many times as much energy just to sound twice as loud. So let's try and build that in. That's probably important.

We also have a limited range in our frequencies, so maybe this applies to you; I'm not sure it any longer applies to me. We have a lower limit and an upper limit on what we can hear. Below the lower limit we'll just hear individual pulses: we'll start hearing things in the time domain. Above the upper limit we just won't hear anything, because the little hair cells in the cochlea will have died off, and without hair cells in the cochlea we can't hear those frequencies.
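To make the compressive, roughly logarithmic relationship concrete, here is a minimal Python sketch (the function name and constants are my own, for illustration): halving a waveform's amplitude quarters its energy, but on a log scale that is just a constant offset of log 4, not a halving.

```python
import numpy as np

def log_energy(waveform):
    """Log-compressed energy of a waveform, mimicking the
    compressive (roughly logarithmic) loudness response of the ear."""
    energy = np.sum(np.square(waveform))
    return np.log(energy + 1e-12)  # small floor avoids log(0)

# A sine wave, and the same wave at half the amplitude:
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
full = np.sin(2 * np.pi * 440 * t)
half = 0.5 * full

# Halving the amplitude quarters the energy, but on the log scale
# the two differ only by a constant offset of log(4) ~= 1.386.
print(log_energy(full) - log_energy(half))
```

This is only the compression idea, not a full loudness model: real perceived loudness also depends on frequency, which the transcript comes to next.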
So we can limit the frequency range which we analyse, and in fact 16 kilohertz is going to be too high. There's not much information above 8 kilohertz, so we'll just limit our analysis to 8 kilohertz and represent nothing above that. The lower limit will just be quite small: 20 Hertz, or close to zero.

The next important thing is that our ability to discriminate between frequencies rapidly deteriorates at higher frequencies. If I play you two pure tones and ask you whether they are the same or different, you can do a very good job at low frequencies: they could be very close, and you can hear the difference. But at high frequencies they have to be much further apart, tens, possibly hundreds of Hertz apart, before you can say that they're different.

So there's this non-linearity in the frequency scale and this non-linearity in the amplitude scale. Both are non-linear so that we can concentrate on the important energies, in the right frequency range and the right loudness range. We're going to build these things in as we derive our low-dimensional feature vector; we'll use these to motivate it. We're not going to build a model of human hearing; we're not going to literally model the cochlea with a fancy mathematical model. We're just going to do something that exhibits similar properties. So we'll limit the frequency range to 8 kilohertz. What does the sampling rate of the waveform need to be if we want to analyse things up to 8 kilohertz? 16 kilohertz: think Nyquist frequency. So what does that mean?
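One standard way to build in this deteriorating frequency discrimination is a mel-scale warping; the transcript doesn't name a specific scale, so this formula is one common choice rather than the lecturer's. It is roughly linear below 1 kHz and logarithmic above, so a fixed 100 Hz step corresponds to a smaller and smaller perceptual step as frequency rises:

```python
import numpy as np

def hz_to_mel(f_hz):
    """A widely used mel-scale formula: approximately linear
    below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 100 Hz step shrinks on the mel scale at higher frequencies,
# mirroring our poorer frequency discrimination up there.
for f in (100, 1000, 4000, 8000):
    step = hz_to_mel(f + 100) - hz_to_mel(f)
    print(f"{f} Hz: +100 Hz is {step:.1f} mels")
```

Running this shows the 100 Hz step worth over 100 mels near 100 Hz but only about a tenth of that near 8 kHz, which is exactly the "further apart before you can tell them apart" behaviour described above.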
It means that whenever we record speech for the purpose of doing automatic speech recognition, it's highly likely we're going to convert it to a sampling rate of 16 kilohertz before we do anything else. So when Google do their subtitling on YouTube videos, the audio track that people upload might be 44.1 kilohertz, a standard rate to go along with the video; they'll downsample it to 16 kilohertz, do the feature extraction, and do the speech recognition. They won't do it directly from the original soundtrack.

We're going to build in this frequency discrimination that gets worse as frequency increases, and we're going to build in this amplitude compression: the fact that really loud things get squashed to be less different in loudness. Okay, so how do we do it? Well, we limit the sampling rate: 16 kilohertz was on the slide, if you read ahead. We're going to warp the frequency scale: we're not going to use Hertz anymore; we're going to use something that's a bit like the frequency scale in the cochlea. And we're going to non-linearly compress the amplitude, for example by taking a logarithm.
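Putting the sampling-rate and amplitude-compression ingredients together, a minimal sketch of the first steps of such a front end might look like this (the 400-sample frame length and function name are my own illustrative choices, not from the transcript):

```python
import numpy as np

def log_spectrum_frame(waveform_16k, frame_len=400):
    """Log magnitude spectrum of one 25 ms frame of 16 kHz audio.
    At 16 kHz the Nyquist frequency is 8 kHz, so the rfft bins
    cover exactly the 0-8 kHz band we decided to keep."""
    frame = waveform_16k[:frame_len]
    magnitudes = np.abs(np.fft.rfft(frame))   # bins from 0 Hz up to 8 kHz
    return np.log(magnitudes + 1e-12)         # non-linear amplitude compression

# Uploaded audio at e.g. 44.1 kHz would first be downsampled to
# 16 kHz (scipy.signal.resample_poly is one way); here we just
# fake one second of 16 kHz audio with noise.
audio_16k = np.random.default_rng(0).standard_normal(16000)
features = log_spectrum_frame(audio_16k)
print(features.shape)  # (201,): frame_len // 2 + 1 frequency bins
```

The frequency warping step (mel-spaced filters over these bins) would come next; this sketch stops at the log magnitude spectrum the transcript has described so far.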
Feature engineering
Those requirements can be met with a little clever engineering.