Log of the mel values
November 10, 2015 at 22:36 #601
Jurafsky and Martin (chapter 9.3.4) say that taking the log of each of the mel spectrum values makes the feature estimates less sensitive to variations in input (e.g. moving closer to or further from the microphone). Why is that?
November 11, 2015 at 11:00 #602
Their explanation is that the log is a form of dynamic range compression, which is a standard technique in audio engineering used to narrow the range of energies found in a signal.
Another motivation might be to simulate a property of human hearing, which also involves a kind of dynamic range compression so that we can hear very quiet sounds but also tolerate very loud sounds.
However, there is a much better theoretical motivation for taking the logarithm in the spectral domain when extracting MFCCs: it converts a convolution in the time domain, which is equivalent to a multiplication in the spectral domain (of the source spectrum and the vocal tract filter frequency response), into an addition, so that the source and filter can be separated more easily.
Transforming a signal into a domain where a convolution has become addition is called “homomorphic filtering”.
The process of extracting MFCCs from a waveform is approximately a type of homomorphic filtering.
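To make the multiplication-to-addition point concrete, here is a minimal Python sketch (the two spectra are made-up toy arrays for illustration, not real speech):

```python
import numpy as np

# Toy magnitude spectra: a "source" (e.g. glottal excitation) and a
# "filter" (e.g. vocal tract frequency response). Values are made up.
source = np.array([4.0, 2.0, 1.0, 0.5])
filt = np.array([1.0, 3.0, 2.0, 0.2])

# Convolution in the time domain is multiplication in the spectral domain:
combined = source * filt

# Taking logs turns that multiplication into an addition, so the two
# contributions can now be separated by linear operations:
assert np.allclose(np.log(combined), np.log(source) + np.log(filt))
```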
December 8, 2015 at 10:50 #1080
Why is it necessary to take logs of the Mel values when logs are taken in the process of converting frequencies from Hz to Mel anyway?
December 8, 2015 at 12:03 #1081
You’re mixing up two distinct processes.
Warping the frequency scale
There are a variety of perceptually-motivated frequency scales, and we could choose any of them (Mel, Bark, ERB, …). What they all have in common is that they are non-linear. This non-linearity might or might not be implemented as a logarithm, but note that we are not taking the logarithm of the energy of the speech signal; we are just warping the frequency scale. Think of it as stretching the vertical axis of a spectrogram so that the lower frequencies occupy more of the scale and the higher frequencies are squashed into less of it.
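For illustration, here is one common formula for the Mel warping (this particular set of constants is a widely-used convention, not the only one):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the Mel scale (one common convention).
    This warps the frequency axis only; it does not take the log of any
    signal energy."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Lower frequencies are stretched, higher frequencies squashed:
print(hz_to_mel(500) - hz_to_mel(0))      # ~607 mels for the first 500 Hz
print(hz_to_mel(8000) - hz_to_mel(7500))  # ~67 mels for the same span higher up
```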
Taking logarithms of filterbank outputs
Here is where taking logs is crucial: this is the point at which we convert a multiplication in the frequency domain (the source spectrum has been multiplied by the vocal tract frequency response) into an addition (a sum). The subsequent cosine transform then carries that sum into the cepstral domain.
After that multiplication-to-addition conversion, we can separate the source and filter contributions to the sum. This is possible because their contributions lie at different quefrencies. By using a cosine series expansion, we spread these contributions out along the quefrency scale and can then, for example, discard those parts that relate to the source.
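Here is a minimal sketch of that last step, assuming log Mel filterbank energies are already available (the numbers below are made up, and keeping 8 coefficients is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.fft import dct

# Hypothetical Mel filterbank energies for one analysis frame
# (in practice these come from the filterbank stage described above).
log_mel_energies = np.log(np.array([12.0, 15.0, 9.0, 4.0, 3.5, 2.0,
                                    1.8, 1.1, 0.9, 0.7, 0.6, 0.5]))

# The cosine series expansion (a DCT) spreads the summed source+filter
# contributions out along the quefrency axis.
cepstrum = dct(log_mel_energies, type=2, norm='ortho')

# Keeping only the low-quefrency coefficients retains the smooth
# vocal-tract (filter) shape and discards most of the source detail:
mfccs = cepstrum[:8]
```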