Spectral Envelope Confusion
This topic has 1 reply, 2 voices, and was last updated 3 years, 9 months ago by Simon.
November 16, 2020 at 09:40 #13086
I am really confused as to how, conceptually, the MFCC features separate source and filter and give us the spectral envelope.
Here’s what I understand so far – we have a signal, we window it so we can assume it is stationary, we then take the DFT, apply the Mel filterbank, take the log of this spectrum (giving us the cepstrum?), do the discrete cosine transform (which decorrelates the coefficients – through the orthogonal cosines – so we don’t have to model covariance in the Gaussians) and finally get the MFCC features. The first 12 or so are the source, the rest are the filter.
What I don’t understand is how this gives us the spectral envelope and separates source and filter – how do we know the first 12 are the source? Further, I thought the Mel filterbank removed F0 and blurred the harmonics because of the width of the filters, so what evidence of the filter would be left? I thought the harmonics were needed to give us the spectral envelope, because formants are points at which harmonics have greater intensity because of the filter. Really uncertain about this.
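To make my understanding concrete, here is a rough numpy/scipy sketch of the pipeline as I’ve just described it – the filterbank construction and all the parameter choices (frame length, 26 filters, 12 coefficients) are just my own guesses, not taken from any particular toolkit:

```python
import numpy as np
from scipy.fft import dct
from scipy.signal.windows import hamming

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # triangular filters, equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bin_points[i - 1], bin_points[i], bin_points[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=12):
    windowed = frame * hamming(len(frame))          # window, so we can assume stationarity
    spectrum = np.abs(np.fft.rfft(windowed))        # magnitude spectrum from the DFT
    fbank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_mel = np.log(fbank @ spectrum + 1e-10)      # mel filterbank, then log
    cepstrum = dct(log_mel, type=2, norm='ortho')   # discrete cosine transform
    return cepstrum[:n_ceps]                        # keep the first n_ceps coefficients

# toy usage: one 25 ms frame of a synthetic "vowel" (harmonics of 120 Hz) at 16 kHz
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = sum(np.sin(2 * np.pi * 120 * h * t) / h for h in range(1, 40))
print(mfcc_frame(frame, sr))
```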
November 16, 2020 at 15:54 #13091
The cepstrum is created after the cosine transform – the cepstral coefficients are the coefficients of that series expansion (the weights on the cosine basis functions).
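If it helps to see that concretely, here is a tiny numpy sketch (illustrative only, not course code): take the DCT of some stand-in log filterbank outputs, then rebuild them as an explicit weighted sum of cosine basis functions – the cepstral coefficients are exactly those weights.

```python
import numpy as np
from scipy.fft import dct, idct

# stand-in for a log mel filterbank output (26 channels); the values don't matter here
rng = np.random.default_rng(1)
log_mel = np.cumsum(rng.normal(size=26))

c = dct(log_mel, type=2, norm='ortho')        # the cepstral coefficients

# rebuild the log mel spectrum as an explicit weighted sum of cosine basis functions
N = len(log_mel)
k = np.arange(N)
recon = np.zeros(N)
for n in range(N):
    basis = np.cos(np.pi * n * (2 * k + 1) / (2 * N))       # n-th cosine basis function
    weight = c[n] * (np.sqrt(1.0 / N) if n == 0 else np.sqrt(2.0 / N))
    recon += weight * basis

print(np.allclose(recon, log_mel))                           # True
print(np.allclose(idct(c, type=2, norm='ortho'), log_mel))   # True (same reconstruction)
```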
Any form of source-filter separation has to make some assumption about either the form of the filter, or of the source (or both). Otherwise, the problem is insoluble.
For this discussion, let’s assume “the vocal tract filter’s frequency response” and “the spectral envelope” are the same thing.
We assume the lower cepstral coefficients represent the vocal tract filter’s frequency response because they capture the slowly-changing (with respect to frequency) components of the spectrum. The assumption being made is that the vocal tract filter’s frequency response really is rather slowly-changing with respect to frequency.
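You can convince yourself of this with a toy experiment (again an illustrative numpy sketch with made-up numbers, not real speech): construct a log “spectrum” that is a smooth envelope plus a fast ripple standing in for harmonic structure, keep only the first dozen cepstral coefficients, and invert the DCT – what comes back follows the envelope and ignores the ripple.

```python
import numpy as np
from scipy.fft import dct, idct

# toy log "spectrum": a smooth envelope (two broad peaks) plus a fast ripple
n_bins = 128
f = np.linspace(0.0, 1.0, n_bins)
envelope = 3.0 * np.exp(-((f - 0.3) ** 2) / 0.05) + 2.0 * np.exp(-((f - 0.7) ** 2) / 0.03)
ripple = 0.5 * np.cos(2 * np.pi * 40 * f)       # stands in for harmonic structure
log_spectrum = envelope + ripple

c = dct(log_spectrum, type=2, norm='ortho')

# keep only the lowest 12 cepstral coefficients (the slowly-varying part), zero the rest
c_low = np.zeros_like(c)
c_low[:12] = c[:12]
smoothed = idct(c_low, type=2, norm='ortho')

# the smoothed curve stays close to the envelope; the ripple has been removed
print("max deviation from envelope:", round(float(np.max(np.abs(smoothed - envelope))), 3))
print("ripple amplitude           :", 0.5)
```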
You are right that the mel-scale filterbank was designed to smooth away F0, so in fact the truncation step might not be needed for that purpose. Nevertheless, we still want to truncate so that we have a small number of features in our final feature vector. Truncation serves multiple purposes, of which “removal of any remaining traces of F0” is just one.
You are right that we can only observe the vocal tract filter’s frequency response (which includes the formant peaks) at frequencies where the source has energy. For voiced speech, that means at the harmonics. But the vocal tract filter’s frequency response exists at all frequencies, even between harmonics – it’s just that we can’t see it directly. Think of it as “joining the dots” that are the harmonics.
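Here is one last toy sketch of that “joining the dots” idea (made-up numbers again, not a real analysis method): define a smooth log envelope, sample it only at the harmonics of F0 – the only frequencies at which a voiced source lets us observe it – and interpolate between those samples to estimate the envelope at a frequency where there is no harmonic at all.

```python
import numpy as np

# a made-up, smooth "vocal tract" log envelope, defined at all frequencies
def log_envelope(f_hz):
    return (3.0 * np.exp(-((f_hz - 500.0) ** 2) / (2 * 150.0 ** 2))
            + 2.0 * np.exp(-((f_hz - 1500.0) ** 2) / (2 * 200.0 ** 2)))

f0 = 120.0                                  # fundamental frequency
harmonics = np.arange(1, 30) * f0           # the only places we can "see" the filter
observed = log_envelope(harmonics)          # envelope sampled at the harmonics

# "join the dots": interpolate between the harmonic samples
query = 1250.0                              # a frequency between two harmonics
estimate = np.interp(query, harmonics, observed)
truth = log_envelope(query)

print(f"estimated envelope at {query} Hz: {estimate:.3f}")
print(f"true envelope at      {query} Hz: {truth:.3f}")
```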