Live class

A recording of the class is available, along with written notes on the main questions asked in class. Find the recording in the General channel on Teams, or via this link. A summary of those questions follows; you may need to use the pop-out link to see the full text.

How exactly do MFCCs separate source and filter?

The cepstrum “unpacks” / “spreads out” / “separates” the source and filter along the quefrency axis, making it easy to draw a line between them. Truncating the cepstrum retains only the filter. Although MFCCs are not the true cepstrum (because of the mel scale and the filterbank), they have this same property.
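To make this concrete, here is a minimal sketch of cepstral truncation (liftering) using only NumPy. It is illustrative, not from the class: the toy “voiced” frame, the frame length, and the cut-off n_keep are all assumptions.

```python
import numpy as np

def smooth_envelope(frame, n_keep=30):
    """Log magnitude spectrum and its cepstrally smoothed envelope.

    The real cepstrum is the inverse DFT of the log magnitude spectrum.
    Keeping only the first n_keep low-quefrency coefficients (plus their
    mirror image) retains the filter (envelope) and discards the source.
    """
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    cepstrum = np.fft.ifft(log_mag).real
    liftered = np.zeros_like(cepstrum)
    liftered[:n_keep] = cepstrum[:n_keep]
    liftered[-(n_keep - 1):] = cepstrum[-(n_keep - 1):]  # mirror half
    return log_mag, np.fft.fft(liftered).real  # envelope, log domain

# Toy "voiced" frame: harmonics of 120 Hz with a decaying envelope
fs = 16000
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * k * 120 * t) / k for k in range(1, 40))
log_mag, envelope = smooth_envelope(frame * np.hamming(1024))
```

Plotting log_mag and envelope on the same axes shows the envelope tracing the spectral shape while the harmonic ripple (the source) has been discarded.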

If the series expansion reduces covariance between MFCCs, then does the DFT also result in reduced covariance in the spectrum compared to the waveform?

I cannot find any literature or empirical evidence to answer this question. What we can say is that the Fourier series expansion of the DFT “unpacks” the information in the waveform samples and lays it out along the frequency axis, so that we can – at least visually – attribute aspects to the source (harmonics) and filter (envelope). The cepstral cosine series expansion “unpacks” the spectrum and lays it out along the quefrency axis.
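One could probe the question empirically, though. The sketch below (my own illustration, not from the class) measures how much covariance “energy” lies off the diagonal for raw waveform frames versus their log magnitude spectra, using a toy random-phase harmonic signal as a stand-in for speech:

```python
import numpy as np

def off_diagonal_ratio(X):
    """Fraction of covariance energy lying off the diagonal.

    X has one observation per row; smaller values mean less covariance
    between dimensions, i.e. features closer to being decorrelated.
    """
    C = np.cov(X, rowvar=False)
    off = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)
    return off / np.sum(C ** 2)

# Toy data: frames of a 120 Hz sinusoid with random phase, plus noise
rng = np.random.default_rng(0)
fs, frame_len, n_frames = 16000, 256, 500
t = np.arange(frame_len) / fs
frames = np.stack([
    np.sin(2 * np.pi * 120 * t + rng.uniform(0, 2 * np.pi))
    + 0.1 * rng.standard_normal(frame_len)
    for _ in range(n_frames)
])
spectra = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
print(off_diagonal_ratio(frames), off_diagonal_ratio(spectra))
```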

Why not use an explicit source-filter model (with LPC filter) instead of MFCCs?

A very sensible proposition – we could, and people did in the past: before MFCCs, this was the dominant approach. Solving for the coefficients of the difference equation (given a frame of waveform) involves solving a set of simultaneous equations, and this can be error-prone. Making a hard decision that might be inaccurate at such an early stage is a bad strategy. Also, the difference equation coefficients are numerically unstable (although there are equivalent representations that are stable, such as LSFs – well out of scope for Speech Processing).
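For reference, here is what that set of simultaneous equations looks like in the autocorrelation method of LPC analysis. This is a sketch under my own assumptions (the frame, order, and names are illustrative), not a course reference implementation:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Solve the LPC normal equations R a = r for one windowed frame.

    R is a Toeplitz matrix of autocorrelations; ill-conditioning of
    this system is one way the hard, early decision can go wrong.
    """
    # Autocorrelation at lags 0 .. order
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Solve the symmetric Toeplitz system for the predictor coefficients
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    return a  # frame[n] is predicted as sum_k a[k] * frame[n - 1 - k]

# Toy usage on a synthetic frame
fs = 16000
t = np.arange(400) / fs
print(lpc_coefficients(np.sin(2 * np.pi * 500 * t) * np.hamming(400), 4))
```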

Why not just use the waveform samples as features?

Remember phase! The same sound/phoneme can have radically different-looking waveforms. So, how about the DFT magnitude spectrum? Better (no phase), but it still contains F0 and is also high-dimensional. Maybe OK for neural models, definitely not for Gaussians. A filterbank is even better, but it needs decorrelating – hence MFCCs.
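The progression above (waveform → magnitude spectrum → mel filterbank → MFCCs) can be seen in the feature dimensions alone. A quick sketch, assuming librosa is available (the course may use different tools) and a synthetic signal as a stand-in for speech:

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 120 * t).astype(np.float32)  # stand-in for speech

# High-dimensional magnitude spectrum: phase removed, F0 still present
S = np.abs(librosa.stft(y, n_fft=512))                       # (257, frames)
# Mel filterbank energies: lower-dimensional, but still correlated
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26)  # (26, frames)
# DCT of the log filterbank decorrelates the dimensions: MFCCs
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)
print(S.shape, mel.shape, mfcc.shape)  # dimensionality falls at each step
```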

Can we interpret the MFCCs (or the true cepstrum)?

Not easily – all we can say is that the filter is represented in the lower-quefrency range and the source can be found (for voiced speech) as a small peak at higher quefrency. That’s a possible method for F0 estimation, although not the one most commonly used (out of scope for Speech Processing – see Speech Synthesis in Semester 2).
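As an illustration of that idea, the pitch period shows up as the location of the high-quefrency peak. This is my own minimal sketch, not the estimator used in practice; the search range and names are assumptions:

```python
import numpy as np

def cepstral_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 from the high-quefrency peak of the real cepstrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    cepstrum = np.fft.ifft(log_mag).real
    # Search only quefrencies corresponding to plausible pitch periods
    q_min = int(fs / f0_max)
    q_max = int(fs / f0_min)
    peak = q_min + np.argmax(cepstrum[q_min:q_max])
    return fs / peak  # quefrency (in samples) -> frequency (Hz)

# Toy check: a frame with 10 harmonics of 150 Hz should peak near 150
fs = 16000
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * k * 150 * t) for k in range(1, 11))
print(round(cepstral_f0(frame * np.hamming(1024), fs), 1))
```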

Why do we add deltas to the MFCC features?

The full answer is coming when we talk about HMMs, but it’s to mitigate yet another independence assumption we are about to make: that each feature vector in the observation sequence is independent of the rest. (In the Module 7 live class I did not say “conditionally independent given the HMM state”, but I will be saying that in Module 8.)
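For concreteness, deltas are usually computed as a linear regression over neighbouring frames. A sketch (the window width W=2 is a common choice, but toolkits vary, so treat the details as assumptions):

```python
import numpy as np

def deltas(features, W=2):
    """Compute delta coefficients for a (frames, dims) feature matrix.

    delta[t] = sum_{n=1..W} n * (c[t+n] - c[t-n]) / (2 * sum_n n^2),
    with the sequence padded by repeating the first and last frames.
    """
    padded = np.pad(features, ((W, W), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, W + 1))
    T = len(features)
    return sum(n * (padded[W + n : W + n + T] - padded[W - n : W - n + T])
               for n in range(1, W + 1)) / denom

# Usage: append deltas to static MFCCs to form a 26-dim feature vector
mfccs = np.random.randn(100, 13)          # stand-in for real MFCCs
full = np.hstack([mfccs, deltas(mfccs)])  # shape (100, 26)
```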

Won’t the deltas be highly correlated with the statics?

Please ask this again in Module 8!  
