Filter bank vs. filter coefficients
- This topic has 5 replies, 3 voices, and was last updated 7 years, 2 months ago by Simon.
-
November 7, 2017 at 15:37 #8282
In the Module 6 section (http://www.speech.zone/courses/speech-processing/module-6-speech-recognition-pattern-matching/), it is written that a feature vector could be a spectrum, the outputs of a filterbank, or the coefficients of a source-filter model. What is the difference between a filterbank and filter coefficients? How does the analysis of the input differ between these two methods?
-
November 7, 2017 at 21:33 #8284
In a filterbank, there is a set of bandpass filters (perhaps 20 to 30 of them). Each one selects a range (or a “band”) of frequencies from the signal.
The filters in a filterbank are fixed and do not vary. We, as the system designer, choose the frequency bands – for example, we might space them evenly on a Mel scale, taking inspiration from human hearing.
The feature vector produced by the filterbank has one element per filter: each element contains the amount of energy captured in the corresponding frequency band.
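As a rough illustration of the description above, here is a minimal sketch of computing filterbank energies for a single frame. Several things in it are assumptions made for the example and do not come from the course materials: the triangular filter shape applied to the power spectrum, the Mel formula constants (2595 and 700), the choice of 20 filters, the 16 kHz sampling rate, and the 1 kHz test tone. It uses only the standard library and a naive DFT, which is fine for one short frame but not how a real front end would be written.

```python
import cmath
import math


def hz_to_mel(f):
    # One commonly used Mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Fixed triangular filters, centres evenly spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 band edges, evenly spaced in Mel, converted back to Hz.
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = sample_rate / n_fft
    filters = []
    for j in range(1, num_filters + 1):
        lo, centre, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= centre:
                weights.append((f - lo) / (centre - lo))
            elif centre < f < hi:
                weights.append((hi - f) / (hi - centre))
            else:
                weights.append(0.0)
        filters.append(weights)
    return filters


def filterbank_features(frame, filters):
    """One energy value per filter: the feature vector for this frame."""
    spec = power_spectrum(frame)
    return [sum(w * p for w, p in zip(weights, spec)) for weights in filters]


# Hypothetical input: one 256-sample frame of a 1 kHz sine at 16 kHz.
sample_rate = 16000
frame_length = 256
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate)
         for t in range(frame_length)]

filters = mel_filterbank(20, frame_length, sample_rate)
features = filterbank_features(frame, filters)
strongest = max(range(len(features)), key=lambda i: features[i])
print("feature vector length:", len(features))
print("filter with most energy (0-based):", strongest)
```

Note that `mel_filterbank` is built once and reused for every frame: the filters never change, only the energies they capture do.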
The filter in a source-filter model is a more complex filter than the ones in a filterbank, in two ways:
- it’s not just a simple bandpass filter, but has a more complex frequency response, in order to model the vocal tract transfer function
- it varies over time (it can be fitted to an individual frame of speech waveform)
This filter is inspired not by human hearing, but by speech production.
The simplest type of feature vector derived from the filter in a source-filter model would be a vector containing, in each element, one of the filter’s coefficients. Together, the set of filter coefficients captures the vocal tract transfer function (or, more abstractly, the spectral envelope of the speech signal).
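To make "fitting the filter to a frame" concrete, here is a minimal sketch using linear prediction with the autocorrelation method and the Levinson-Durbin recursion, which is one common way to do it and not necessarily the one the course has in mind. The frame is synthetic: it is the impulse response of a hypothetical two-pole filter with coefficients 1.3 and -0.8, chosen only so that the recovered values can be compared against known ones.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over t of frame[t] * frame[t + lag]."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that x[t] is approximated
    by sum_k a[k] * x[t - k]. Returns (coefficients, residual energy)."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= (1.0 - reflection * reflection)
    return a[1:], error


# Hypothetical "speech frame": impulse response of a known all-pole filter,
# x[t] = 1.3 * x[t-1] - 0.8 * x[t-2], started with a unit impulse.
true_coeffs = [1.3, -0.8]
frame = [1.0, 1.3]
for t in range(2, 400):
    frame.append(true_coeffs[0] * frame[t - 1] + true_coeffs[1] * frame[t - 2])

r = autocorrelation(frame, 2)
coeffs, residual = levinson_durbin(r, 2)
print("estimated filter coefficients:", coeffs)
print("residual energy:", residual)
```

In this sketch the list `coeffs` is the feature vector for the frame. Refitting on the next frame would give a different set of coefficients, which is the sense in which the filter varies over time.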
-
November 8, 2017 at 14:30 #8286
But why are MFCCs better features than filter coefficients? Shouldn’t they both ultimately model the same underlying thing – the shape of the vocal tract at production? I do not see why speech recognition could not equally well build a model from the speaker’s or the listener’s point of view (even human listening is sometimes hypothesised to be based on our own model of/experience with production…)
-
November 8, 2017 at 15:02 #8287
An excellent question. Yes, there are many ways to represent and parameterise the vocal tract frequency response, or more generally the spectral envelope.
Let’s break the answer down into two parts:
1) comparing MFCCs with vocal tract filter coefficients
There are many choices of vocal tract filter. The most common is a linear predictive filter. We could use the coefficients of such a filter as features, and in older papers (e.g., where DTW was the method for pattern matching) we will find that this was relatively common. A linear predictive filter is “all pole” – that means it can only model resonances. That’s a limitation. When we fit the filter to a real speech signal, it will give an accurate representation of the formant peaks, but be less accurate at representing (for example) nasal zeros. In contrast, the cepstrum places equal importance on the entire spectral envelope, not just the peaks.
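The "all pole" limitation can be seen by evaluating the frequency response of such a filter directly. The sketch below assumes the transfer function H(z) = 1 / (1 - sum_k a_k z^-k), and the coefficients 1.3 and -0.8 and the 16 kHz sampling rate are hypothetical values picked for illustration. Because the response is one over a finite polynomial, it can rise to a peak (a resonance) but can never drop to zero, so it has no direct way to represent a spectral zero.

```python
import cmath
import math


def all_pole_magnitude(coeffs, omega):
    """|H(e^{j omega})| for H(z) = 1 / (1 - sum_k a_k z^-k)."""
    denominator = 1.0 - sum(a * cmath.exp(-1j * omega * (k + 1))
                            for k, a in enumerate(coeffs))
    return 1.0 / abs(denominator)


sample_rate = 16000          # assumed sampling rate in Hz
coeffs = [1.3, -0.8]         # hypothetical two-pole filter

# Evaluate the response from 0 Hz to the Nyquist frequency in 10 Hz steps.
freqs = [10.0 * i for i in range(sample_rate // 20 + 1)]
response = [all_pole_magnitude(coeffs, 2 * math.pi * f / sample_rate)
            for f in freqs]

peak_hz = freqs[max(range(len(freqs)), key=lambda i: response[i])]
print("resonance peak near (Hz):", peak_hz)
print("smallest magnitude anywhere:", min(response))
```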
2) comparing MFCCs with filterbank outputs
It is true that MFCCs cannot contain any more information than filterbank outputs, given that they are derived from them.
There must be another reason for preferring MFCCs in certain situations. The reason is that there is less covariance (i.e., correlation) between MFCC coefficients than between filterbank outputs. That’s important when we want to fit a Gaussian probability density function to our data, without needing a full covariance matrix.
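The decorrelation argument can be checked numerically. The sketch below relies on the fact that MFCCs are obtained by applying a discrete cosine transform (DCT) to log filterbank energies, and compares the average correlation between elements before and after that transform. The data is entirely synthetic: each "frame" is a stand-in for 12 log filterbank energies in which neighbouring channels are correlated through a first-order model with rho = 0.9. That model, the channel count, and the number of frames are assumptions made for the example, not measurements from speech.

```python
import math
import random


def dct_ii(values):
    """Unnormalised DCT-II, the transform used to go from log filterbank
    energies to cepstral coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (j + 0.5) / n)
                for j, v in enumerate(values))
            for k in range(n)]


def correlation_matrix(frames):
    dim = len(frames[0])
    count = len(frames)
    means = [sum(f[d] for f in frames) / count for d in range(dim)]
    centred = [[f[d] - means[d] for d in range(dim)] for f in frames]
    cov = [[sum(c[a] * c[b] for c in centred) / count for b in range(dim)]
           for a in range(dim)]
    return [[cov[a][b] / math.sqrt(cov[a][a] * cov[b][b])
             for b in range(dim)] for a in range(dim)]


def mean_abs_off_diagonal(matrix):
    dim = len(matrix)
    values = [abs(matrix[a][b])
              for a in range(dim) for b in range(dim) if a != b]
    return sum(values) / len(values)


def synthetic_log_energies(rng, num_channels, rho):
    """Synthetic stand-in for one frame of log filterbank energies:
    each channel is correlated with its neighbour (assumed model)."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(num_channels - 1):
        x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return x


rng = random.Random(0)
frames = [synthetic_log_energies(rng, 12, 0.9) for _ in range(1000)]
cepstra = [dct_ii(f) for f in frames]

fbank_corr = mean_abs_off_diagonal(correlation_matrix(frames))
cep_corr = mean_abs_off_diagonal(correlation_matrix(cepstra))
print("mean |correlation|, filterbank channels:", round(fbank_corr, 3))
print("mean |correlation|, after DCT:", round(cep_corr, 3))
```

The smaller the off-diagonal correlations, the less is lost by modelling each element with its own variance (a diagonal covariance matrix) instead of estimating a full covariance matrix.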
You also make a good point that we can seek inspiration from either speech production or speech perception. In fact, we could use ideas from both in a single feature set – an example of that would be Perceptual Linear Prediction (PLP) coefficients. This is beyond the scope of Speech Processing, where we’ll limit ourselves to filterbank outputs and MFCCs.
-
November 8, 2017 at 15:31 #8288
Cool, thank you, that is a super interesting topic!
Just another follow-up to this: Holmes and Holmes (2001, 159) write: “it seems desirable not to use features of the acoustic signal that are not used by human listeners, even if they are reliably present in human productions”. Why this limitation? If the machine can “hear” and interpret it (as in, use it to get a better classification accuracy), why does it matter whether humans can?
They give this reason: “because they may be distorted by the acoustic environment or electrical transmission path without causing the perceived speech quality to be impaired”. Not a very convincing argument to me. The same would apply to features we ARE using I would say.
I understand why “the human model” is good for inspiration, but why limit ourselves? Humans can use a lot of information a machine can’t; maybe there are some benefits in exploiting the specific strengths of the machine (e.g., greater sensitivity to different frequencies, ability to measure phase, doesn’t get distracted…) to make up for that?
-
November 8, 2017 at 15:48 #8289
I don’t find Holmes & Holmes’ argument about transmission channels very convincing either.
Their point is that machines should not be able to “hear” something that humans cannot, and that might turn out to be a good idea when it comes to privacy and security of voice-enabled devices. Here’s one reason:
and here’s another more extreme form of attack on ASR systems.