Forum Replies Created
Seems that grad student has left and their home page has gone. I can’t find a good replacement – let me know if you come across anything.
Harmonic spacing: the interval (“distance”, in frequency) between two adjacent harmonics in voiced speech. This interval is equal to F0, since there is a harmonic at every integer multiple of F0. For example, if F0 = 100 Hz, the harmonics lie at 100, 200, 300 Hz, and so on, so the spacing is 100 Hz.
Effective filter bandwidth: the cochlea can be thought of as a filterbank – a set of bandpass filters. The centre frequencies of the filters are not evenly spaced on a linear frequency scale: they get more widely spaced at higher frequencies. The filters also have a larger bandwidth (“width”) at higher frequencies.
For the filters at higher frequencies, this bandwidth is greater than the harmonic spacing. Therefore, the cochlea cannot resolve the harmonics at higher frequencies. Rather, the cochlea only captures the overall spectral envelope.
These facts about human hearing are the inspiration for the Mel-scale triangular filterbank commonly used as part of the sequence of processes for extracting MFCCs from speech.
The cepstrum is invertible because each operation is invertible (e.g., the Fourier transform). However, MFCCs are not exactly the same as the cepstrum:
Normally, we discard phase and only retain the magnitude spectrum, from which we compute the (magnitude) cepstrum.
If we use a filterbank to warp the spectrum to the Mel scale (which is the normal method in ASR), this is not invertible: information is lost when we sum the energy within each filter.
For ASR, we truncate the cepstrum, retaining only the first (say) 12 coefficients. This is a loss of information.
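A minimal numpy sketch of that sequence of processes may help (this is my own simplified version, not HTK’s exact implementation – filter shapes, normalisation and constants vary between toolkits). Each of the three lossy steps is marked with a comment:

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters, with centre frequencies equally spaced on the Mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fbank

def mfcc(frame, sr, n_filters=26, n_ceps=12):
    spectrum = np.abs(np.fft.rfft(frame))  # discard phase: not invertible
    energies = mel_filterbank(n_filters, len(frame), sr) @ spectrum ** 2  # sum energy within each filter: not invertible
    return dct(np.log(energies + 1e-10), norm='ortho')[:n_ceps]  # truncate: loses spectral detail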
Questions to think about:
- why do we discard phase?
- why do we use a filterbank?
- why do we truncate the cepstrum?
In this one, very special case, where P(W) is a constant (i.e., it does not depend on W), we could indeed omit it; but in the more general case, where P(W) varies for different W, we must of course compute it.
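In symbols (the standard Bayes decision rule – the notation here is mine, not from the original question):

\hat{W} = \underset{W}{\arg\max}\; P(O \mid W)\, P(W)

If P(W) is the same constant for every candidate word sequence W, it scales every score equally, so dropping it leaves the argmax unchanged; if it varies with W, it can change which W wins, so it must be computed.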
Simon
To “force” the model to generate one particular sequence, all we can do is evaluate the probability that it generated that sequence. This is exactly what the decoding algorithm must do: it “decodes” how the model generated the given observation sequence. Typically, we will make an approximation, such as only decoding the single most likely state sequence.
I think perhaps you are being tempted to think of ASR as a pipeline of processes (the dreaded “flowchart” view)? That view leads us into thinking that certain things happen in certain “modules” and other things in other “modules”. Let’s try a different view, in which we only make the standard machine learning split into two phases: “training” and “testing (or recognition)”. Your question is about the second one, and assumes we have a fully-trained model.
We have a single, generative model of spoken utterances. The model can randomly generate any and all possible observation sequences. When the model generates a particular observation sequence, we can compute quantities such as the likelihood (just call that “probability” for now), the most likely state sequence, most likely word sequence, and so on.
Given a particular observation sequence to be recognised, we force our model to generate that particular sequence. We record the most likely word sequence and announce that as the result.
So, all we need is
a) the generative model – this will be a combination of the acoustic model and language model
b) one or more algorithms for computing quantities we are interested in – one of these algorithms will be called the decoding algorithm
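As a rough illustration of (b), here is a minimal sketch of one decoding algorithm (Viterbi) for a single HMM, assuming log-domain parameters; the variable names are mine, not HTK’s:

import numpy as np

def viterbi(log_A, log_B, log_pi):
    # log_A: (S, S) transition log-probs; log_B: (T, S) per-frame output
    # log-probs; log_pi: (S,) initial state log-probs
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # trace back the single most likely state sequence
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())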
Time to develop your debugging skills…
Two possibilities to get you started
1. you are not correctly loading multiple MLFs – post your full command line here and I’ll check it
2. there is a formatting error in one of the MLFs – how might you efficiently figure out which one that is? More generally, how might you sanity check an individual user’s data, before deciding to include it in one of your experiments?
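One hypothetical starting point for such a check (assuming the simple label-only MLF layout used in the exercise: a #!MLF!# header, then blocks consisting of a quoted filename, some labels, and a terminating “.” line):

def check_mlf(path):
    # very rough sanity check of one Master Label File
    with open(path) as f:
        lines = [line.strip() for line in f]
    if not lines or lines[0] != "#!MLF!#":
        return "missing #!MLF!# header"
    in_entry = False
    for n, line in enumerate(lines[1:], start=2):
        if not line:
            continue
        if not in_entry:
            if not (line.startswith('"') and line.endswith('"')):
                return "line %d: expected a quoted filename" % n
            in_entry = True
        elif line == ".":
            in_entry = False
    return None if not in_entry else "file ends in the middle of an entry"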
Tip: add this command to a shell script to make the shell print out each command, with the variables all replaced by their actual values, just before executing it – this can be helpful in debugging:
set -x
You can turn that behaviour off again with
set +x
So to debug just part of a shell script, wrap it like this:
... script does stuff here
set -x
HResults ....
set +x
..... script continues here
Indeed, textbooks often suggest that you imagine the frequency axis to be time, then treat the FFT spectrum as a waveform. That’s fine, but we are smart people and know that the Fourier transform doesn’t only apply to time-domain signals: the horizontal axis can be labelled with anything we like.
You are worried that the cepstrum will fail to accurately capture high peaks in the spectrum. That’s a legitimate concern. First, we can state that the cepstrum derived from the log magnitude spectrum will faithfully capture every detail, if we use enough cepstral coefficients.
Your concern becomes relevant when we use (say) only the first 12 coefficients. When we do this (i.e., truncate the cepstrum), we are making an assumption about the shape of the spectral envelope. The fewer coefficients we use, the “smoother” we assume the envelope is.
The solution is empirical: try different numbers of cepstral coefficients and choose the number that works best (e.g., gives lowest WER in our speech recogniser).
For ASR, 12 coefficients is just right.
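A minimal sketch of that effect, using the plain (non-Mel) cepstrum for simplicity: reconstruct the log magnitude spectrum from a truncated cepstrum, and notice that the fewer coefficients we keep, the smoother the reconstructed envelope becomes.

import numpy as np
from scipy.fft import dct, idct

def cepstral_smooth(frame, n_keep=12):
    # log magnitude spectrum -> cepstrum -> truncate -> smoothed log spectrum
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cep = dct(log_mag, norm='ortho')
    cep[n_keep:] = 0.0  # truncation: assume a smooth envelope
    return idct(cep, norm='ortho')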
You could experiment with this number in the digit recogniser exercise. Just be careful to not store anything in the shared directory (everything there must use the original parameterisation) and to do everything in your own workspace. This will involve modifying the
make_mfccs
script as well as the
CONFIG_for_coding
file. If you do this experiment, talk to the tutor first. Do it for a speaker-independent system with nice large training and testing sets.

Regrettably, remote access is difficult to provide. The machines are frequently switched between Mac OS and a virtual Windows installation, which makes remote login impractical.
Although we do not have the resources to support you, the Build your own digit recogniser exercise should be relatively easy to set up on your own machine, especially on Mac or Linux. You would need to take a copy of the data from the shared folder. It is OK to copy the labels, the MFCC files, and the
info.txt
file only; do not copy the waveforms (they contain personal information). After the course is over, you must delete the data.

Correct, the local distance in DTW is the geometric distance between the pair of feature vectors at a given point in the grid.
We hope that the total distance (usually denoted D), which is the sum of local distances along the path, will be lowest for the template that actually corresponds to what was said in the unknown recording.
For a single, given unknown word, DTW is repeated once for every template. In each case, DTW finds the best path that aligns the unknown word with the current template being tried. This results in a separate value for D for each template. We then compare all those D values and pick the lowest.
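A minimal sketch of that procedure (assuming template and unknown are numpy arrays of feature vectors, one row per frame, and Euclidean local distance):

import numpy as np

def dtw(template, unknown):
    # total distance D: sum of local distances along the best path
    T, U = len(template), len(unknown)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            local = np.linalg.norm(template[i - 1] - unknown[j - 1])
            D[i, j] = local + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, U]

# repeat once per template, then pick the lowest D:
# best = min(templates, key=lambda t: dtw(t, unknown))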
At this point in the course, it is indeed a little mysterious what is in the feature vectors. There’s a good reason for keeping you all in suspense: we need to know more about the generative model before making a final decision about the feature vectors.
In other words, we cannot do our feature engineering correctly until we know exactly what properties the generative model has. Specifically, we will need to know its limitations (what it cannot model).
So, for now, let us pretend that the feature vector contains one of these possible sets of features:
- the FFT coefficients, or
- the formant frequencies, or
- the energy in different frequency bands (a “filterbank”)
The mystery will be solved within a few lectures, when we will learn about MFCCs.
Custom shortcode for video player to allow changing away from video.js in the future and/or adding speed control.
Add a “show me all unread posts” feature to the forum.
Fix formatting of search results (e.g., http://www.speech.zone/?s=speech) to correctly align images, titles and excerpts. Use smaller (or no) images. Just requires CSS tweaks.
Formatting should probably be similar to that of archive pages.
Yes, in effect, this is what we will do when we move from measuring distance in a vector space to using generative models.
In fact, we can do a lot more than just weighting each dimension. We can perform feature engineering to transform the feature space, such that our problem (e.g., classification) becomes easier.
In foundation lecture 6, we will take a first look at these ideas. We will then continue this topic in main lecture 7 when we engineer features (MFCCs) that work well with our chosen generative model (a Gaussian probability density function).
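As a small, illustrative sketch of the connection (my example, not taken from the lectures): weighting each dimension of a Euclidean distance is closely related to evaluating a diagonal-covariance Gaussian, where dimensions with larger variance effectively receive a smaller weight.

import numpy as np

def weighted_distance(x, y, w):
    # Euclidean distance with a per-dimension weight
    return np.sqrt(np.sum(w * (x - y) ** 2))

def diag_gaussian_loglik(x, mean, var):
    # log-likelihood under a diagonal-covariance Gaussian:
    # the (x - mean)^2 / var term is a weighted squared distance
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)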