Forum Replies Created
To “force” the model to generate one particular sequence, all we can really do is evaluate the probability that it generated that sequence. This is exactly what the decoding algorithm must do: it “decodes” how the model generated the given observation sequence. Typically, we will make an approximation, such as only decoding the single most likely state sequence.
I think perhaps you are being tempted to think of ASR as a pipeline of processes (the dreaded “flowchart” view)? That view leads us into thinking that certain things happen in certain “modules” and other things in other “modules”. Let’s try a different view, in which we only make the standard machine learning split into two phases: “training” and “testing (or recognition)”. Your question is about the second one, and assumes we have a fully-trained model.
We have a single, generative model of spoken utterances. The model can randomly generate any and all possible observation sequences. When the model generates a particular observation sequence, we can compute quantities such as the likelihood (just call that “probability” for now), the most likely state sequence, most likely word sequence, and so on.
Given a particular observation sequence to be recognised, we force our model to generate that particular sequence. We record the most likely word sequence and announce that as the result.
So, all we need is
a) the generative model – this will be a combination of the acoustic model and language model
b) one or more algorithms for computing quantities we are interested in – one of these algorithms will be called the decoding algorithm
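To make (b) a little more concrete, here is a minimal sketch of Viterbi decoding for a toy HMM with discrete observations, written in Python. Everything here (the two states, the symbols, the probabilities) is invented for illustration; a real recogniser such as HTK also searches over word sequences and uses continuous observations, but the core idea of finding the single most likely state sequence is the same.
import numpy as np

# Toy HMM with invented parameters, just to illustrate what a decoding algorithm computes.
# Two states; observations are discrete symbols 0, 1, 2.
A = np.array([[0.7, 0.3],       # transition probabilities between states
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # emission probabilities of each symbol, per state
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # initial state probabilities

def viterbi(obs):
    # delta[t, j] = probability of the best state sequence that ends in state j at time t
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] * B[j, obs[t]]
    # trace back the single most likely state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

states, prob = viterbi([0, 1, 2, 2])
print(states, prob)   # most likely state sequence, and the probability along that path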
Time to develop your debugging skills…
Two possibilities to get you started:
1. you are not correctly loading multiple MLFs – post your full command line here and I’ll check it
2. there is a formatting error in one of the MLFs – how might you efficiently figure out which one that is? More generally, how might you sanity check an individual user’s data, before deciding to include it in one of your experiments?
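On possibility 2, one approach is a small script that checks every user’s MLF for the basics before you include it. Here is a rough sketch in Python; the directory layout and filename pattern are assumptions, so adapt them to wherever your per-user MLFs actually live. It relies on two properties of HTK MLFs: the file should start with a #!MLF!# header, and each label entry should end with a line containing only a full stop.
import glob

# Rough sanity check for HTK Master Label Files (MLFs).
# The path pattern below is an assumption - change it to match your own layout.
for mlf in sorted(glob.glob("data/*/labels.mlf")):
    with open(mlf) as f:
        lines = [l.rstrip("\n") for l in f]
    problems = []
    if not lines or lines[0] != "#!MLF!#":
        problems.append("missing #!MLF!# header")
    # Each entry starts with a quoted filename pattern and ends with a lone full stop.
    n_entries = sum(1 for l in lines if l.startswith('"'))
    n_terminators = sum(1 for l in lines if l == ".")
    if n_entries != n_terminators:
        problems.append(f"{n_entries} entries but {n_terminators} terminating '.' lines")
    print(mlf, "OK" if not problems else "; ".join(problems))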
Tip: add this command to a shell script to make the shell print out each command, with all variables replaced by their actual values, just before executing it – this can be helpful in debugging:
set -x
You can turn that behaviour off again with
set +x
So to debug just part of a shell script, wrap it like this:
... script does stuff here
set -x
HResults ....
set +x
..... script continues here
Indeed, textbooks often suggest that you imagine the frequency axis to be time, then treat the FFT spectrum as a waveform. That’s fine, but we are smart people and know that the Fourier transform doesn’t only apply to time-domain signals: the horizontal axis can be labelled with anything we like.
You are worried that the cepstrum will fail to accurately capture high peaks in the spectrum. That’s a legitimate concern. First, we can state that the cepstrum derived from the log magnitude spectrum will faithfully capture every detail, if we use enough cepstral coefficients.
Your concern becomes relevant when we use (say) only the first 12 coefficients. When we do this (i.e., truncate the cepstrum), we are making an assumption about the shape of the spectral envelope. The fewer coefficients we use, the “smoother” we assume the envelope is.
The solution is empirical: try different numbers of cepstral coefficients and choose the number that works best (e.g., gives lowest WER in our speech recogniser).
For ASR, 12 coefficients is just right.
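If you want to see what truncation does before running a full experiment, here is a rough numpy sketch of cepstral smoothing of a single frame. The frame itself is a toy signal, and taking the inverse FFT of the log magnitude spectrum is a simplification of the real MFCC recipe (there is no mel filterbank here), so treat the numbers as illustrative only: keeping just 12 coefficients should give a smoother, less exact fit to the log spectrum than keeping 60.
import numpy as np

# Illustration only: cepstral smoothing of one 25 ms frame of a toy signal.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)   # invented "speech-like" frame

log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(frame.size))) + 1e-10)
cepstrum = np.fft.irfft(log_mag)   # cepstrum = inverse Fourier transform of the log magnitude spectrum

def smoothed_envelope(cepstrum, n_coeffs):
    # Truncate (lifter) the cepstrum, then transform back to the log spectral domain.
    liftered = np.zeros_like(cepstrum)
    liftered[:n_coeffs] = cepstrum[:n_coeffs]
    liftered[-(n_coeffs - 1):] = cepstrum[-(n_coeffs - 1):]   # keep the mirror-image half too
    return np.fft.rfft(liftered).real

rough = smoothed_envelope(cepstrum, 12)     # very smooth envelope
detailed = smoothed_envelope(cepstrum, 60)  # much closer to the original log spectrum
print(np.mean((rough - log_mag) ** 2), np.mean((detailed - log_mag) ** 2))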
You could experiment with this number in the digit recogniser exercise. Just be careful not to store anything in the shared directory (everything there must use the original parameterisation) and to do everything in your own workspace. This will involve modifying the
make_mfccs
script as well as the
CONFIG_for_coding
file. If you do this experiment, talk to the tutor first. Do it for a speaker-independent system with nice large training and testing sets.
Regrettably, remote access is difficult to provide. The machines are frequently switched between Mac OS and a virtual Windows installation, which makes remote login impractical.
Although we do not have the resources to support you, the Build your own digit recogniser exercise should be relatively easy to set up on your own machine, especially on Mac or Linux. You would need to take a copy of the data from the shared folder. It is OK to copy the labels, the MFCC files, and the
info.txt
file only; do not copy the waveforms (they contain personal information). After the course is over, you must delete the data.
Correct, the local distance in DTW is the geometric distance between the pair of feature vectors at a given point in the grid.
We hope that the total distance (usually denoted D), which is the sum of local distances, will be lowest for the template that actually corresponds to what was said in the unknown word.
For a single, given unknown word, DTW is repeated once for every template. In each case, DTW finds the best path that aligns the unknown word with the current template being tried. This results in a separate value for D for each template. We then compare all those D values and pick the lowest.
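Here is a minimal sketch of that procedure in Python. The “feature vectors” are random numbers standing in for real features, and the step pattern (arriving from the left, from below, or diagonally) is just one common choice, so treat this as an illustration of the bookkeeping rather than a reference implementation.
import numpy as np

def dtw_distance(unknown, template):
    # unknown and template are arrays of shape (number of frames, number of features)
    T, U = len(unknown), len(template)
    # local[i, j] = distance between frame i of the unknown and frame j of the template
    local = np.linalg.norm(unknown[:, None, :] - template[None, :, :], axis=-1)
    D = np.full((T, U), np.inf)   # D[i, j] = total distance of the best path reaching (i, j)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(U):
            if i == 0 and j == 0:
                continue
            # one common step pattern: arrive from the left, from below, or diagonally
            best_prev = min(D[i - 1, j] if i > 0 else np.inf,
                            D[i, j - 1] if j > 0 else np.inf,
                            D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = local[i, j] + best_prev
    return D[-1, -1]   # total distance D for the best alignment

# Toy example: run DTW once per template and pick the template with the lowest D.
rng = np.random.default_rng(0)
unknown = rng.normal(size=(50, 12))   # stand-in feature vectors for the unknown word
templates = {word: rng.normal(size=(45 + 5 * k, 12)) for k, word in enumerate(["one", "two", "three"])}
distances = {word: dtw_distance(unknown, t) for word, t in templates.items()}
print(min(distances, key=distances.get), distances)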
At this point in the course, it is indeed a little mysterious what is in the feature vectors. There’s a good reason for keeping you all in suspense: we need to know more about the generative model before making a final decision about the feature vectors.
In other words, we cannot do our feature engineering correctly until we know exactly what properties the generative model has. Specifically, we will need to know its limitations (what it can not model).
So, for now, let us pretend that the feature vector contains one of these possible sets of features:
- the FFT coefficients, or
- the formant frequencies, or
- the energy in different frequency bands (a “filterbank”)
The mystery will be solved within a few lectures, when we will learn about MFCCs.
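As a taster, the third option in the list above is easy to compute yourself. This sketch uses equally-spaced rectangular bands purely for simplicity; it is not the mel-spaced triangular filterbank that MFCCs are built on.
import numpy as np

def filterbank_energies(frame, num_bands=20):
    # Energy in equally-spaced frequency bands of one windowed frame (illustration only).
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(power, num_bands)
    return np.array([band.sum() for band in bands])

frame = np.random.randn(400)        # stand-in for one 25 ms frame at 16 kHz
print(filterbank_energies(frame))   # one 20-dimensional feature vector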
Custom shortcode for video player to allow changing away from video.js in the future and/or adding speed control.
Add a “show me all unread posts” feature to the forum.
Fix formatting of search results (e.g., http://www.speech.zone/?s=speech) to correctly align images, titles and excerpts. Use smaller (or no) images. Just requires CSS tweaks.
Formatting should probably be similar to that of archive pages.
Yes, in effect, weighting each dimension when measuring the local distance is what we will do when we move from measuring distance in a vector space to using generative models.
In fact, we can do a lot more than just weighting each dimension. We can perform feature engineering to transform the feature space, such that our problem (e.g., classification) becomes easier.
In foundation lecture 6, we will take a first look at these ideas. We will then continue this topic in main lecture 7 when we engineer features (MFCCs) that work well with our chosen generative model (a Gaussian probability density function).
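As a simple illustration of what weighting each dimension might look like in practice, here is a sketch where each dimension is weighted by the inverse of its variance, estimated from some (randomly generated) training vectors. This particular weighting is my choice for illustration rather than the method used later in the course, although it is closely related to what a Gaussian with a diagonal covariance matrix does.
import numpy as np

def weighted_euclidean(x, y, weights):
    # Distance where each dimension contributes according to its weight.
    return np.sqrt(np.sum(weights * (x - y) ** 2))

# Estimate weights from some training feature vectors:
# dimensions with large variance get a smaller weight.
rng = np.random.default_rng(1)
training = rng.normal(scale=[1.0, 5.0, 0.2], size=(1000, 3))
weights = 1.0 / training.var(axis=0)

x, y = training[0], training[1]
print(np.linalg.norm(x - y))              # plain Euclidean distance
print(weighted_euclidean(x, y, weights))  # variance-weighted distance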
Festival alone cannot automatically label files. All it can do is process text through the front end to get a linguistic specification, which includes the sequence of phones.
The alignment is generally done using HMMs, often with the HTK toolkit.
For a language not supported by Festival, you need to use a TTS front-end, or be able to convert text into a string of phones some other way (e.g., by dictionary lookup). After that, the alignment step is the same as for English.
Results of the poll: of students who expressed a preference, 73% prefer the current room with its arrangement around group tables.
You could imagine doing speech recognition by measuring the cosine similarity between feature vectors. But this is not the usual way.
We typically use a generative model (the Gaussian, or Normal, probability density function) of feature vectors, within a generative model of sequences (a Hidden Markov Model).
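For a first taste of what that means in practice, here is a minimal sketch of evaluating the log probability density of a feature vector under a Gaussian with a diagonal covariance matrix. The dimensionality and the parameters are invented; in a real recogniser there is (at least) one such density per HMM state, with parameters learned from training data.
import numpy as np

def log_gaussian(x, mean, var):
    # Log density of a multivariate Gaussian with a diagonal covariance matrix.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Invented parameters for a 12-dimensional feature vector.
mean = np.zeros(12)
var = np.ones(12)
x = np.random.default_rng(2).normal(size=12)
print(log_gaussian(x, mean, var))   # larger value = this feature vector is more probable under the model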
Great question! Fourier analysis decomposes any signal into a sum of simple signals (called basis functions): sine waves, each with a frequency, magnitude and phase.
Since sine waves are periodic, Fourier analysis can surely only be applied to periodic signals, can’t it? Correct. At least, only to signals that we assume are periodic.
Short-term analysis
For a signal such as speech, where the spectral envelope changes over time, we must always use short-term analysis techniques. That means taking a frame of the signal (typically 25ms) and making some assumptions about the signal within that frame.
We will assume that the spectrum doesn’t change at all within the frame: the signal is “stationary”.
Assumption that the signal is periodic
To apply Fourier analysis, we make another assumption: the signal is periodic. In the case of short-term analysis, the Fourier analysis effectively assumes that the frame of signal is repeated over and over, before and after the frame.
Even for aperiodic sounds like fricatives, we effectively turn them into signals that repeat with a period of one frame. Since the frequency resolution of the Fourier transform is limited by the duration of the frame, we don’t actually see this “assumed periodicity” in the resulting spectrum: it’s at a frequency lower than we can resolve.
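The following sketch shows the mechanics of short-term analysis described above. The 25 ms frame, 10 ms shift and Hamming window are typical choices rather than the only possible ones; the tapered window reduces the artefacts caused by the implicit assumption that each frame repeats.
import numpy as np

def short_term_spectra(signal, fs=16000, frame_ms=25, shift_ms=10):
    # Cut the signal into overlapping frames, window each one, and take its magnitude spectrum.
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)   # shape: (number of frames, number of frequency bins)

signal = np.random.randn(16000)   # one second of a stand-in "signal"
print(short_term_spectra(signal).shape)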