Forum Replies Created
First, you can and should only run make_mfccs for your own data. (The only exception would be an advanced experiment varying the parameterisation, and you should talk that over with me in a lab session before attempting it).

There are ongoing permissions problems on the server, and so I’ve reset them again. Please try again and report back.
Each model is of one digit. There are always 10 models (look in the models directory to see them). This is the same, regardless of what data you train the models on.
You need to run HInit (and then HRest) once for each model you want to train. The basic scripts do this already for you, using a for loop, and you’ll keep that structure for speaker-independent experiments too.

November 17, 2019 at 16:00 in reply to: Sanity-checking script returns higher WER than run_experiments #10260

HVite -T 1 -C resources/CONFIG \
  -d models/hmm1 \
  -l rec \
  -w resources/grammar_as_network \
  resources/dictionary \
  resources/word_list \
  -S file_lists/testingscriptfile.scp \
is wrong – the -S argument must go before the dictionary and word_list. You should be getting an error message from this that you have missed, I suspect. Also, remove the final “\”, because that means “continues on the next line”.
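With those two fixes applied, the command should look something like this (a sketch only, keeping the same file names and paths as above):

HVite -T 1 -C resources/CONFIG \
  -d models/hmm1 \
  -l rec \
  -w resources/grammar_as_network \
  -S file_lists/testingscriptfile.scp \
  resources/dictionary \
  resources/word_list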
Even if that fixes the problem, you should still add more error checking to all your scripts, so you can be confident that everything is working correctly.
You’ve done one obvious thing: deleting all trained HMMs in both hmm0 and hmm1 before every experiment. You can also delete everything in the rec directory.

Then start checking things such as:
- Training is finding the expected number of examples of each digit, in both training phases
- You are using the exact same data for HInit and HRest
- Recognition is processing the expected number of files
- Results scoring is finding the expected number of examples of each digit
You really should get identical results, so it’s worth tracking this bug down before proceeding.
(It might be a coincidence, but 10 is twice 5 and 3.33 is twice 1.67.)
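A few quick counts along these lines can confirm the numbers at each stage (a sketch only – the file names are taken from the commands above, so adjust them to match your own scripts):

# how many test files is recognition being asked to process?
wc -l file_lists/testingscriptfile.scp
# how many recognition outputs were actually produced?
ls rec/*.rec | wc -l
# how many trained models exist after each training phase? (expect one per digit)
ls models/hmm0 | wc -l
ls models/hmm1 | wc -l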
Neither.
Think about what it means to train a model: to estimate its parameters from some data. If you want to model all the data, then the training algorithm will need all of that data at once.
So, to train a model of any given digit, on any data set (whether from one person or many), we run HInit once, and HRest once, loading all of the data. We repeat that for each digit model we need (there are for loops in the scripts that do that).
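The loop structure looks roughly like this (a sketch only – the word list, paths and options here are assumptions, not the exact contents of the provided scripts):

# one HInit plus one HRest per digit model, each loading all of the training data at once
for WORD in zero one two three four five six seven eight nine ; do
  HInit -T 1 -C resources/CONFIG -l $WORD -M models/hmm0 -o $WORD \
    -L ${DATA}/lab/train models/proto/$PROTO ${DATA}/mfcc/train/*_train.mfcc
  HRest -T 1 -C resources/CONFIG -l $WORD -M models/hmm1 \
    -L ${DATA}/lab/train models/hmm0/$WORD ${DATA}/mfcc/train/*_train.mfcc
done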
The figure uses a non-linear frequency scale on the vertical axis.
The lower frequency channels are closer together on a linear frequency scale and this is what the figure shows.
Compare the 3 channels covering the lower frequency range of 0 Hz to 1 kHz (which is a range of 1 kHz) with the 3 channels covering the higher frequency range from 3 kHz to 5 kHz (which is a range of 2 kHz). The ones in the lower frequency range are closer together (i.e., there are more “channels per kHz”) than those in the higher range.
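(Assuming this is the usual mel-spaced filterbank, the non-linear scale is the mel scale; a common formula is mel(f) = 2595 log10(1 + f/700), which is roughly linear below about 1 kHz and roughly logarithmic above it – hence the closer spacing of the lower channels.)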
Yes, you can use whatever language you like – we will not be marking your code, only your lab report.
Hint: you will probably want to use some HTK tools as part of the sanity checking. You can of course run those from Python.
November 15, 2019 at 19:20 in reply to: ./scripts/results returns "unable to open label file" inconsistently #10239

HTK looks for the reference labels first in the MLF(s) you provide, and failing that in the same directory as the .rec files that it will compare to those reference labels.

First, make sure you are loading the correct MLF.
This error indicates that HTK didn’t find s2002376_test01.lab in any of the MLFs you provided. That will be because of a formatting error in your MLF. Fix that, re-run make_mfccs (which copies it to the data directory) and try again.

Take a look at other MLFs (e.g., the one for simonk) in the main data directory for reference and look for differences in yours.
directory for reference and look for differences in yours.Common mistakes include: missing quotation marks around each filename, incorrect extension on the filename, typos in the filename, one or more spaces after any of the periods, or a missing newline at the end of the file.
November 15, 2019 at 07:54 in reply to: ./scripts/results returns "unable to open label file" inconsistently #10228

Solve the error with HVite first.

Unable to open label file rec/kateb_test?(_)??.rec

means there were no .rec files matching that pattern. If recognition fails, then there will be no output (which HVite saves in .rec files) for HResults to score.
to score.The
HVite
error about failing to open the audio device suggests that you are trying to run “live” recognition instead of using a stored.mfcc
file.Use
set -x
in your script and it will print out the completeHVite
command that is being run. This will help you spot what arguments are incorrect or missing. Perhaps you are not providing an.mfcc
file?I suspect you might have tried to comment out a line like this:
HInit -T 1 \
  -G ESPS \
  -m 1 \
  -C resources/CONFIG \
  -l $WORD \
  -M models/hmm0 \
  -o $WORD \
  -L ${DATA}/lab/train \
  models/proto/$PROTO \
  ${DATA}/mfcc/train/simonk_train.mfcc \
  # ${DATA}/mfcc/train/pilar_train.mfcc \
  ${DATA}/mfcc/train/jason_train.mfcc
which is not allowed – it would be like trying to comment out the middle of a single long line.
Or, perhaps you have a space after one of your \ which is also not allowed – this will mark the end of the command:

HInit -T 1 \
  -G ESPS \
  -m 1 \
  -C resources/CONFIG \
  -l $WORD \
  -S file_lists/myscriptfile.scp \    <-- IS THERE A SPACE HERE ?
  -M models/hmm0 \
  -o $WORD \
  -L ${DATA}/lab/train \
  models/proto/$PROTO
Using set -x in a shell script will cause all commands to be printed out in full, with all their arguments expanded. So, you can find the command being run and you will probably find that it is missing some of the arguments.
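For example, near the top of your script (a minimal sketch):

#!/bin/sh
set -x   # echo every command, with its arguments filled in, before running it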
Yes, it’s as simple as taking the first 12 values starting from the left on the horizontal axis. We call this operation “truncation”.
They are ordered (the lower ones capture broad spectral shape, higher ones capture finer and finer details), so this is theoretically well-motivated.
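(In HTK, the number of coefficients kept is set by the NUMCEPS configuration variable, so truncating to 12 corresponds to NUMCEPS = 12 in whichever CONFIG file your feature extraction uses.)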
The DCT provides us with a representation (a set of coefficients) that has less correlation within its dimensions (the coefficients) than the filterbank outputs have between themselves.
This representation is called the cepstrum, which is in a different domain to the spectrum (which is of course in the frequency domain). We can plot the cepstrum – for example Taylor (2009) figure 12.11(c). The horizontal axis is actually time, which is what happens when you take the DCT (or the very similar inverse Fourier transform) of something in the frequency domain. But this is not the key point.
Key points to note in Taylor’s figure are
- the lower order cepstral coefficients have larger magnitudes than the higher order ones: this is telling us that we can make a good approximation to the original spectrum using just a few coefficients – that is, by truncating the cepstrum; Taylor suggests about 30 coefficients are needed, but for ASR we usually go lower, to 12
- there is a small peak in the cepstrum at about 120 samples – this corresponds to the fundamental period, T0 (= 1/F0); this tells us that truncating the cepstrum will effectively remove evidence of F0, which is what we want to do for ASR
Taylor is separating it out here because he is trying to show how the equations align with the physics of sound propagation in the vocal tract.
Lip radiation can be assumed a constant effect: effectively, a filter that boosts high frequencies. This filtering effect is independent of the configuration of the articulators (Taylor, 2009, equation 11.29).
Furthermore, the constant high-pass filtering effect of lip radiation is more than cancelled out by another constant effect of low-pass filtering at the sound source:
It is this, combined with the radiation effect, that gives all speech spectra their characteristic spectral slope.
(Taylor, 2009, page 332)
So, we don’t need any learnable model parameters for these effects. We can account for them either by absorbing this constant effect into the vocal tract filter (which might be modelled using linear prediction) or by pre-emphasising the signal in the time domain (Taylor, 2009, page 375) to make its spectrum flatter, before any subsequent modelling, processing or feature extraction.
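(Pre-emphasis is typically just a first-order filter, y[n] = x[n] − k·x[n−1] with k around 0.95–0.97, which boosts the higher frequencies and so flattens the overall spectral slope; the exact value used in the digit recogniser may differ.)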
Pre-emphasis is standard practice in most speech processing – can you find where this is done in the digit recogniser?
Could you re-upload your TextGrid file, but change the suffix to .txt first? (You’ll need to do that on the command line.)
The Fourier transform is invertible, which means that you get as many data points out as you put in. This means that using a longer analysis window in the time domain gives you more points (often referred to as “FFT bins”) in the frequency domain magnitude spectrum (we will ignore phase). Those points (“bins”) are equally spaced from 0 Hz up to the Nyquist frequency.
So, a longer analysis frame in the time domain means that the spectrum has more detail: higher frequency resolution.
The Fourier spectrum is effectively an average over the entire analysis frame (formally, we are making an assumption that the spectrum is constant throughout the frame). A longer time window means averaging over a longer duration of the signal and thus being less precise in the time domain: lower time resolution.
There is a trade-off between time and frequency resolution. In practical terms, this means we have to choose an analysis frame length that is appropriate for our signal. For speech, 25 ms is a common choice.
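As a worked example (the 16 kHz sampling rate here is just for illustration): a 25 ms frame contains 16000 × 0.025 = 400 samples, so the spectrum has points spaced 16000 / 400 = 40 Hz apart; doubling the frame to 50 ms halves that spacing to 20 Hz, but means averaging over twice as long a stretch of the signal.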
The choice between red and green could be arbitrary, but it would be better to choose whichever gives you the lowest global distance (as in DTW).
There are many problems with pattern matching using exemplars, whether using linear or dynamic alignment, and the HMM will solve most of them. For example, the number of local distances summed together might vary with different alignments, making it hard to compare them fairly.
Holmes & Holmes Chapter 8 details various solutions that were proposed for some of these problems, such as introducing ad-hoc penalties, or placing restrictions on how paths move through the grid (e.g., slope constraints). These all demonstrate that the method has many problems and ad-hoc design decisions.