Forum Replies Created
The labels are for different things in training and testing.
For the training data, we need to know the start and end time of each training example because HRest requires this: it trains one model at a time. For the test data, we simply need the correct label for each test .mfcc file, to use as a reference when computing the WER.

You should reference the manual for whatever version you are using (you can find that by running any HTK program with just the -V flag).

Both algorithms find alignments between states and observations.
Both algorithms express this alignment as the probability of each observation aligning with each state, which is just another way of describing the state sequence. In Viterbi, we can think of those probabilities as being “hard” – 1s and 0s – because just one state sequence is considered. In Baum-Welch, they will be “soft” because all state sequences are considered.
These probabilities are then used as the weights in a weighted sum of observations to re-estimate the means of the Gaussians. The weights will not sum to one, so this weighted sum must be normalised by dividing by the sum of the weights.
Remember that in this course, we’re not deriving the equations to express the above. So bear in mind that whilst the concept of “hard” and “soft” alignment is perfectly correct, the exact computations of the weights might be slightly more complex.
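To make that concrete, here is the standard form of the update for the mean of one Gaussian in state j (a sketch only – as just noted, the full computations are slightly more involved):

    \hat{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\, o_t}{\sum_{t=1}^{T} \gamma_t(j)}

where \gamma_t(j) is the weight for observation o_t aligning with state j: 1s and 0s under Viterbi (\gamma_t(j) = 1 only along the single best state sequence), and soft values \gamma_t(j) = P(q_t = j \mid O, \lambda) under Baum-Welch. The denominator is exactly the “sum of the weights” used for normalisation.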
In training, W is constant for each training sample, so no language model is needed.
The objective function of both algorithms when used for training is to maximise the probability of the observations given the model: P(O|W). This is called “maximum likelihood training” and is the simplest and most obvious thing to do.
(However, this simple objective function does not directly relate to the final task, which is one of classification. So, in advanced courses on ASR, other objective functions would be developed which more directly relate to minimising the recognition error rate. Those are much more complex.)
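In symbols, with \lambda standing for the model parameters (notation added here for illustration – the reply above leaves it implicit), maximum likelihood training solves:

    \hat{\lambda} = \arg\max_{\lambda} P(O \mid W; \lambda)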
The states in a whole word model do not correspond to phonemes. Figure 9.4 in Jurafsky & Martin (2nd edition) implies that they do, but what the authors are actually doing is constructing a word model from sub-word (phoneme) models, and their phoneme models have a single state (which is not common – normally we use 3 states). The figure is misleading.
The number of emitting states in a model is a design choice we need to make. As you correctly say, more states means we will need more training data, because the model will have more parameters.
In the digit recogniser assignment, there are a variety of “prototype” models that have varying numbers of states, for you to experiment with. It’s certainly worth doing an experiment to investigate this; make sure it’s one using large training and test sets, not just a single speaker.
You could try using a different number of states for each digit in the vocabulary, but that’s probably not the most fruitful line of experiments.
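If you do run such an experiment, it is just the usual train-and-test cycle repeated once per prototype – a minimal sketch (the prototype and script names here are hypothetical; substitute whatever your own scripts are called):

    #!/bin/sh
    # Hypothetical sketch: compare prototypes that differ only in their
    # number of emitting states, on the same large training and test sets.
    for PROTO in proto_3states proto_5states proto_7states
    do
        ./scripts/train_models $PROTO        # train all 10 digit models from this prototype
        ./scripts/run_recognition            # recognise the whole test set
        ./scripts/results > results_${PROTO}.txt   # record the WER for later comparison
    done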
First, you can and should only run make_mfccs for your own data. (The only exception would be an advanced experiment varying the parameterisation, and you should talk that over with me in a lab session before attempting it.)

There are ongoing permissions problems on the server, and so I’ve reset them again. Please try again and report back.
Each model is of one digit. There are always 10 models (look in the models directory to see them). This is the same, regardless of what data you train the models on.
You need to run HInit (and then HRest) once for each model you want to train. The basic scripts do this already for you, using a for loop, and you’ll keep that structure for speaker-independent experiments too.
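That structure looks something like this – a sketch assembled from the commands quoted elsewhere on this page (the .scp filename is hypothetical):

    #!/bin/sh
    # One HInit run, then one HRest run, per digit model.
    for WORD in zero one two three four five six seven eight nine
    do
        # Initialise this digit's model from the prototype.
        HInit -T 1 -C resources/CONFIG -l $WORD -M models/hmm0 -o $WORD \
            -L ${DATA}/lab/train \
            -S file_lists/train.scp \
            models/proto/$PROTO

        # Re-estimate it with HRest, using exactly the same data.
        HRest -T 1 -C resources/CONFIG -l $WORD -M models/hmm1 \
            -L ${DATA}/lab/train \
            -S file_lists/train.scp \
            models/hmm0/$WORD
    done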

November 17, 2019 at 16:00 in reply to: Sanity-checking script returns higher WER than run_experiments #10260

    HVite -T 1 -C resources/CONFIG \
        -d models/hmm1 \
        -l rec \
        -w resources/grammar_as_network \
        resources/dictionary \
        resources/word_list \
        -S file_lists/testingscriptfile.scp \

is wrong – the -S argument must go before the dictionary and word_list. You should be getting an error message from this that you have missed, I suspect. Also, remove the final “\”, because that means “continues on the next line”.
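Applying those two fixes gives (the same command rearranged, not a tested recipe):

    HVite -T 1 -C resources/CONFIG \
        -d models/hmm1 \
        -l rec \
        -w resources/grammar_as_network \
        -S file_lists/testingscriptfile.scp \
        resources/dictionary \
        resources/word_list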
Even if that fixes the problem, you should still put in more error checking to all your scripts, to be confident everything is working correctly.
You’ve done one obvious thing: deleting all trained HMMs in both hmm0 and hmm1 before every experiment. You can also delete everything in the rec directory too.

Then start checking things such as the following (a sketch of such checks comes after the list):
- Training is finding the expected number of examples of each digit, in both training phases
- You are using the exact same data for HInit and HRest
- Recognition is processing the expected number of files
- Results scoring is finding the expected number of examples of each digit
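For example, a minimal sketch of such checks (the digit, directory and .scp names follow the layout quoted elsewhere on this page – check them against your own setup):

    #!/bin/sh
    # How many training examples of the digit "three" are in the label files?
    grep -ow "three" ${DATA}/lab/train/*.lab | wc -l

    # How many test files did recognition actually produce output for?
    ls rec/*.rec | wc -l

    # How many test files should have been processed?
    wc -l < file_lists/testingscriptfile.scp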
You really should get identical results, so it’s worth tracking this bug down before proceeding.
(It might be coincidence, but 10 is twice 5 and 3.33 is twice 1.67.)
Neither.
Think about what it means to train a model: to estimate its parameters from some data. If you want to model all the data, then the training algorithm will need all of that data at once.
So, to train a model of any given digit, on any data set (whether from one person or many), we run HInit once, and HRest once, loading all of the data. We repeat that for each digit model we need (there are for loops in the scripts that do that).
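In a script, that means building one list of all the training data and loading it in a single run per digit – a sketch (the .scp filename is hypothetical):

    #!/bin/sh
    # One script file listing ALL the training data, from every speaker:
    ls ${DATA}/mfcc/train/*_train.mfcc > file_lists/all_speakers_train.scp

    # Still exactly one HInit run per digit model, but that single run now
    # loads every speaker's data at once (HRest is run the same way).
    HInit -T 1 -C resources/CONFIG -l $WORD -M models/hmm0 -o $WORD \
        -L ${DATA}/lab/train \
        -S file_lists/all_speakers_train.scp \
        models/proto/$PROTO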

The figure uses a non-linear frequency scale on the vertical axis.
The lower frequency channels are closer together on a linear frequency scale and this is what the figure shows.
Compare the 3 channels covering the lower frequency range of 0 Hz to 1 kHz (which is a range of 1 kHz) with the 3 channels covering the higher frequency range from 3 kHz to 5 kHz (which is a range of 2 kHz). The ones in the lower frequency range are closer together (i.e., there are more “channels per kHz”) than those in the higher range.
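The reply doesn’t name the scale, but filterbanks for MFCCs are conventionally spaced on the mel scale, which compresses the frequency axis in exactly this way:

    m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

where f is frequency in Hz: equal steps in m correspond to progressively wider steps in f.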
Yes, you can use whatever language you like – we will not be marking your code, only your lab report.
Hint: you will probably want to use some HTK tools as part of the sanity checking. You can of course run those from Python.
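For example, HTK’s HList tool prints the contents of a parameterised file, which makes a good first check that each .mfcc file really contains what you expect – a sketch (check the flag against the manual for your HTK version):

    # Print the header of an .mfcc file: the sample kind, number of frames,
    # frame period and vector size should all match what you expect.
    HList -h ${DATA}/mfcc/train/simonk_train.mfcc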

November 15, 2019 at 19:20 in reply to: ./scripts/results returns "unable to open label file" inconsistently #10239

HTK looks for the reference labels first in the MLF(s) you provide, and failing that in the same directory as the .rec files that it will compare to those reference labels. First, make sure you are loading the correct MLF.

This error indicates that HTK didn’t find s2002376_test01.lab in any of the MLFs you provided. That will be because of a formatting error in your MLF. Fix that, re-run make_mfccs (which copies it to the data directory) and try again.

Take a look at other MLFs (e.g., the one for simonk) in the main data directory for reference and look for differences in yours. Common mistakes include: missing quotation marks around each filename, incorrect extension on the filename, typos in the filename, one or more spaces after any of the periods, or a missing newline at the end of the file.

November 15, 2019 at 07:54 in reply to: ./scripts/results returns "unable to open label file" inconsistently #10228

Solve the error with HVite first.

    Unable to open label file rec/kateb_test?(_)??.rec

means there were no .rec files matching that pattern. If recognition fails, then there will be no output (which HVite saves in .rec files) for HResults to score.

The HVite error about failing to open the audio device suggests that you are trying to run “live” recognition instead of using a stored .mfcc file.

Use set -x in your script and it will print out the complete HVite command that is being run. This will help you spot what arguments are incorrect or missing. Perhaps you are not providing an .mfcc file?

I suspect you might have tried to comment out a line like this:
    HInit -T 1 \
        -G ESPS \
        -m 1 \
        -C resources/CONFIG \
        -l $WORD \
        -M models/hmm0 \
        -o $WORD \
        -L ${DATA}/lab/train \
        models/proto/$PROTO \
        ${DATA}/mfcc/train/simonk_train.mfcc \
        # ${DATA}/mfcc/train/pilar_train.mfcc \
        ${DATA}/mfcc/train/jason_train.mfcc
which is not allowed – it would be like trying to comment out the middle of a single long line.
Or, perhaps you have a space after one of your \ characters, which is also not allowed – this will mark the end of the command:

    HInit -T 1 \
        -G ESPS \
        -m 1 \
        -C resources/CONFIG \
        -l $WORD \
        -S file_lists/myscriptfile.scp \ <-- IS THERE A SPACE HERE ?
        -M models/hmm0 \
        -o $WORD \
        -L ${DATA}/lab/train \
        models/proto/$PROTO
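The safe fix is to delete the unwanted line entirely rather than comment it out, and to make sure no backslash is followed by a space – e.g., a corrected version of the first command above:

    # Correct: the unwanted file's line is deleted, every continuation line
    # ends in a backslash with NO trailing space, and the last line has no
    # backslash at all.
    HInit -T 1 \
        -G ESPS \
        -m 1 \
        -C resources/CONFIG \
        -l $WORD \
        -M models/hmm0 \
        -o $WORD \
        -L ${DATA}/lab/train \
        models/proto/$PROTO \
        ${DATA}/mfcc/train/simonk_train.mfcc \
        ${DATA}/mfcc/train/jason_train.mfcc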

Using set -x in a shell script will cause all commands to be printed out in full, with all their arguments completed. So, you can find the command being run and you will probably find that it is missing some of the arguments.
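For example (illustrative only):

    #!/bin/sh
    set -x      # from here on, the shell echoes every command, fully expanded
    WORD=three
    PROTO=proto_3states
    HInit -T 1 -l $WORD -M models/hmm0 models/proto/$PROTO
    # The shell prints a trace line like
    #   + HInit -T 1 -l three -M models/hmm0 models/proto/proto_3states
    # before running the command, so missing or empty arguments become obvious.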

Yes, it’s as simple as taking the first 12 values starting from the left on the horizontal axis. We call this operation “truncation”.
They are ordered (the lower ones capture broad spectral shape, higher ones capture finer and finer details), so this is theoretically well-motivated.
The DCT provides us with a representation (a set of coefficients) that has less correlation between its dimensions (the coefficients) than the filterbank outputs have between themselves.
This representation is called the cepstrum, which is in a different domain to the spectrum (which is of course in the frequency domain). We can plot the cepstrum – for example Taylor (2009) figure 12.11(c). The horizontal axis is actually time, which is what happens when you take the DCT (or the very similar inverse Fourier transform) of something in the frequency domain. But this is not the key point.
Key points to note in Taylor’s figure are
- the lower order cepstral coefficients have larger magnitudes than the higher order ones: this is telling us that we can make a good approximation to the original spectrum using just a few coefficients – that is, by truncating the cepstrum; Taylor suggests about 30 coefficients are needed, but for ASR we usually go lower, to 12
- there is a small peak in the cepstrum at about 120 samples – this corresponds to the fundamental period, T0 (= 1/F0); this tells us that truncating the cepstrum will effectively remove evidence of F0, which is what we want to do for ASR
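In HTK, this truncation is requested in the configuration used at parameterisation time; a typical fragment looks something like this (a sketch – your resources/CONFIG may differ):

    TARGETKIND = MFCC_0    # MFCCs plus an appended 0'th cepstral coefficient
    NUMCHANS   = 20        # filterbank channels before the DCT
    NUMCEPS    = 12        # truncate the cepstrum to 12 coefficients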