Forum Replies Created
Taylor is separating it out here because he is trying to show how the equations align with the physics of sound propagation in the vocal tract.
Lip radiation can be treated as a constant effect: effectively, a filter that boosts high frequencies. This filtering effect is independent of the configuration of the articulators (Taylor, 2009, equation 11.29).
Furthermore, the constant high-pass filtering effect of lip radiation is more than cancelled out by another constant effect of low-pass filtering at the sound source:
It is this, combined with the radiation effect, that gives all speech spectra their characteristic spectral slope.
(Taylor, 2009, page 332)
So, we don’t need any learnable model parameters for these effects. We can account for them either by absorbing this constant effect into the vocal tract filter (which might be modelled using linear prediction) or by pre-emphasising the signal in the time domain (Taylor, 2009, page 375) to make its spectrum flatter, before any subsequent modelling, processing or feature extraction.
Pre-emphasis is standard practice in most speech processing – can you find where this is done in the digit recogniser?
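To make that concrete, here is a minimal pre-emphasis sketch in Python; the first-order filter form and the 0.97 coefficient are typical choices for illustration, not values taken from the readings:

```python
import numpy as np

def pre_emphasise(signal, coefficient=0.97):
    """First-order high-pass filter: y[n] = x[n] - a * x[n-1].
    Flattens the overall spectral slope caused by the source and
    lip-radiation effects, before any further analysis."""
    return np.append(signal[0], signal[1:] - coefficient * signal[:-1])
```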
Could you re-upload your TextGrid file, but change the suffix to .txt first (you’ll need to do that on the command line)?
The Fourier transform is invertible, which means that you get as many data points out as you put in. This means that using a longer analysis window in the time domain gives you more points (often referred to as “FFT bins”) in the frequency domain magnitude spectrum (we will ignore phase). Those points (“bins”) are equally spaced from 0 Hz up to the Nyquist frequency.
So, a longer analysis frame in the time domain means that the spectrum has more detail: higher frequency resolution.
The Fourier spectrum is effectively an average over the entire analysis frame (formally, we are making an assumption that the spectrum is constant throughout the frame). A longer time window means averaging over a longer duration of the signal and thus being less precise in the time domain: lower time resolution.
There is a trade-off between time and frequency resolution. In practical terms, this means we have to choose an analysis frame length that is appropriate for our signal. For speech, 25 ms is a common choice.
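A quick numerical sketch shows how the number of FFT bins and their spacing change with frame length (the 16 kHz sampling rate and the particular frame lengths are just illustrative assumptions):

```python
import numpy as np

fs = 16000                                  # assumed sampling rate (Hz)
for frame_ms in (10, 25, 50):
    n = int(fs * frame_ms / 1000)           # samples in the analysis frame
    frame = np.random.randn(n)              # stand-in for a windowed speech frame
    magnitude = np.abs(np.fft.rfft(frame))  # n//2 + 1 bins from 0 Hz to Nyquist
    print(f"{frame_ms} ms frame: {len(magnitude)} bins, {fs / n:.0f} Hz apart")

# 10 ms ->  81 bins, 100 Hz apart
# 25 ms -> 201 bins,  40 Hz apart
# 50 ms -> 401 bins,  20 Hz apart
```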
The choice between red and green could be arbitrary, but it would be better to choose whichever gives you the lowest global distance (as in DTW).
There are many problems with pattern matching using exemplars, whether using linear or dynamic alignment, and the HMM will solve most of them. For example, the number of local distances summed together might vary with different alignments, making it hard to compare them fairly.
Holmes & Holmes Chapter 8 details various solutions that were proposed for some of these problems, such as introducing ad-hoc penalties, or placing restrictions on how paths move through the grid (e.g., slope constraints). These all demonstrate that the method has many problems and ad-hoc design decisions.
The filter-bank is performing feature extraction from the speech signal. Even though this chapter is now outdated, filter-bank features are back in use for Automatic Speech Recognition (ASR).
“excitation periodicity” refers to the nature of the vocal fold excitation.
In the time domain, this means the waveform is periodic and its energy will fluctuate over time. Holmes & Holmes say
is also necessary to apply some time-smoothing to remove it
but in fact all this really means in practice is that we need to use a sufficiently long analysis frame (with a duration of, say, at least 2 pitch periods) for short-term analysis. A typical analysis frame duration would be 25 ms or 1/40th of a second.
In the frequency domain, the periodic sound source means that there is harmonic structure in the spectrum.
For ASR, we generally do not want to capture any information about F0 in our features, and so the filters in the filter-bank need to have large enough bandwidths to avoid resolving the harmonics. That is, each filter needs to be wider than, say, twice F0. You could think of this as “blurring” the spectrum (like an out-of-focus photograph) so that we can only see the overall shape and cannot make out the fine detail of the harmonics.
Holmes & Homes describes this as
It is best, therefore, to make the bandwidth of the spectral resolution such that it will not resolve the harmonics of the fundamental of voiced speech.
Module 7 covers feature engineering in more depth and we’ll see that some further processing of the filter-bank outputs can improve things. For example, we will take the log of each filter’s output power.
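As a sketch of what this looks like in practice, assuming the librosa library is available (the 23-filter count, frame sizes, 16 kHz sampling rate and file path are illustrative choices, not values taken from the readings):

```python
import numpy as np
import librosa

# "speech.wav" is a placeholder path.
signal, sr = librosa.load("speech.wav", sr=16000)

# 25 ms frames, 10 ms shift, 23 mel-spaced filters: a small number of
# broad filters, so the harmonic fine detail is smoothed away.
mel_power = librosa.feature.melspectrogram(
    y=signal, sr=sr,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    n_mels=23)

# Take the log of each filter's output power.
log_fbank = np.log(mel_power + 1e-10)    # small constant avoids log(0)
print(log_fbank.shape)                   # (23, number_of_frames)
```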
The reading is “Jurafsky & Martin – Chapter 9 introduction” – meaning the 2.5 pages of un-numbered introduction text immediately after the chapter heading and before 9.1.
Positive comments
Each of these points was made by at least 5 people:
- The course is interesting
- The course is well organised
- The videos are helpful
- The website is good
- The topic map is useful
Several of you also said the tutors were good (and there were no negative points about them).
The forums are sometimes confusing and hard to navigate: there is a lot of content
This is inevitable to some extent, but I do take the following actions during the course, and between years, to maintain some order in the forums:
- editing students’ questions for clarity
- moving topics to the most relevant forum
- merging related topics
I don’t expect you to read every topic in the forums! Use the search function (there is a separate search box only for the forums, separate from the main site search). I also don’t mind duplicate questions – I will simply merge topics and point you to where the question was already answered.
Expectations for the assignment should be more clear
Regarding the lab work: as noted above, I have designed this to be exploratory and only semi-structured. This is to complement other learning styles in the course (e.g., classes are much more structured).
Regarding the written lab report: yes, it is difficult to make expectations clear to such a diverse class where approximately half the students have mostly written essays in previous courses, whilst others have never written an essay. Marking for the first assignment takes this into account: there is flexibility in how you interpret the guidance provided and there is more than one way to get a high mark. We also provide lots of individual and class-wide feedback to help with the second assignment.
Lab instructions for the first assignment should contain examples of each type of mistake we are looking for
I gave live examples in the lab of each type of mistake, rather than write them in the instructions. I will not be adding these to the written instructions, because this assignment is deliberately exploratory to encourage you to develop your own understanding and not simply to follow a sequence of instructions in a prescribed order.
Classes are too fast-paced (and spend too long on introductions and basic material, then rush the harder material)
I will try to pace classes better, getting through the preliminaries more quickly to leave more time for the harder concepts.
I already assume that all students have completed the videos and essential readings ahead of class, and I will continue to do this.
(Only one respondent thought classes were too slow)
Lab sessions need more structure and guidance
We will increase this in the remaining lab sessions, which are for the second coursework assignment on Automatic Speech Recognition.
There are already more detailed milestones for this assignment than the previous one, and we’ll link the lab session content closely to those. We will say in each lab session what the goal of the session is, and what you should be learning in it.
The expansion method depends on the category of NSW. Some are trivial and can be done with very simple rules (e.g., LSEQ) or no further processing (ASWD), whilst others require something more sophisticated, e.g., time or money expressions where there is context-dependency and possibly re-ordering between the characters and the words.
FSTs would be a sensible formalism for some categories, and these would generally be written by hand (possibly expressed via another formalism such as a grammar). The precise choice per NSW category will depend on the particular system, so don’t get hung up on that too much.
For the purposes of the assignment, you can assume Festival uses a variety of methods including both simple rules and FSTs.
You are right that an FST is both an acceptor (for recognising a pattern of characters in an NSW and thus classifying the token as being of that NSW type) and an emitter (for outputting the words). However, you are also right that classification takes place before expansion, so we only use the FST to transduce (“translate”) the token’s characters to words.
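Purely as an illustration of the idea (not Festival’s actual implementation), a couple of trivially simple expansion rules might look like this; the category names follow the NSW taxonomy, everything else is hypothetical:

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_lseq(token):
    """LSEQ: spell out a letter sequence, e.g. 'BBC' -> 'B B C'."""
    return " ".join(token)

def expand_money(token):
    """Very naive money expansion, e.g. '$5' -> 'five dollars'.
    Note the re-ordering: the currency symbol precedes the digits,
    but the word 'dollars' follows the number."""
    match = re.fullmatch(r"\$(\d)", token)
    if match:
        return f"{ONES[int(match.group(1))]} dollars"
    return token

print(expand_lseq("BBC"))   # B B C
print(expand_money("$5"))   # five dollars
```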
You can’t do this assignment with the kal_diphone voice. You would need to take a copy of the Scottish male voice from the lab computers (and delete it from your own device afterwards).
Given the large variety of student computers, we don’t currently offer any technical support on installing Festival. There is a Festival mailing list where you can find answers to most questions.
The Appleton Tower lab is fully supported and is the best place to do the assignment, partly because you will learn more by working alongside other students.
Yes, J&M (2nd edition, Section 8.1.1) are discussing segmenting a longer text into sentences, rather than dividing a sentence into tokens for further processing. The former is the harder problem, for the reasons they explain.
They don’t explicitly discuss the latter, but imply that hand-written rules using whitespace and punctuation would be enough, given that this is what happens to the entire text before a classifier is used to find End-Of-Sentence boundaries.
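As an illustration of how little is needed for basic tokenisation, here is a toy whitespace-and-punctuation splitter; the regex is my own example, not what J&M or Festival actually use:

```python
import re

def tokenise(sentence):
    """Split one sentence into tokens: runs of word characters,
    or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("Dr. Smith paid $5, then left."))
# ['Dr', '.', 'Smith', 'paid', '$', '5', ',', 'then', 'left', '.']
```

Notice that this happily splits the full stop off “Dr.” – deciding whether that full stop actually ends a sentence is the harder, End-Of-Sentence problem that needs a classifier.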
Festival will process multi-sentence text, although its internal data structure assumes a single utterance and there is no representation of “sentences” within an utterance.
So, for the purposes of this assignment you should restrict yourself to isolated single sentences.