Forum Replies Created
Distance is what we use in pattern matching. A smaller distance is better: it means that patterns are more similar. The Euclidean distance is an example of a distance metric.
Probability is what we use in probabilistic modelling. We use probability density rather than probability mass when modelling a continuous variable with a probability density function. The Gaussian is an example of a probability density function (pdf).
When we use a generative model, such as a Gaussian, to compute the probability of emitting an observation given the model (= conditional on the model) we are calculating a conditional probability density, which we call the likelihood.
A larger probability, probability density, or likelihood is better: it indicates that a model is more likely to have generated the observation.
To do classification, we only ever compare distances or likelihoods between models. We don’t care about the absolute value, just which is smallest (lowest) or largest (highest), respectively.
The log is a monotonically increasing function, so taking logs does not change the outcome of any comparison between values. We take the log only for reasons of numerical precision.
It doesn’t matter that probability densities are not necessarily between 0 and 1; they are always positive, so we can always take the log. A higher probability density leads to a higher log likelihood.
Taking the negative simply inverts the direction for comparisons. We might use negative log likelihoods in specific speech technology applications when it feels more natural to have a measure that behaves like a distance (smaller is better).
In general, for Automatic Speech Recognition, we use log likelihood. This is what HTK prints out, for example. Those numbers will almost always be negative in practice, but a positive log likelihood is possible in theory because a likelihood can be greater than one when it is computed using a probability density.
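As a toy illustration of that comparison (the model names and log likelihood values here are invented, not HTK output): classification reduces to picking the model with the highest log likelihood, which can even be done with standard shell tools:

```shell
# Invented per-model log likelihoods, one "model value" pair per line.
# Sort numerically on the second field (ascending) and take the last line:
# the model with the highest (least negative) log likelihood wins.
printf 'one -1043.2\ntwo -1127.8\nthree -998.4\n' | sort -n -k2,2 | tail -1
```

This prints `three -998.4`, so model "three" would be the recognition result.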
Yes, HMMs can be used for speech synthesis. We would not use MFCCs because they are not invertible, due to the filterbank. We might use something closer to the true cepstrum.
Pre-emphasis is primarily to avoid numerical problems in taking the DFT of a signal with very different amounts of energy at different frequencies. Using too much pre-emphasis might have negative consequences by over-boosting high frequencies, but typically only modest pre-emphasis is applied.
Pre-emphasis will boost any non-speech at higher frequencies, yes. So, if that is noise then pre-emphasis will make the Signal-to-Noise Ratio (SNR) worse. Again, this should not be a problem with the modest amount of pre-emphasis typically used.
Try this (again, I’m using echo just as an example of a program that produces output on stdout – you are probably trying to capture the output of a more interesting program):

$ echo -e "this is 1\nand now 2"
$ echo -e "this is 1\nand now 2" | grep this
$ echo -e "this is 1\nand now 2" | grep this | cut -d" " -f3

cut cuts vertically, and the above arguments say “define the delimiter as the space character, cut into fields using that delimiter, and give me the third field”.

To capture the output of a program, or of a pipeline of programs as we have above, we need to run it inside “backticks”. So, let’s capture the output of that pipeline of commands and store it in a shell variable called MYVAR:

$ MYVAR=`echo -e "this is 1\nand now 2" | grep this | cut -d" " -f3`
$ echo The value is: ${MYVAR}
Unix programs print output on one or both of two output streams called stdout (standard out) and stderr (standard error). The former is meant for actual output and the latter for error messages, although programs are free to use either for any purpose.

Here’s a program that prints to stdout for testing purposes: it’s just the echo command:

$ echo -e "the first line\nthe second line"

which will print this to stdout:

the first line
the second line
Now let’s capture that and do something useful with it. We ‘pipe’ the output to the next program using “|” (pronounced “pipe”):
$ echo -e "the first line\nthe second line" | grep second
which gives
the second line
where grep finds the pattern of interest. Or how about cutting vertically to get a certain character range:

$ echo -e "the first line\nthe second line" | cut -c 5-9

which gives

first
secon
Now combine them:
$ echo -e "the first line\nthe second line" | grep first | cut -c 5-9
to print only
first
At first, yes, make it by hand so you get the format right.
Eventually, you’ll want to make it from a list of USER names. The forum on shell scripting has some tips on how you might do this using a for loop that reads the USER list from one file and creates an HTK script in another file.
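As a sketch of that idea (the USER names, file names, and data directory here are invented for illustration – adjust them to match your own setup): a for loop reads the USER list from one file and writes one MFCC path per line into the HTK script file:

```shell
#!/bin/sh
# Invented example USER list, one name per line
printf 's1764494\ns1766810\ns1770642\n' > users.txt

# Assumed location of the training MFCC files
DATADIR=/Volumes/Network/courses/sp/data/mfcc/train

# Start with an empty script file, then append one full path per USER
> train.scp
for USER in `cat users.txt`; do
    echo "${DATADIR}/${USER}_train.mfcc" >> train.scp
done

cat train.scp
```

Each line of train.scp is then a full path ending in `_train.mfcc`, ready to pass to HTK with the -S flag.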
A script file (e.g., it might be called train.scp) would look something like this – a list of MFCC files with full paths:

/Volumes/Network/courses/sp/data/mfcc/train/s1764494_train.mfcc
/Volumes/Network/courses/sp/data/mfcc/train/s1766810_train.mfcc
/Volumes/Network/courses/sp/data/mfcc/train/s1770642_train.mfcc

You pass this using the -S flag like this:

HRest ...other command line args here... \
  -S train.scp \
  models/hmm0/$WORD
noting that when you use -S you no longer pass any MFCC files as command line arguments.

You are probably running the script from inside the scripts directory. The scripts are very simple and only work from the main directory. You should run them like this:

$ pwd
/home/atlab/Documents/sp/digit_recogniser
$ ./scripts/initialise_models
Your script should be simply

#!/usr/bin/bash
./scripts/initialise_models
./scripts/train_models

You don’t need the $ – those represent the bash prompt in the instructions.

Your line

SCRIPT_PATH="/Documents/sp/digit_recogniser/"

is not needed because you don’t use the shell variable SCRIPT_PATH anywhere later in the script. If you did need this, then the path should most probably be

~/Documents/sp/digit_recogniser/

where ~ is shorthand for your home directory.

Look under the “Class” tab for Module 7. The “Live class” item now has written notes.
Let me know if this is useful and I will add this for other modules.
Yes, that’s a reasonable approach.
In this classic paper on voice conversion they use the cepstrum to represent the spectral envelope rather than LPC coefficients.
In this paper, we do use linear prediction as the parameterisation of the spectral envelope. But we don’t use the coefficients of the difference equation (the LPC parameters) directly – we transform them to another representation called line spectral frequencies (LSFs) for reasons explained in the paper.
Rather than use the source speaker’s residual (which is one option), we predict the residual from the converted spectral envelope.
If you are in Edinburgh, please remember that there are plenty of hardcopies for loan in the main library.
Yes – you can read waveforms!
Yes – this is a very reasonable proposition. Like many good ideas, it has been tried. Here’s a paper from Tokuda (most famous for speech synthesis) et al. on what they call Mel-Generalized Cepstral Analysis – it also shows the relationship between the cepstrum and LPC analysis. (This is well beyond the scope of the Speech Processing course!)
I have sent you a personal message in Teams. Any other students in the same situation should contact me for personal advice.