Forum Replies Created
Yes – there is only one model per phone (in the case of monophone models).
“Robust” means that WER is generally low across a variety of conditions. A standard, effective method for training a robust ASR system is simply to train on diverse data.
You need to answer the second part of your question by conducting experiments of your own design, which presumably will include training at least one model on diverse data and another model on less-diverse data, and a number of test sets of your choosing.
Your point about the mean is reasonable, and perhaps suggests that a model trained on diverse data will give a higher WER than one trained on less-diverse data provided the less-diverse-data model matches the test set. But don’t forget the variances…
On average, perhaps the diverse-data-trained model will give a lower WER across a range of test sets, whereas the non-diverse-data-trained model gives a high WER on all mismatched test sets and only a low WER on the matched test set? It’s up to you to find out!
Yes, that’s correct – shared across all occurrences of that phone in the entire training data. We train one phone model on all available examples.
The implementation of this involves performing the E step for all data, “accumulating” (i.e., adding up) the necessary statistics (in fact we “accumulate” all the numerators and denominators of the M step equations, but don’t yet divide one by the other). Once all of that has been accumulated across the training data, the M step updates the model’s parameters. There is one numerator accumulator and one denominator accumulator for every individual model parameter (e.g., for the mean of each and every Gaussian).
(This implementation detail is not examinable for Speech Processing.)
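To make the "accumulate, then divide" idea concrete, here is a minimal sketch for the mean of a single one-dimensional Gaussian. It assumes the state occupation probabilities (called gamma here) have already been computed by the E step; the function name and the toy numbers are invented for illustration, and this is not HTK's actual code.

```python
def update_mean(utterances):
    """Re-estimate one Gaussian mean from (observation, gamma) pairs.

    `utterances` is a list of utterances; each utterance is a list of
    (observation, gamma) pairs, where gamma is the state occupation
    probability that the E step would provide (assumed given here).
    """
    numerator = 0.0    # accumulates the sum of gamma * observation
    denominator = 0.0  # accumulates the sum of gamma

    # E step over ALL the data: only add to the accumulators, never divide.
    for utterance in utterances:
        for observation, gamma in utterance:
            numerator += gamma * observation
            denominator += gamma

    # M step: a single division, once everything has been accumulated.
    return numerator / denominator


# Toy data: two utterances containing the same phone (values are invented).
toy_data = [
    [(1.0, 0.5), (2.0, 1.0)],
    [(4.0, 0.5)],
]
new_mean = update_mean(toy_data)
```

A real system would keep one such pair of accumulators for every parameter of every model, and would fill them from all training utterances before doing any division.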
The set -x command in the shell scripts causes all commands to be printed (with a leading “+”) just before being executed. It’s useful for debugging, because you can see the full command line with all variable values. Simply comment it out.
December 5, 2020 at 22:37 in reply to: Creating independent features by rotating the feature space #13454
Yes … but you would need to introduce an extra parameter to model that rotation. Since this is just for one Gaussian (e.g., for one HMM state), that extra parameter would only correct the rotation for its distribution. You would need to introduce an extra “rotation” parameter for every Gaussian, because it would (in general) be different in each case.
In fact, that’s exactly what covariance does. For a 2-dimensional distribution, there is just one covariance parameter (remember that the covariance matrix is symmetric), and you can think of it as that rotation parameter.
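As a rough illustration of the link between covariance and rotation, the sketch below takes some invented, correlated 2-dimensional points, finds the axis rotation implied by their covariance, and checks that the covariance vanishes in the rotated axes. The function names and data are made up; the angle formula is the standard one for a 2x2 symmetric matrix, and note that it depends on the two variances as well as on the covariance.

```python
import math


def covariance_2d(points):
    """Return (var_x, var_y, cov_xy) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    return var_x, var_y, cov_xy


def decorrelating_angle(var_x, var_y, cov_xy):
    """Angle of the axes in which the two features are uncorrelated."""
    return 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)


def rotate_axes(points, theta):
    """Express each point in axes rotated by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in points]


# Invented, strongly correlated data.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
var_x, var_y, cov_xy = covariance_2d(points)
theta = decorrelating_angle(var_x, var_y, cov_xy)
_, _, cov_after = covariance_2d(rotate_axes(points, theta))
```

Here cov_xy is clearly non-zero before the rotation and cov_after is zero (up to rounding) after it, so a diagonal covariance matrix would be enough in the rotated space, but only for this one distribution.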
Because there is only one template (exemplar) and that doesn’t capture the natural variations in duration – it’s just one way that the word (for example) could be said.
The language model emits a sequence of words.
The pronunciation model emits a sequence of phonemes, given a word.
The acoustic model of a phoneme emits a sequence of observations (e.g., MFCC feature vectors).
In order to compile these together into a recognition graph, all of them must be finite state.
The most common form of language model is an N-gram. This is a finite state model that emits a word sequence. The probability of a given word sequence is the product of the probability of each word in the sequence. Those word probabilities are computed given only the N-1 preceding words.
Another form of finite state language model is the hand-crafted grammar used in the digit recogniser exercise.
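The N-gram computation described above can be sketched for the bigram case (N = 2), where each word's probability depends only on the one preceding word. The probability table, its values, and the sentence-start symbol "<s>" are all invented for illustration.

```python
# Hypothetical bigram probabilities P(word | previous word); values invented.
BIGRAM = {
    ("<s>", "one"): 0.5,
    ("one", "two"): 0.4,
    ("two", "three"): 0.25,
}


def sequence_prob(words, bigram):
    """Product over the sequence of P(word | previous word)."""
    prob = 1.0
    previous = "<s>"  # assumed sentence-start symbol
    for word in words:
        prob *= bigram[(previous, word)]
        previous = word
    return prob


p = sequence_prob(["one", "two", "three"], BIGRAM)
```

In practice you would sum log probabilities rather than multiply probabilities, to avoid numerical underflow on long sequences.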
Young et al. call token passing a “conceptual model”, by which they mean a way to think about dynamic programming as well as to implement it.
The two methods in Figures 2 and 3 of the paper are equivalent – they are both computing the most likely alignment (for DTW) or state sequence (for HMMs).
The power and generality of token passing comes into play when we construct more complex models with what the paper calls a “grammar” but we could more generally call a language model. Implementing the algorithm on a matrix (we could also call this a “grid”) is just fine for an isolated word model, but quickly becomes very complicated and messy for connected words with an arbitrary language model. In contrast, token passing extends trivially to any language model, provided it is finite state (e.g., an N-gram, or the hand-crafted grammar from the digit recogniser exercise).
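Here is a minimal sketch of token passing on a single small finite state model, just to show the mechanics: tokens are copied along arcs, and each state keeps only the best token arriving at it. The data structures, function name and toy probabilities are my own assumptions and not taken from the paper or from HTK.

```python
import math


def token_passing(log_trans, log_emit, start_state, final_state):
    """Viterbi search written as token passing.

    log_trans: {state: [(next_state, log transition prob), ...]}
    log_emit:  list over frames of {state: log emission prob}
    Returns (log prob, state sequence) of the best token in final_state.
    """
    # One token to begin with, sitting in the start state at the first frame.
    tokens = {start_state: (log_emit[0][start_state], [start_state])}

    for frame in log_emit[1:]:
        new_tokens = {}
        for state, (log_prob, path) in tokens.items():
            # Pass a copy of the token along every outgoing arc.
            for next_state, log_a in log_trans.get(state, []):
                score = log_prob + log_a + frame[next_state]
                best = new_tokens.get(next_state)
                # Each state keeps only the best token arriving at it.
                if best is None or score > best[0]:
                    new_tokens[next_state] = (score, path + [next_state])
        tokens = new_tokens

    return tokens[final_state]


# Toy 2-state left-to-right model; all probabilities are invented.
log_trans = {
    0: [(0, math.log(0.5)), (1, math.log(0.5))],
    1: [(1, math.log(1.0))],
}
emissions = [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, {0: 0.1, 1: 0.9}]
log_emit = [{s: math.log(p) for s, p in frame.items()} for frame in emissions]

best_log_prob, best_path = token_passing(log_trans, log_emit, 0, 1)
```

Nothing in the loop cares what the states and arcs represent, which is why the same procedure carries over to a network compiled from a language model: only log_trans changes.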
In Unix (and Linux) type operating systems, when a process finishes, it exits with a status. This is 0 for success and some other number if there was an error. This status is available immediately afterwards in the shell variable $? and here is how you might test that:
HInit -T 0 \
  -G ESPS \
  -m 1 \
  -C resources/CONFIG \
  -l $WORD \
  -M models/hmm0 \
  -o $WORD \
  -L ${DATA}/lab/train \
  -S ${SCRIPT_FILE} \
  models/proto/$PROTO
if [ $? -ne 0 ]
then
  echo "HInit failed for script file "${SCRIPT_FILE}
  exit 1
fi
The command exit 1 means that this script will immediately exit with status 1, which you could detect in the script that called it – so, you can propagate the error back up to the top-level script (e.g., do_experiments).
grammar, word_list and the dictionary are the only things you need to change (I suggest making copies of them in different files, so you don’t break your isolated digit experimental setup). From grammar, you need to use HParse to create the finite state model, which is what HVite requires.
user_test... doesn’t look like a valid file name – shouldn’t there be a username there? Are there any files in your rec directory? Did you get any earlier errors?
The bank of filters is one stage in extracting useful features from a frame of speech waveform. The diagram you show is a simple bank of rectangular filters, which would work, but in practice we typically use triangular overlapping filters.
The energy in each band is one feature. It is the total amount of energy in that frequency band, for that frame of speech. The number of filters determines how many features we extract (typically 20-30).
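A rough sketch of how a bank of triangular overlapping filters turns one frame's power spectrum into a handful of features. To keep it short, the filters here are equally spaced on a linear frequency axis, which is a simplifying assumption of mine (real front ends usually space them on a perceptual scale such as Mel); the function name and the toy spectrum are also invented.

```python
def triangular_filterbank_energies(power_spectrum, num_filters):
    """Energy in each of `num_filters` overlapping triangular bands."""
    num_bins = len(power_spectrum)
    # num_filters + 2 equally spaced edge points; filter i spans
    # edges[i]..edges[i + 2] and peaks at edges[i + 1], so neighbours overlap.
    step = (num_bins - 1) / (num_filters + 1)
    edges = [i * step for i in range(num_filters + 2)]

    energies = []
    for i in range(num_filters):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        energy = 0.0
        for k, power in enumerate(power_spectrum):
            if left < k <= centre:
                weight = (k - left) / (centre - left)    # rising slope
            elif centre < k < right:
                weight = (right - k) / (right - centre)  # falling slope
            else:
                weight = 0.0                             # outside the band
            energy += weight * power
        energies.append(energy)
    return energies


# Toy flat spectrum with 9 bins, reduced to 3 filterbank features.
flat_spectrum = [1.0] * 9
energies = triangular_filterbank_energies(flat_spectrum, 3)
```

Each element of energies is one feature for this frame; a typical front end would then take the log of each value.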
HResults uses dynamic programming to align the recognition output and the reference transcription. So, WER is simply the edit distance between recognition output and reference transcription. There are three possible types of error: substitutions, insertions and deletions. WER is just the sum of those three, divided by the number of words in the reference transcription, and expressed as a percentage.
For the special case of isolated words, the only possible type of error is a substitution error, and the dynamic programming is not really needed.
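The WER calculation described above can be sketched with a standard edit distance computed by dynamic programming. This sketch assumes every substitution, deletion and insertion costs 1; the alignment costs that HResults uses internally may differ, so treat this as an illustration of the idea and not a re-implementation of HResults.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = (S + D + I) / number of reference words * 100."""
    ref, hyp = reference.split(), hypothesis.split()

    # dist[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # all deletions
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # all insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            # Diagonal move: free if the words match, else a substitution.
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


# One substitution ("two" -> "three") and one insertion ("five").
wer = word_error_rate("one two three four", "one three three four five")
```

Because insertions count as errors but the denominator is the number of reference words, WER can exceed 100%.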
Note that HResults reports “Accuracy (Acc)”, but you should only use WER (100 – Acc) in your report.
Ignore the value of “Correct (Corr)” reported by HResults – this does not account for insertion errors and is not a measure used anymore.
Some LaTeX classes (depending on which .cls file you are using) have
\paragraph{Heading} Text starts here...
as a lower-level heading. You could use that, probably un-numbered (so use \paragraph*{Heading} if your class file numbers paragraphs).
Summary of positive points – these are things we will keep doing:
course structure and content – modules, interactive tutorials and live classes, PHON material, faster pacing than typical linguistics courses
videos – informative, clear, simple, visual, transcripts, route maps
notebooks – maths in depth, weekly exercises, animations
tutorials – twice per week, use of new/complementary slides and explanations, writing skills
teaching staff – accessible, helpful
speech.zone – generally preferred over Learn, well-organised, easy to search