Forum Replies Created
Yes – there is only one model per phone (in the case of monophone models).
“Robust” means that WER is generally low across a variety of conditions. A standard, effective method for training a robust ASR system is simply to train on diverse data.
You need to answer the second part of your question by conducting experiments of your own design, which presumably will include training at least one model on diverse data and another model on less-diverse data, and a number of test sets of your choosing.
Your point about the mean is reasonable, and perhaps suggests that a model trained on diverse data will give a higher WER than one trained on less-diverse data provided the less-diverse-data model matches the test set. But don’t forget the variances…
On average, perhaps the diverse-data-trained model will give a lower WER across a range of test sets, whereas the non-diverse-data-trained model gives a high WER on all mismatched test sets and only a low WER on the matched test set? It’s up to you to find out!
Yes, that’s correct – shared across all occurrences of that phone in the entire training data. We train one phone model on all available examples.
The implementation of this involves performing the E step for all data, “accumulating” (i.e., adding up) the necessary statistics (in fact we “accumulate” all the numerators and denominators of the M step equations, but don’t yet divide one by the other). Once all of that has been accumulated across the training data, the M step updates the model’s parameters. There is one numerator accumulator and one denominator accumulator for every individual model parameter (e.g., for the mean of each and every Gaussian).
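For example, using the usual Baum–Welch notation, the updated mean of the Gaussian for state $j$ is the ratio of two accumulators:

\[
\hat{\mu}_j = \frac{\sum_{\text{utterances}} \sum_{t} \gamma_j(t)\, \mathbf{o}_t}{\sum_{\text{utterances}} \sum_{t} \gamma_j(t)}
\]

where $\gamma_j(t)$ is the state occupation probability from the E step and $\mathbf{o}_t$ is the observation vector at time $t$. The numerator and denominator are each summed across all training utterances first; only then does the M step perform the division.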
(This implementation detail is not examinable for Speech Processing.)
The
set -x
command in the shell scripts causes all commands to be printed (with a leading “+”) just before being executed. It’s useful for debugging, because you can see the full command line with all variable values. Simply comment it out.
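A minimal sketch (not one of the course scripts) of what that looks like:

set -x
WORD=one
echo $WORD

When run, the shell prints each command just before executing it, roughly:

+ WORD=one
+ echo one
one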
Yes … but you would need to introduce an extra parameter to model that rotation. Since this is just for one Gaussian (e.g., for one HMM state), that extra parameter would only correct the rotation for its distribution. You would need to introduce an extra “rotation” parameter for every Gaussian, because it would (in general) be different in each case.
In fact, that’s exactly what covariance does. For a 2-dimensional distribution, there is just one covariance parameter (remember that the covariance matrix is symmetric), and you can think of it as that rotation parameter.
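Written out for the 2-dimensional case:

\[
\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}
\]

where $\sigma_{12}$ appears twice because of the symmetry, so there really is just one covariance parameter, and it is what captures that “rotation” of the distribution.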
Because there is only one template (exemplar) and that doesn’t capture the natural variations in duration – it’s just one way that the word (for example) could be said.
The language model emits a sequence of words.
The pronunciation model emits a sequence of phonemes, given a word.
The acoustic model of a phoneme emits a sequence of observations (e.g., MFCC feature vectors).
In order to compile these together into a recognition graph, all of them must be finite state.
The most common form of language model is an N-gram. This is a finite state model that emits a word sequence. The probability of a given word sequence is the product of the probability of each word in the sequence. Those word probabilities are computed given only the N-1 preceding words.
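Written as a formula, where $w_i$ is the $i$-th word in a sequence of length $T$:

\[
P(w_1, w_2, \ldots, w_T) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
\]

For a bigram (N=2), for example, each word’s probability depends only on the single preceding word.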
Another form of finite state language model is the hand-crafted grammar used in the digit recogniser exercise.
Young et al. call token passing a “conceptual model”, by which they mean a way to think about dynamic programming as well as a way to implement it.
The two methods in Figures 2 and 3 of the paper are equivalent – they are both computing the most likely alignment (for DTW) or state sequence (for HMMs).
The power and generality of token passing comes into play when we construct more complex models with what the paper calls a “grammar” but we could more generally call a language model. Implementing the algorithm on a matrix (we could also call this a “grid”) is just fine for an isolated word model, but quickly becomes very complicated and messy for connected words with an arbitrary language model. In contrast, token passing extends trivially to any language model, provided it is finite state (e.g., an N-gram, or the hand-crafted grammar from the digit recogniser exercise).
In Unix (and Linux) type operating systems, when a process finishes, it exits with a status. This is 0 for success and some other number if there was an error. This status is available immediately afterwards in the shell variable
$?
and here is how you might test that:

HInit -T 0 \
  -G ESPS \
  -m 1 \
  -C resources/CONFIG \
  -l $WORD \
  -M models/hmm0 \
  -o $WORD \
  -L ${DATA}/lab/train \
  -S ${SCRIPT_FILE} \
  models/proto/$PROTO

if [ $? -ne 0 ]
then
  echo "HInit failed for script file "${SCRIPT_FILE}
  exit 1
fi
The command
exit 1
means that this script will immediately exit with status 1, which you could detect in the script that called it – so you can propagate the error back up to the top-level script (e.g.,
do_experiments
).
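As a minimal sketch (the script name and message here are just placeholders, not the actual files from the exercise), the calling script can test the exit status of a sub-script in exactly the same way:

./train_one_word.sh $WORD
if [ $? -ne 0 ]
then
  echo "training script failed for word "$WORD
  exit 1
fi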
grammar, word_list and the dictionary are the only things you need to change (I suggest making copies of them in different files, so you don’t break your isolated digit experimental setup). From grammar, you need to use
HParse
to create the finite state model, which is what
HVite
requires.
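As a rough sketch only (the file names here are placeholders, and the HVite options should match whatever you already use for the isolated digit recogniser):

# compile the grammar into a finite state word network
HParse my_connected_grammar wdnet

# decode using that network, the new dictionary, and the list of model names
HVite -C resources/CONFIG \
  -H models/hmm_final/hmmdefs \
  -w wdnet \
  -i rec/connected.mlf \
  -S ${TEST_SCRIPT_FILE} \
  my_connected_dictionary my_word_list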
user_test...
doesn’t look like a valid file name – shouldn’t there be a username there? Are there any files in your
rec
directory? Did you get any earlier errors?

The bank of filters is one stage in extracting useful features from a frame of speech waveform. The diagram you show is a simple bank of rectangular filters, which would work, but in practice we typically use triangular overlapping filters.
The energy in each band is one feature. It is the total amount of energy in that frequency band, for that frame of speech. The number of filters determines how many features we extract (typically 20-30).
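One common way to write this down (some implementations use the magnitude rather than the power, and a log is usually taken afterwards) is

\[
E_m = \sum_{k} |X[k]|^{2} \, H_m[k]
\]

where $X[k]$ is the DFT of the windowed frame and $H_m[k]$ is the frequency response of the $m$-th filter: $E_m$ is one feature, and there is one such value per filter.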
HResults
uses dynamic programming to align the recognition output and the reference transcription. So, WER is simply the edit distance between recognition output and reference transcription.

There are three possible types of error: substitutions, insertions and deletions. WER is just the sum of those three, divided by the number of words in the reference transcription, and expressed as a percentage.
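In other words, with $S$ substitutions, $D$ deletions, $I$ insertions and $N$ words in the reference transcription:

\[
\text{WER} = \frac{S + D + I}{N} \times 100\%
\]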
For the special case of isolated words, the only possible type of error is a substitution error, and the dynamic programming is not really needed.
Note that
HResults
reports “Accuracy (Acc)”, but you should only use WER (100 – Acc) in your report.

Ignore the value of “Correct (Corr)” reported by
HResults
– this does not account for insertion errors and is not a measure used anymore.

Some LaTeX classes (depending on which .cls file you are using) have
\paragraph{Heading} Text starts here...
as a lower-level heading. You could use that, probably un-numbered (so use
\paragraph*{Heading}
if your class file numbers paragraphs).

Summary of positive points – these are things we will keep doing:
course structure and content – modules, interactive tutorials and live classes, PHON material, faster pacing than typical linguistics courses
videos – informative, clear, simple, visual, transcripts, route maps
notebooks – maths in depth, weekly exercises, animations
tutorials – twice per week, use of new/complementary slides and explanations, writing skills
teaching staff – accessible, helpful
speech.zone – generally preferred over Learn, well-organised, easy to search