Forum Replies Created
The language model emits a sequence of words.
The pronunciation model emits a sequence of phonemes, given a word.
The acoustic model of a phoneme emits a sequence of observations (e.g., MFCC feature vectors).
In order to compile these together into a recognition graph, all of them must be finite state.
The most common form of language model is an N-gram. This is a finite state model that emits a word sequence. The probability of a given word sequence is the product of the probability of each word in the sequence. Those word probabilities are computed given only the N-1 preceding words.
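As a toy sketch of that product-of-probabilities computation (the vocabulary and all probability values below are invented purely for illustration), a bigram model (N = 2) conditions each word on only the one preceding word:

```python
from functools import reduce

# Hypothetical bigram probabilities, invented for illustration only.
bigram = {
    ("<s>", "call"): 0.2,
    ("call", "home"): 0.5,
    ("home", "</s>"): 0.4,
}

def sequence_probability(words, model):
    """P(w_1..w_n) = product over i of P(w_i | w_{i-1}), with sentence markers."""
    pairs = zip(["<s>"] + words, words + ["</s>"])
    return reduce(lambda p, pair: p * model.get(pair, 0.0), pairs, 1.0)

print(sequence_probability(["call", "home"], bigram))  # 0.2 * 0.5 * 0.4 = 0.04
```

Any word sequence containing a word pair not in the model gets probability zero here; a real N-gram model would smooth these probabilities instead.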
Another form of finite state language model is the hand-crafted grammar used in the digit recogniser exercise.
Young et al. call token passing a “conceptual model” by which they mean a way to think about dynamic programming as well as to implement it.
The two methods in Figures 2 and 3 of the paper are equivalent – they are both computing the most likely alignment (for DTW) or state sequence (for HMMs).
The power and generality of token passing comes into play when we construct more complex models with what the paper calls a “grammar” but we could more generally call a language model. Implementing the algorithm on a matrix (we could also call this a “grid”) is just fine for an isolated word model, but quickly becomes very complicated and messy for connected words with an arbitrary language model. In contrast, token passing extends trivially to any language model, provided it is finite state (e.g., an N-gram, or the hand-crafted grammar from the digit recogniser exercise).
In Unix (and Linux) type operating systems, when a process finishes, it exits with a status. This is 0 for success and some other number if there was an error. This status is available immediately afterwards in the shell variable
$?

and here is how you might test that:

HInit -T 0 \
    -G ESPS \
    -m 1 \
    -C resources/CONFIG \
    -l $WORD \
    -M models/hmm0 \
    -o $WORD \
    -L ${DATA}/lab/train \
    -S ${SCRIPT_FILE} \
    models/proto/$PROTO

if [ $? -ne 0 ]
then
    echo "HInit failed for script file "${SCRIPT_FILE}
    exit 1
fi

The command

exit 1

means that this script will immediately exit with status 1, which you could detect in the script that called it – so, you can propagate the error back up to the top-level script (e.g., do_experiments).

grammar, word_list and the dictionary are the only things you need to change (I suggest making copies of them in different files, so you don’t break your isolated digit experimental setup). From grammar, you need to use HParse to create the finite state model, which is what HVite requires.

user_test... doesn’t look like a valid file name – shouldn’t there be a username there? Are there any files in your rec directory? Did you get any earlier errors?

The bank of filters is one stage in extracting useful features from a frame of speech waveform. The diagram you show is a simple bank of rectangular filters, which would work, but in practice we typically use triangular overlapping filters.
The energy in each band is one feature. It is the total amount of energy in that frequency band, for that frame of speech. The number of filters determines how many features we extract (typically 20-30).
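Here is a minimal sketch of that idea (my own toy example, not HTK’s implementation – the spectrum values and filter placement are invented): each triangular filter weights the bins of a power spectrum, and the weighted sum is one feature.

```python
def triangular_filter(centre, width, n_bins):
    """Triangular weights peaking at bin `centre`, reaching zero `width` bins away."""
    return [max(0.0, 1.0 - abs(b - centre) / width) for b in range(n_bins)]

def filterbank_energies(power_spectrum, centres, width):
    """One feature per filter: the energy in that band, for this frame."""
    feats = []
    for c in centres:
        weights = triangular_filter(c, width, len(power_spectrum))
        feats.append(sum(w * p for w, p in zip(weights, power_spectrum)))
    return feats

spectrum = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]  # toy power spectrum, one frame
print(filterbank_energies(spectrum, centres=[2, 4, 6], width=2))  # [9.0, 9.0, 0.5]
```

In practice the filter centres would be spaced on a mel scale, and there would be 20–30 of them, but the principle is the same.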
HResults uses dynamic programming to align the recognition output and the reference transcription. So, WER is simply the edit distance between recognition output and reference transcription.

There are three possible types of error: substitutions, insertions and deletions. WER is just the sum of those three, divided by the number of words in the reference transcription, and expressed as a percentage.
For the special case of isolated words, the only possible type of error is a substitution error, and the dynamic programming is not really needed.
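A sketch of how that computation could look (my own minimal implementation, not HResults itself): dynamic-programming edit distance between hypothesis and reference, then (S + I + D) / N as a percentage.

```python
def wer(reference, hypothesis):
    """Word error rate, as a percentage of the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("one two three four", "one too three"))  # 1 sub + 1 del over 4 words = 50.0
```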
Note that HResults reports “Accuracy (Acc)”, but you should only use WER (100 – Acc) in your report.

Ignore the value of “Correct (Corr)” reported by HResults – this does not account for insertion errors and is not a measure used anymore.

Some LaTeX classes (depending on which .cls file you are using) have

\paragraph{Heading} Text starts here...

as a lower-level heading. You could use that, probably un-numbered (so use \paragraph*{Heading} if your class file numbers paragraphs).

Summary of positive points – these are things we will keep doing:
course structure and content – modules, interactive tutorials and live classes, PHON material, faster pacing than typical linguistics courses
videos – informative, clear, simple, visual, transcripts, route maps
notebooks – maths in depth, weekly exercises, animations
tutorials – twice per week, use of new/complementary slides and explanations, writing skills
teaching staff – accessible, helpful
speech.zone – generally preferred over Learn, well-organised, easy to search
Are you sure it is actually suspending and not just locking the screen?
Try running this simple process in the terminal that will let you see if the screen is locked (and still running) or actually suspended:
while true; do date; sleep 5; done
Wait 5 minutes to let the screen lock. Unlock it. If the machine had suspended, you will see a gap in the times printed. For me, this shows that the machine keeps running when the screen is locked.
You can change the time before the screen locks, if you like.
There is just less correlation between cepstral coefficients than between filterbank outputs. So much less that we assume there is none (or at least, none worth modelling)!
In reality, there is of course some remaining correlation. In an advanced Automatic Speech Recognition course we would look at ways to model covariance without having a full covariance matrix for each and every Gaussian (because that would be too many parameters). We might share covariances between Gaussians, or do clever things with the diagonal vs. off-diagonal entries. But all of that is well beyond the scope of the Speech Processing course.
Before commencing Viterbi training, the model must have some parameters. These could come from uniform segmentation, for example. These parameters will not be optimal: they will not be the parameters that maximise the probability of the training data.
For that initial model, we use the Viterbi algorithm to find an alignment between model and training data. This alignment is the best possible one, given the current model’s parameters (which, remember, are not optimal at this stage). Because the model is not yet optimal, this alignment will not necessarily be the best either.
This alignment is used to update the model parameters. The model is now better: it will generate the training data with a higher probability than with the previous model parameters.
Because the model is better, it will now be able to find a more probable alignment with the training data than the previous model. So we re-align the data, then use this new alignment to update the model parameters.
This can be repeated (iterated) a number of times. We stop when the model no longer improves, as measured by the probability of it generating the training data.
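The loop above can be sketched with a deliberately tiny stand-in for a real HMM (my own toy example, not HTK’s code): a 2-state left-to-right “model” whose states each emit 1-D values around a mean. Alignment means choosing the state boundary that best fits the current means; the update re-estimates each mean from its aligned frames. Each iteration fits the data at least as well as the last.

```python
data = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]  # invented 1-D "observations"

def sse(segment, mean):
    """Sum of squared errors: lower means the segment fits the state better."""
    return sum((x - mean) ** 2 for x in segment)

def align(data, means):
    """The alignment step: the best boundary, given the current (non-optimal) means."""
    return min(range(1, len(data)),
               key=lambda b: sse(data[:b], means[0]) + sse(data[b:], means[1]))

def update(data, boundary):
    """The update step: re-estimate each state's mean from its aligned frames."""
    left, right = data[:boundary], data[boundary:]
    return [sum(left) / len(left), sum(right) / len(right)]

means = [data[2], data[3]]  # crude initial parameters, like uniform segmentation
for _ in range(5):          # re-align, then update, repeatedly
    boundary = align(data, means)
    means = update(data, boundary)

print(boundary, [round(m, 2) for m in means])  # converges to 3 [1.0, 5.0]
```

A real implementation would also track the total fit (here, the SSE) and stop when it no longer improves, rather than running a fixed number of iterations.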
The vertical axis on the cepstrum is the value of the cepstral coefficients. It could be labelled “amplitude” or “magnitude”. Sometimes we use the absolute magnitude for the purposes of plotting so that everything is positive.
We need to clearly separate the model from the algorithm.
The HMM is the model. It has no memory other than the state, which has to encapsulate all information required to do computation (e.g., to generate an observation).
There are various algorithms available to perform computations with this model. These all take advantage of the memoryless nature of the model in order to simplify that computation.
The Viterbi algorithm computes the single most likely state sequence (= path through the model) to have generated a given observation sequence. The key step in the Viterbi algorithm is to compare all the paths arriving at a particular state at a particular time and keep only the most probable. This is possible because we do not need to know anything more about those paths other than the fact they are all in the same state.
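That key step can be shown in a few lines (a toy model of my own invention – the transition and emission scores are made-up numbers, in the log domain to avoid underflow):

```python
import math

NEG_INF = float("-inf")

# Hypothetical left-to-right model with 2 emitting states:
# log transition probabilities (no path from state 1 back to state 0).
log_trans = [[math.log(0.7), math.log(0.3)],
             [NEG_INF,       math.log(1.0)]]

# Hypothetical log emission scores for 3 observation frames.
log_emit = [[-1.0, -5.0],   # frame 0: state 0 fits well
            [-2.0, -2.0],
            [-6.0, -1.0]]   # frame 2: state 1 fits well

best = [log_emit[0][0], NEG_INF]  # all paths must start in state 0

for t in range(1, len(log_emit)):
    # The key step: for each state j, compare every path arriving from a
    # predecessor state i and keep only the most probable one.
    best = [max(best[i] + log_trans[i][j] for i in range(2)) + log_emit[t][j]
            for j in range(2)]

print(round(max(best), 3))  # log probability of the single best path: -5.204
```

Because only the best score per state survives each time step, the cost is linear in the number of frames, not exponential in the number of possible state sequences.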
The Expectation (E) step computes all the statistics necessary, such as state occupancy probabilities. A simple way to think of this is averaging. This is where all the computation happens because we are considering all possible state sequences.
The Maximisation (M) step updates the model parameters, using those statistics, to maximise the probability of the training data. This is a very simple equation requiring very little computation.
In theoretical explanations of Expectation-Maximisation, which for HMMs is called the Baum-Welch algorithm, the E step is typically defined as computing the statistics. For example, we compute the state occupancy probabilities so that the M step can use them as the weights in a weighted sum of the observations (i.e,, the training data).
In a practical implementation of the E step, we actually compute the weighted sum as we go. We create a simple data structure called an accumulator which has two parts: one to sum up the numerator and the other to sum up the denominator, of each M step equation (e.g., for the mean of a particular Gaussian, Jurafsky and Martin 2nd edition equation 9.35 in Section 9.4.2). There will be one accumulator for each model parameter. The M step is then simply to divide numerator by denominator and update the model parameter (and to do that for each model parameter).
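A sketch of the accumulator idea for one Gaussian’s mean (my own toy numbers, not HTK’s code – the state-occupancy probabilities and observations are invented):

```python
# Hypothetical state-occupancy probabilities (gamma) and 1-D observations,
# as would be produced during the E step for one state over three frames.
gammas = [0.9, 0.5, 0.1]
observations = [2.0, 4.0, 10.0]

accumulator = {"num": 0.0, "den": 0.0}
for gamma, x in zip(gammas, observations):
    accumulator["num"] += gamma * x   # E step: accumulate occupancy-weighted sum
    accumulator["den"] += gamma       # E step: accumulate total occupancy

# M step: a single division per model parameter.
new_mean = accumulator["num"] / accumulator["den"]
print(round(new_mean, 3))  # (0.9*2 + 0.5*4 + 0.1*10) / 1.5 = 4.8 / 1.5 = 3.2
```

All the expense is in computing the gammas; once they are accumulated, the update itself is trivial, which is why the E step dominates the computation.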
For Speech Processing, aim for a conceptual understanding of Expectation-Maximisation. You would not need to reproduce these equations under exam conditions.
Now that I’ve mentioned accumulators, here is something right at the edge of the course coverage:
For large-vocabulary connected speech recognition, we model phonemes. If there is enough data, we can make these models context-dependent and thus get a lower WER. The context is typically the left and right phoneme, and such models are called triphones. We need to share parameters between clusters of model parameters because there won’t be enough (or sometimes any) training examples for certain triphones. This is called “tying”. It turns out to be very easy to implement training for tied triphones: we just make a single accumulator for each cluster of shared model parameters.