Forum Replies Created
In general, females have shorter vocal tracts than males, and therefore higher formant frequencies. So iii. is true.
Harmonics are at multiples of F0. Since female speech generally has a higher F0 than male speech, the harmonics will be at multiples of a higher F0 = more widely spaced. So i. is also true.
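The harmonic-spacing point can be sketched numerically. The F0 values below are illustrative round numbers, not measurements:

```python
# Sketch: harmonics lie at integer multiples of F0, so a higher F0
# means more widely spaced harmonics. F0 values are illustrative only.
def harmonics(f0, fmax=1000):
    """Return the harmonic frequencies of f0 up to fmax (Hz)."""
    return [k * f0 for k in range(1, int(fmax // f0) + 1)]

lower_f0 = harmonics(120)    # illustrative lower F0 (Hz)
higher_f0 = harmonics(220)   # illustrative higher F0 (Hz)

print(lower_f0[:4])    # [120, 240, 360, 480]
print(higher_f0[:4])   # [220, 440, 660, 880]
# The spacing between adjacent harmonics equals F0 itself:
print(lower_f0[1] - lower_f0[0], higher_f0[1] - higher_f0[0])  # 120 220
```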
The signal in Ladefoged Fig 6.2 could be generated by passing an impulse train with a fundamental frequency of 200 Hz through a filter which only passes frequencies in the range 1800 Hz to 2200 Hz. For example, a filter with a single resonance at 2000 Hz and a narrow bandwidth.
The harmonics in the filtered signal are still at integer multiples of the fundamental. The fundamental frequency of the filtered signal is still 200 Hz even though there is no harmonic at that frequency.
The filter cannot change the fundamental frequency. It can only modify the spectral envelope = it can only change the amplitudes of harmonics, not their frequencies.
One interesting consequence of this is that we perceive such signals as having a pitch equal to their fundamental frequency, even if there is no energy at that frequency. Our perception of pitch is based not simply on identifying the fundamental, but on the harmonic structure.
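A minimal sketch of the missing-fundamental idea, treating the signal as a list of harmonic frequencies and assuming an idealised band-pass filter:

```python
from math import gcd
from functools import reduce

f0 = 200                                    # fundamental of the impulse train (Hz)
harmonics = [k * f0 for k in range(1, 21)]  # 200, 400, ..., 4000 Hz

# An idealised band-pass filter passing only 1800-2200 Hz:
passed = [f for f in harmonics if 1800 <= f <= 2200]
print(passed)  # [1800, 2000, 2200]

# The fundamental of the filtered signal is the greatest common divisor
# of the remaining harmonics - still 200 Hz, even though there is no
# energy at 200 Hz itself. The filter changed amplitudes, not frequencies.
print(reduce(gcd, passed))  # 200
```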
Yes, you are correct: “topology” just means the shape of the Hidden Markov Model (HMM) = how many states it has and what transitions between them are possible.
For modelling speech, a left-to-right topology is the correct choice. Speech does not time-reverse, the phones in a word must appear in the correct order, etc.
For speech, we do not generally use “parallel path” HMMs, which have transitions that allow some states to be skipped. We use strictly left-to-right models in which the only valid paths pass through all the emitting states in order.
The only exception to this might be an HMM for noise or silence in which we might add some other transitions, or connect all emitting states with all other emitting states with transitions in both directions to make an ergodic HMM.
So, in the general case, an HMM could have transitions between any pair of states, including self-transitions. That’s why, when we derive algorithms for doing computations with HMMs, we must consider all possible transitions and not restrict ourselves to a left-to-right topology.
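To make the topology idea concrete, here is a sketch of transition matrices for a 3-state left-to-right model and a 3-state ergodic model. The probability values are invented for illustration:

```python
# Entry A[i][j] is P(next state = j | current state = i).
# All probability values below are invented for illustration.

left_to_right = [
    [0.6, 0.4, 0.0],   # state 0: self-transition, or move to state 1
    [0.0, 0.7, 0.3],   # state 1: self-transition, or move to state 2
    [0.0, 0.0, 1.0],   # state 2: self-transition (exit handled elsewhere)
]

ergodic = [
    [0.4, 0.3, 0.3],   # every state can reach every other state,
    [0.3, 0.4, 0.3],   # in both directions
    [0.3, 0.3, 0.4],
]

# In both topologies, each row is a probability distribution:
for A in (left_to_right, ergodic):
    assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)

# Left-to-right: no transition goes backwards (column index < row index):
assert all(left_to_right[i][j] == 0.0
           for i in range(3) for j in range(3) if j < i)
print("both topologies valid")
```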
In the simple grammar used in the assignment, we assume there is an equal (= uniform) probability of each digit. For the sequences part of the assignment, we also assume all sequences have equal probability.
But in the more general case of connected speech recognition, we will learn the prior P(W) from data. Usually that involves learning (= training) an N-gram language model from a corpus of text: the details of learning an N-gram are out-of-scope for Speech Processing, but you do need to understand that such a model is finite state and what a trained model looks like.
So, the answer to “how do we know the probability of a word sequence before we observe any acoustic evidence (= speech) ?” is that we pre-calculate and store it: that’s the language model. In the general case of an N-gram, we use data to estimate the probability of every possible N-gram in the language by counting its frequencies in a text corpus.
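The counting idea can be sketched for a toy bigram model. The corpus below is made up and tiny; a real language model is trained on a far larger corpus and needs smoothing, which is out of scope here:

```python
from collections import Counter

# Toy corpus (invented) - in practice this would be a large text corpus.
corpus = "one two one two three one two".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # count each adjacent word pair
unigrams = Counter(corpus[:-1])             # count each bigram's first word

def p_bigram(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1) from the counts."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("two", "one"))    # 1.0: "one" is always followed by "two"
print(p_bigram("three", "two"))  # 0.5: "two" is followed by "three" half the time
```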
Our prior belief about W is P(W). When we receive the acoustic evidence O, we compute the likelihood P(O|W). We then revise (= update) our belief about W in the light of this new evidence, by multiplying the likelihood and the prior, to get P(W|O). [Ignoring P(O).]
P(W|O) is the posterior: it’s what we believe about the distribution of W given (= after receiving) the acoustic evidence O.
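The prior-to-posterior update can be sketched for a toy digit recogniser with a uniform prior. The likelihood values are invented:

```python
# Sketch: Bayes' rule for a toy digit recogniser.
# The likelihood values P(O|W) below are invented for illustration.

digits = ["one", "two", "three"]
prior = {w: 1 / len(digits) for w in digits}               # P(W): uniform
likelihood = {"one": 0.010, "two": 0.002, "three": 0.001}  # P(O|W), invented

# Posterior P(W|O) = P(O|W) P(W) / P(O), where P(O) normalises.
unnorm = {w: likelihood[w] * prior[w] for w in digits}
p_o = sum(unnorm.values())
posterior = {w: unnorm[w] / p_o for w in digits}

print(max(posterior, key=posterior.get))  # "one" has the highest posterior
# The posterior is a proper distribution over W (sums to 1):
print(round(sum(posterior.values()), 6))
```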
The three states allow the model to capture sound changes within a phone – the “beginning”, “middle” and “end”. Any number of emitting states could be used, but 3 has been found to work well and is – as you correctly state – the most common choice.
Look in the VMware settings for “Network Adapter” and make sure it’s enabled. Play around with the settings there to see if that helps – there are different options for how the VM gets an internet connection from the host computer.
There are various troubleshooting guides on the VMware website, depending on what OS your host computer is running.
In general, we only need transcriptions without time alignments to train HMMs, including for monophone models. The method for training models in such a situation is known as “embedded training” but this is slightly beyond the scope of the course.
But in the Digit Recogniser assignment – and in the theory part of the Speech Processing course – we are using a simpler method for training HMMs which does require the data to be labelled with the start and end times of each model (which, in the Digit Recogniser assignment, are word models).
You need to rsync the files (as per the start of the TTS assignment, which was in Tutorial B of Module 3).
For each training utterance, there will be a word-level transcription. A phone-level transcription is needed in order to determine the phone models to join together to make an utterance model. The phone transcription might be created simply by looking each word up in the dictionary and replacing it with its phone sequence.
These transcriptions do not need to have any timing information though – they are just sequences of words or phones.
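The dictionary look-up can be sketched with a toy lexicon; the pronunciations below are illustrative, not taken from a real dictionary:

```python
# Sketch: deriving a phone-level transcription by dictionary look-up.
# This toy lexicon is invented for illustration.
lexicon = {
    "one":   ["w", "ah", "n"],
    "two":   ["t", "uw"],
    "three": ["th", "r", "iy"],
}

def word_to_phones(transcription):
    """Replace each word with its phone sequence (no timing information)."""
    phones = []
    for word in transcription.split():
        phones.extend(lexicon[word])
    return phones

print(word_to_phones("two one"))  # ['t', 'uw', 'w', 'ah', 'n']
```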
Yes – there is only one model per phone (in the case of monophone models).
“Robust” means that WER is generally low across a variety of conditions. A standard, effective method for training a robust ASR system is simply to train on diverse data.
You need to answer the second part of your question by conducting experiments of your own design, which presumably will include training at least one model on diverse data and another model on less-diverse data, and a number of test sets of your choosing.
Your point about the mean is reasonable, and perhaps suggests that a model trained on diverse data will give a higher WER than one trained on less-diverse data provided the less-diverse-data model matches the test set. But don’t forget the variances…
On average, perhaps the diverse-data-trained model will give a lower WER across a range of test sets, whereas the non-diverse-data-trained model gives a high WER on all mismatched test sets and only a low WER on the matched test set? It’s up to you to find out!
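To score such experiments you need WER, which is word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A self-contained sketch:

```python
# Sketch: Word Error Rate via dynamic-programming edit distance over words.

def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("one two three", "one three"))  # 1 deletion / 3 words
```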
Yes, that’s correct – shared across all occurrences of that phone in the entire training data. We train one phone model on all available examples.
The implementation of this involves performing the E step for all data, “accumulating” (i.e., adding up) the necessary statistics (in fact we “accumulate” all the numerators and denominators of the M step equations, but don’t yet divide one by the other). Once all of that has been accumulated across the training data, the M step updates the model’s parameters. There is one numerator accumulator and one denominator accumulator for every individual model parameter (e.g., for the mean of each and every Gaussian).
(This implementation detail is not examinable for Speech Processing.)
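Although not examinable, the accumulator idea can be sketched for a single Gaussian mean in one dimension. The state occupation probabilities (gammas) and observations below are invented, and this is only the mean update, not full Baum-Welch:

```python
# Sketch: numerator/denominator accumulators for one Gaussian mean.
# The M-step numerator is sum_t(gamma_t * x_t) and the denominator is
# sum_t(gamma_t), accumulated over ALL training data before dividing.

# Invented state occupation probabilities and 1-D observations,
# pooled across several utterances:
gammas = [0.9, 0.5, 0.1, 0.8]
observations = [1.0, 2.0, 3.0, 1.5]

num_acc = 0.0   # one numerator accumulator for this mean
den_acc = 0.0   # one denominator accumulator for this mean
for g, x in zip(gammas, observations):  # accumulate statistics (E step)
    num_acc += g * x
    den_acc += g

new_mean = num_acc / den_acc            # divide once, at the end (M step)
print(new_mean)  # an occupancy-weighted average of the observations
```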
The “set -x” command in the shell scripts causes all commands to be printed (with a leading “+”) just before being executed. It’s useful for debugging, because you can see the full command line with all variable values.
Simply comment it out.
December 5, 2020 at 22:37, in reply to: Creating independent features by rotating the feature space (#13454)
Yes … but you would need to introduce an extra parameter to model that rotation. Since this is just for one Gaussian (e.g., for one HMM state), that extra parameter would only correct the rotation for its distribution. You would need to introduce an extra “rotation” parameter for every Gaussian, because it would (in general) be different in each case.
In fact, that’s exactly what covariance does. For a 2-dimensional distribution, there is just one covariance parameter (remember that the covariance matrix is symmetric), and you can think of it as that rotation parameter.
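This can be checked algebraically: rotating a diagonal (zero-covariance) 2-D covariance matrix Sigma by a rotation matrix R gives R Sigma R^T, whose single off-diagonal term is the covariance. A sketch with invented variances and angle:

```python
import math

# Sketch: rotating an axis-aligned 2-D Gaussian introduces a non-zero
# covariance term. For diagonal Sigma = diag(var_x, var_y) and rotation
# by theta, R Sigma R^T works out to the closed forms below.
# The variances and angle are invented for illustration.

def rotate_cov(var_x, var_y, theta):
    c, s = math.cos(theta), math.sin(theta)
    new_var_x = var_x * c * c + var_y * s * s
    new_var_y = var_x * s * s + var_y * c * c
    cov_xy = (var_x - var_y) * s * c   # the off-diagonal term of R Sigma R^T
    return new_var_x, new_var_y, cov_xy

vx, vy, cxy = rotate_cov(4.0, 1.0, math.pi / 6)
print(cxy)  # non-zero: the one covariance parameter encodes the rotation

# With no rotation, the covariance stays zero:
print(rotate_cov(4.0, 1.0, 0.0)[2])  # 0.0
```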
Because there is only one template (exemplar) and that doesn’t capture the natural variations in duration – it’s just one way that the word (for example) could be said.