Forum Replies Created
With only 500 sentences, you could record them all (assuming this comes to about the same size as Arctic A, so you could make a comparison). So, text selection would not be necessary. However, you should still measure the coverage of your text, and compare that to Arctic A (measured by you in the same way).
You need to decide on what measure(s) of coverage to use, and justify these in your report. Phonemes would be one possibility (a missing phoneme implies a large number of missing diphones – this would be bad), diphones would another. For these, you can measure the number of types covered, and express this as a percentage of the theoretical total number of types possible. Read the technical report on Arctic for more details.
But you can think of additional measures too, such as the number of questions vs statements (if that’s relevant to your domain), coverage of domain-specific vocabulary, diphones-in-context, prosodic coverage, etc. Be creative! It’s fine to report multiple measures, provided that you justify each one, and perhaps also give a critique of which you think is most useful.
Note: in general, I’d expect most postgrad students to find a large corpus and then write a text selection algorithm, whilst undergrads can select the text in an informal or manual way.
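To make the type-coverage measure concrete, here is a minimal sketch of counting diphone types and expressing them as a percentage of the theoretical total. The phone transcriptions and phone-set size below are made-up stand-ins; in practice you would get transcriptions from the front end (e.g. Festival) for your own text.

```python
# Sketch: measuring diphone coverage of a recording script.
# The transcriptions and PHONE_SET_SIZE are hypothetical placeholders.

sentences = [
    ["h", "@", "l", "ou"],       # made-up phones for "hello"
    ["g", "u", "d", "b", "ai"],  # made-up phones for "goodbye"
]

PHONE_SET_SIZE = 44              # size of your phone inventory (assumed)

diphone_types = set()
for phones in sentences:
    diphone_types.update(zip(phones, phones[1:]))  # adjacent phone pairs

# Upper bound on the number of diphone types; phonotactics makes many
# of these impossible, which you could try to correct for in your report.
theoretical_total = PHONE_SET_SIZE ** 2
coverage = 100 * len(diphone_types) / theoretical_total
print(f"{len(diphone_types)} diphone types = {coverage:.2f}% of {theoretical_total}")
```

The same counting approach works for phoneme coverage (single phones instead of pairs) or any other discrete unit you choose to measure.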
The Speech Signal Modelling video now has subtitles and a transcript.
Let me know if this helps, and post follow-up questions here.
Correct – the variance would be zero. That’s a serious problem, because such a model will assign zero probability mass to everywhere except the exact positions of the data points. Zero variance is also numerically impossible: we cannot compute with such a model.
But overfitting will probably occur long before we get to the point where there are as many mixture components as there are data points. It will happen as soon as the model starts to assign too much probability mass in the small regions around the observed data points and not enough mass to as-yet-unseen values that may occur in the test set.
The problems of small (including zero) variances in a model can be mitigated by setting a variance floor (e.g., not allowing the variance of any mixture component to go below 1% of the variance of the data as a whole). Using a variance floor is good practice because it avoids the numerical problems of very small (or zero) variances, and offers a partial solution to overfitting.
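As a minimal sketch of applying such a floor (using the 1%-of-global-variance example above; the data and component variances here are made up):

```python
import numpy as np

# Sketch: applying a variance floor after a GMM M-step.
# Toy 1-D training data; the per-component variances are invented,
# with one component collapsed onto (near-)identical data points.

data = np.array([1.2, 1.2, 3.5, 3.6, 7.0])
floor = 0.01 * np.var(data)              # 1% of the global data variance

component_vars = np.array([0.0, 0.3, 1.1])

floored_vars = np.maximum(component_vars, floor)
print(floored_vars)                      # the zero variance is raised to the floor
```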
For reference, here are the standard backoff rules for the Edinburgh accent. There is no rule to allow ii to back off to schwa (@) – I’m not sure why that is (but Korin will).

(set! unilex-edi-backoff_rules
  '((l! l) (n! n) (eir e) (iii ii) (n @) (aa @) (ae @) (i @)
    (irr @) (iii @) (ei @) (er @) (a @) (eir @) (uw @) (@@r @)
    (e @) (oo @) (our @) (ow @) (o @) (uh @) (u @) (urr @)
    (uuu @) (i@ @) (ur @) (hw w) (s z) (_ #)))
If there are as many Gaussian mixture components in a GMM as there are data points, then we would expect each component to model a single data point. The mean of each component would be equal to the value of the corresponding data point.
What will the variance of each mixture component be?
You’re right that Festival varies the pronunciation of “the” in this way. In principle, you can modify the back-off rules so that when dh_ii is not found then dh_@ is selected. Making such modifications is a little beyond the scope of this assignment, but if you really want to try, then look up the function
du_voice.setDiphoneBackoff
and ask Korin (who wrote this part of Festival) for help 🙂
It’s a reasonable idea: higher sampling rates will sound better. But, unfortunately you are limited to 16kHz for this assignment.
Whenever you see the error message “cannot execute binary file” referring to a file that should never be executed, you need to look for a place where that file is the first thing on a line in the shell script.
In your case, this is within the backticks on line 11. Backticks create a new shell and the contents are passed to this new shell to be executed. In your case, the contents will be a list of wav files.
You need either:
for F in `ls ${RECDIR}/*.wav`
or
for F in ${RECDIR}/*.wav
or the fancier
for F in `find ${RECDIR} -name "*.wav"`
It’s fine to say that a constriction in the vocal tract is a source of sound, in the same way that we say the vocal folds are a source of sound.
Neither of these parts of the anatomy actually creates sound itself. They do so by changing the airflow. The vocal folds interrupt the airflow periodically. A constriction (if narrow enough) will create turbulent airflow.
If we were being super-strict with the wording, perhaps we might say that these are the locations of sound sources.
Voiced fricatives have two sound sources. The clue is in the name:
voiced = the vocal folds are vibrating.
fricative = there is turbulent airflow caused by a constriction somewhere in the vocal tract.
If we want to synthesise such a sound using a vocoder, we will need what is called “mixed excitation”, in other words, a mixture of periodic and aperiodic sources. Some very simple vocoders cannot do this, because they switch between the two sources and can’t mix them together.
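A toy illustration of mixed excitation: sum a periodic pulse train (voicing) with white noise (frication). The sample rate, F0, duration, and mixing weights below are arbitrary choices for the sketch, not values from any particular vocoder.

```python
import numpy as np

# Toy "mixed excitation": weighted sum of a pulse train and noise.
# All parameter values here are arbitrary illustrations.

fs = 16000                   # sample rate (Hz)
f0 = 120                     # fundamental frequency (Hz)
n = fs // 10                 # 100 ms of excitation

pulses = np.zeros(n)
pulses[::fs // f0] = 1.0     # one impulse per pitch period

rng = np.random.default_rng(0)
noise = rng.standard_normal(n)

voicing_weight = 0.7         # how "voiced" the mixture is
excitation = voicing_weight * pulses + (1 - voicing_weight) * 0.05 * noise
print(excitation.shape)
```

A switch-based vocoder would instead set `voicing_weight` to exactly 0 or 1 per frame, which is what makes voiced fricatives hard for it.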
Correct. EM does not guarantee “to find the maximum likelihood parameter settings given the training data” – it can only increase (or at least not decrease) the likelihood at each iteration, stopping when it reaches a local maximum.
The Euclidean distance metric is effectively the same as a Gaussian with a constant variance.
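To see why: for a Gaussian with fixed variance v, the negative log density is ||x − μ||²/(2v) plus a constant, so ranking candidate means by likelihood gives the same order as ranking them by Euclidean distance. A small sketch (all numbers made up):

```python
import numpy as np

# Sketch: Euclidean distance vs a constant-variance Gaussian.
# -log p(x) = ||x - mu||^2 / (2v) + const, so the rankings agree.

def neg_log_gauss(x, mu, var):
    d = x - mu
    return float(d @ d) / (2 * var) + 0.5 * len(x) * np.log(2 * np.pi * var)

x = np.array([1.0, 2.0])
means = [np.array([0.0, 0.0]), np.array([1.0, 1.5]), np.array([4.0, 4.0])]
var = 1.0   # the SAME variance for every candidate

by_distance = min(range(3), key=lambda i: float(np.sum((x - means[i]) ** 2)))
by_likelihood = min(range(3), key=lambda i: neg_log_gauss(x, means[i], var))
print(by_distance == by_likelihood)   # True: the rankings agree
```

The equivalence breaks as soon as each candidate has its own variance, which is exactly what a Gaussian buys you over plain Euclidean distance.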
Even EM offers no guarantee to find the model parameters that maximise the likelihood of the training data (stated as “the maximum likelihood parameter settings” in the question). It can only find a local maximum, for the reasons explained in this topic. So, b. is untrue.
Yes, iv. is not true – how could it be, when we might not know anything about the test set whilst training the model?
Let’s consider the other options:
i. says that EM will find the best possible model. But we also know that it’s just an iterative “hill climbing” algorithm that stops when it cannot climb any higher (“height” means likelihood of the training data). EM is also sensitive to the starting position: we may get a different final model if we start from a different initial model. These two facts tell us that it cannot guarantee to maximise the likelihood of the training data – the best it can do is find a local maximum.
ii. we’ve already seen that this is true: hill-climbing will never take us downhill.
iii. this is true by definition: EM updates all model parameters in each M step.
This leads us to the correct answer of c.
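You can verify the hill-climbing behaviour empirically. Here is a minimal sketch of EM for a 1-D two-component GMM, checking that the training-set log likelihood never decreases from one iteration to the next; the data, initialisation, and iteration count are arbitrary choices.

```python
import numpy as np

# Sketch: EM for a 1-D, 2-component GMM. Data and init are made up.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

w = np.array([0.5, 0.5])          # mixture weights
mu = np.array([-1.0, 1.0])        # component means
var = np.array([1.0, 1.0])        # component variances

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

lls = []
for _ in range(30):
    # E step: per-component responsibilities for every data point
    p = w * gauss(x[:, None], mu, var)          # shape (N, 2)
    lls.append(np.log(p.sum(axis=1)).sum())
    r = p / p.sum(axis=1, keepdims=True)
    # M step: update ALL the parameters
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

# log likelihood is non-decreasing across iterations
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
print(f"log likelihood: {lls[0]:.1f} -> {lls[-1]:.1f}")
```

Re-running with a different initial `mu` can converge to a different final model, which is the sensitivity to starting position mentioned above.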
You’re correct to rule out c. – this would make dynamic programming inapplicable. The same goes for d., which even requires knowledge of the future state sequence!
If b. were true, then what would happen when two tokens (in Token Passing) meet in a particular state? What if they had different previous states (which of course they always will)?
The correct answer is a. – this is in fact a statement of the Markov property of the model.
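Here is a toy sketch of the reasoning: when two tokens meet in the same state at the same time, the Markov property means the cost of any future path depends only on the current state, not on the history, so the more expensive token can never catch up and is safely discarded. All numbers below are made up.

```python
# Toy sketch of why Token Passing may discard the worse token.

# Two tokens arrive in state "C" at the same time step, with
# different histories and accumulated costs:
tokens_in_C = [
    {"history": ["A", "C"], "cost": 7.0},
    {"history": ["B", "C"], "cost": 5.0},
]

# Token passing keeps only the cheapest token in each state:
best = min(tokens_in_C, key=lambda t: t["cost"])

# Extending either token to any next state adds the SAME future cost
# (it depends only on being in "C"), so their ordering cannot change:
future_cost = 3.0   # e.g. cost of the transition C -> D
extended = [t["cost"] + future_cost for t in tokens_in_C]
assert (extended[0] < extended[1]) == (tokens_in_C[0]["cost"] < tokens_in_C[1]["cost"])

print(best["history"], best["cost"])   # ['B', 'C'] 5.0
```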