Forum Replies Created
You probably used the wrong voice, or wrong dictionary, to create your labels. For example, you ran some steps with one dictionary, and other steps with a different dictionary.
This will probably be because of Apple’s over-strict security settings. The files are not damaged. Try downloading them in a browser other than Safari, or on a different computer (not in the lab).
March 3, 2018 at 10:57 in reply to: Amount of source text to start with, for my text selection algorithm #9131

If you can only find 550 in-domain sentences, then you’re going to have to record pretty much all of them. So, no point doing text selection, but you should still measure the coverage.
You then propose to experiment with text selection using a bigger set of source data. That’s a good idea – just what I recommend in the previous answer above. You can measure coverage, demonstrate that your algorithm works, and so on – all good material for your report. But you perhaps don’t need to record that dataset, unless you really enjoy recording.
In general, I’d expect you to record Arctic A plus additional data of your own design amounting to about the same size as Arctic A. I think you’re proposing a third set of the same size again. If you’re efficient at recording (i.e., you get almost all sentences right in one ‘take’), and the time taken to get this data ready for voice building (e.g., sanity checking, choosing the best ‘take’) is not too much, then you could do it. But it’s definitely not essential.
March 3, 2018 at 10:09 in reply to: Amount of source text to start with, for my text selection algorithm #9129

Your methodology is good: find as much domain-specific material as possible, and then use an algorithm to select the subset with best coverage.
You suggest that, because you are starting with a small amount of source text, you should select a smaller subset to record.
Actually, I would recommend selecting a subset that is the same size as Arctic A (you decide how to measure ‘size’), because this would enable interesting comparisons to be made.
Starting with only 1100 sentences will limit how much your algorithm will be able to improve coverage, compared to random selection of a subset of the same size. But, it’s still a worthwhile exercise, because you’re doing all the important steps. So, go ahead.
In your report, you can acknowledge the limitations, and you could also show how much your algorithm was able to improve coverage. So, you have lots of ways to demonstrate your understanding and to get a good mark.
If you want to demonstrate that your text selection algorithm would work better given more source text, then you could run it on a much larger set (e.g., 1 million sentences) and measure coverage vs random selection. Don’t bother recording the selected sentences though – the goal is just to show that your algorithm works.
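To make that concrete, here is a minimal sketch (my own illustration, not a prescribed algorithm) of a greedy selection loop plus a random baseline of the same size. The helper units_of(sentence) is hypothetical: it stands for whatever returns the set of diphones (or other units) in a sentence, from your own front-end processing.

import random

def greedy_select(sentences, units_of, n_select):
    """Repeatedly pick the sentence that adds the most new unit types."""
    remaining = list(sentences)
    selected, covered = [], set()
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        if not (units_of(best) - covered):
            break  # no remaining sentence adds anything new
        selected.append(best)
        covered |= units_of(best)
        remaining.remove(best)
    return selected, covered

def random_baseline(sentences, units_of, n_select, seed=0):
    """Random subset of the same size, for comparison."""
    subset = random.Random(seed).sample(list(sentences), n_select)
    covered = set()
    for s in subset:
        covered |= units_of(s)
    return subset, covered

Comparing the sizes of the two covered sets (greedy vs random, same number of sentences) is exactly the kind of evidence you could present in your report.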
With only 500 sentences, you could record them all (assuming this comes to about the same size as Arctic A, so you could make a comparison). So, text selection would not be necessary. However, you should still measure the coverage of your text, and compare that to Arctic A (measured by you in the same way).
You need to decide on what measure(s) of coverage to use, and justify these in your report. Phonemes would be one possibility (a missing phoneme implies a large number of missing diphones – this would be bad); diphones would be another. For these, you can measure the number of types covered, and express this as a percentage of the theoretical total number of types possible. Read the technical report on Arctic for more details.
But you can think of additional measures too, such as the number of questions vs statements (if that’s relevant to your domain), coverage of domain-specific vocabulary, diphones-in-context, prosodic coverage, etc. Be creative! It’s fine to report multiple measures, provided that you justify each one, and perhaps also give a critique of which you think is most useful.
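As one concrete illustration of such a measure, here is a minimal sketch of diphone type coverage expressed as a percentage of the theoretical total. The helper phones_of(sentence) is hypothetical: it stands for whatever front-end you use to turn a sentence into a phone sequence.

from itertools import pairwise  # Python 3.10+; otherwise use zip(phones, phones[1:])

def diphone_coverage(sentences, phone_inventory, phones_of):
    """Percentage of diphone types covered, out of the theoretical total."""
    observed = set()
    for sentence in sentences:
        phones = ["#"] + phones_of(sentence) + ["#"]  # include silence at the edges
        observed.update(pairwise(phones))
    # every ordered pair of phones, counting silence; note that not every pair
    # is phonotactically possible, so 100% is not actually attainable
    n_possible = (len(phone_inventory) + 1) ** 2
    return 100.0 * len(observed) / n_possible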
Note: in general, I’d expect most postgrad students to find a large corpus and then write a text selection algorithm, whilst undergrads can select the text in an informal or manual way.
The Speech Signal Modelling video now has subtitles and a transcript.
Let me know if this helps, and post follow-up questions here.
Correct – the variance would be zero. That’s a serious problem, because such a model will assign zero probability mass everywhere except at the exact positions of the data points. Zero variance is also numerically impossible: we cannot compute with such a model.
But overfitting will probably occur long before we get to the point where there are as many mixture components as there are data points. It will happen as soon as the model starts to assign too much probability mass in the small regions around the observed data points and not enough mass to as-yet-unseen values that may occur in the test set.
The problems of small (including zero) variances in a model can be mitigated by setting a variance floor (e.g., not allowing the variance of any mixture component to go below 1% of the variance of the data as a whole). Using a variance floor is good practice because it avoids the numerical problems of very small (or zero) variances, and offers a partial solution to overfitting.
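As a small sketch of what a variance floor looks like in practice (the variable names here are my own, not from any particular toolkit):

import numpy as np

def apply_variance_floor(component_variances, all_data, floor_fraction=0.01):
    """Do not let any mixture component's variance fall below a fixed
    fraction (here 1%) of the variance of the data as a whole."""
    floor = floor_fraction * np.var(all_data)
    return np.maximum(component_variances, floor)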
For reference, here are the standard backoff rules for the Edinburgh accent. There is no rule to allow ii to back off to schwa (@) – I’m not sure why that is (but Korin will).

(set! unilex-edi-backoff_rules
  '(
    (l! l) (n! n) (eir e) (iii ii)
    (n @) (aa @) (ae @) (i @) (irr @) (iii @) (ei @) (er @) (a @)
    (eir @) (uw @) (@@r @) (e @) (oo @) (our @) (ow @) (o @)
    (uh @) (u @) (urr @) (uuu @) (i@ @) (ur @)
    (hw w) (s z) (_ #)
  ))
If there are as many Gaussian mixture components in a GMM as there are data points, then we would expect each component to model a single data point. The mean of each component would be equal to the value of the corresponding data point.
What will the variance of each mixture component be?
You’re right that Festival varies the pronunciation of “the” in this way. In principle, you can modify the back-off rules so that when dh_ii is not found then dh_@ is selected. Making such modifications is a little beyond the scope of this assignment, but if you really want to try, then look up the function du_voice.setDiphoneBackoff and ask Korin (who wrote this part of Festival) for help 🙂
It’s a reasonable idea: higher sampling rates will sound better. But, unfortunately you are limited to 16kHz for this assignment.
Whenever you see the error message “cannot execute binary file” referring to a file that should never be executed, you need to look for a place where that file is the first thing on a line in the shell script.
In your case, this is within the backticks on line 11. Backticks create a new shell, and their contents are passed to that shell to be executed. Here, the contents will be a list of wav files, so the shell tries to execute the first wav file, which is exactly the error you are seeing.
You need either:
for F in `ls ${RECDIR}/*.wav`
or
for F in ${RECDIR}/*.wav
or the fancier
for F in `find ${RECDIR} -name "*.wav"`
It’s fine to say that a constriction in the vocal tract is a source of sound, in the same way that we say the vocal folds are a source of sound.
Neither of these parts of the anatomy actually creates sound by itself; rather, sound is created by the way they change the airflow. The vocal folds interrupt the airflow periodically. A constriction (if narrow enough) will create turbulent airflow.
If we were being super-strict with the wording, perhaps we might say that these are the locations of sound sources.
Voiced fricatives have two sound sources. The clue is in the name:
voiced = the vocal folds are vibrating.
fricative = there is turbulent airflow caused by a constriction somewhere in the vocal tract.
If we want to synthesise such a sound using a vocoder, we will need what is called “mixed excitation”, in other words, a mixture of periodic and aperiodic sources. Some very simple vocoders cannot do this, because they switch between the two sources and can’t mix them together.
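To make the idea concrete, here is a minimal sketch of mixed excitation (not any particular vocoder’s implementation): a periodic pulse train and white noise are weighted and added, rather than switched between. The voicing weight, F0 and sample rate are illustrative assumptions; a real vocoder would estimate the mixing (often per frequency band) from the speech.

import numpy as np

def mixed_excitation(n_samples, fs=16000, f0=120.0, voicing=0.7, seed=0):
    """Weighted mix of a periodic pulse train and white noise."""
    rng = np.random.default_rng(seed)
    period = int(round(fs / f0))             # samples per pitch period
    pulses = np.zeros(n_samples)
    pulses[::period] = 1.0                   # periodic source (the vocal folds)
    noise = rng.standard_normal(n_samples)   # aperiodic source (the constriction)
    # a simple on/off vocoder would choose one source or the other;
    # mixed excitation adds them together with some weighting instead
    return voicing * pulses + (1.0 - voicing) * noise

excitation = mixed_excitation(16000)  # one second at 16 kHz; pass through a vocal-tract filter to get speech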
Correct. EM does not guarantee “to find the maximum likelihood parameter settings given the training data” – it can only increase (or at least not decrease) the likelihood at each iteration, stopping when it reaches a local maximum.
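If you want to see this property in code, here is a minimal sketch of EM for a 1-dimensional GMM (my own illustration, not a reference implementation from the course). The log-likelihood never decreases from one iteration to the next, and training stops when the improvement becomes negligible: a local maximum, not necessarily the global maximum-likelihood solution.

import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=100, tol=1e-6, seed=0):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # crude initialisation: random data points as means, global variance, uniform weights
    means = rng.choice(x, size=n_components, replace=False)
    variances = np.full(n_components, np.var(x))
    weights = np.full(n_components, 1.0 / n_components)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        ll = np.log(total).sum()

        # the likelihood can only increase (or stay the same) at each iteration;
        # stop when the improvement is negligible -- a local maximum
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        # in practice a variance floor would also be applied here

    return weights, means, variances, ll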