Forum Replies Created
This is a new feature of recent versions of macOS (Catalina onwards) – Apple have changed the default shell. But this is not an error message – just information.
The (base) in your shell prompt suggests that you haven’t activated the Python virtual environment, so try:
(base) $ conda activate slp
and you should see your prompt change from (base) to (slp). Now try
(slp) $ jupyter notebook
OK, that set-up should be fine. The VM image has not changed. Just complete all the testing in Module 0 and make sure it all works, then you are good to go.
Those don’t look like important errors, so if you’ve completed all tasks then you’re done.
Which operating system and host software are you using? We did some testing with VirtualBox but are now recommending VMware.
I don’t think there is an electronic version of this book. It’s good and cheap enough to be worth buying, otherwise the main library has multiple copies.
Festival provides functions to change a variety of weights, including those within the join cost, but not those within the target cost (which are defined in code). See this post.
festival_mac indicates the problem – the cause of this mysterious random change to the PATH is currently unknown, but there are some workarounds here.

We are doing what is called “flat start” training, which means going directly from “flat” models (i.e., with all the means set to 0 and the variances set to 1) to the Baum-Welch algorithm, using data of complete utterances. This means we do not need to label the start and end of either words or phones (in contrast to the digit recogniser exercise, where we did hand-label the training data).
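As a rough sketch of what that means in practice (written in Python for illustration, not taken from HTK; the feature dimension, state count, and phone list below are made-up placeholders):

import numpy as np

FEATURE_DIM = 39          # e.g. 12 MFCCs + energy, plus deltas and delta-deltas
STATES_PER_PHONE = 3      # a typical left-to-right HMM topology

def flat_start_models(phone_list):
    """One identical 'flat' Gaussian per state of every monophone model."""
    return {
        phone: {
            "means": np.zeros((STATES_PER_PHONE, FEATURE_DIM)),     # all means set to 0
            "variances": np.ones((STATES_PER_PHONE, FEATURE_DIM)),  # all variances set to 1
        }
        for phone in phone_list
    }

# Baum-Welch (embedded re-estimation) then updates these parameters using only the
# word-level transcription of each complete utterance; the alignment of states to
# frames emerges during training, so no hand-labelled boundaries are needed.
models = flat_start_models(["sil", "aa", "b", "ch", "d"])   # hypothetical phone set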
HResults needs another file, listing the valid labels, so you need to do:
$ HResults -p -I ./reference.mlf wordlist rec/intel*.rec
where the file wordlist contains a list of all possible words that could be found in the transcriptions or rec files.
I’ve also checked, and the dummy timestamps are not necessary: the rec files can just have one word per line.
Here’s one way to make the wordlist file, assuming rec files with one word per line and no timestamps:
$ cat reference.mlf rec/intel*.rec | egrep -v '#|"|\.' | sort -u > wordlist
The format of the reference MLF should be:

#!MLF!#
"*/intel1.lab"
word1
word2
word3
.
"*/intel2.lab"
word1
word2
.
with a final newline at the end of the file. The format of each rec file should also be one word per line (and you might need dummy start/end time?) – look at your rec files from the Speech Processing digit recogniser assignment.
Are you sure you have loaded your unit selection voice? Your output looks like that from a diphone voice, such as the default voice when you first start Festival.
For a unit selection voice, you should see something like this when inspecting the Unit relation:

id _22 ; name #_h ; ph1 "[Val item]" ; sig "[Val wave]" ; coefs "[Val track]" ; middle_frame 10 ; source_utt arctic_a0379 ; source_ph1 "[Val item]" ; source_end 0.202 ; target_cost 0.0625 ; join_cost 0 ; end 0.136313 ; num_frames 14 ;
where source_utt tells you where the selected unit came from.
Taylor wrote that from his experience building two commercial systems, which were successors to Festival.
Festival doesn’t do anything to vary the components of the join cost, beyond the special case of one diphone being voiced at the join point and the other unvoiced (according to estimated F0).
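To illustrate the general idea only (this is a Python sketch, not Festival's code; the sub-cost names, weights, and penalty value are all placeholders), a join cost of this shape combines fixed-weight sub-costs and adds an extra penalty just in that voiced/unvoiced special case:

def join_cost(spectral_dist, f0_dist, energy_dist, f0_left, f0_right,
              w_spectral=1.0, w_f0=1.0, w_energy=1.0, voicing_penalty=10.0):
    """Weighted sum of sub-costs; the weights and penalty are arbitrary placeholders."""
    cost = w_spectral * spectral_dist + w_f0 * f0_dist + w_energy * energy_dist
    # the special case: one candidate voiced at the join point, the other unvoiced,
    # judged from the estimated F0 (zero F0 is taken to mean unvoiced)
    if (f0_left > 0.0) != (f0_right > 0.0):
        cost += voicing_penalty
    return cost

# e.g. a spectrally similar join that is voiced on one side and unvoiced on the other
print(join_cost(spectral_dist=0.2, f0_dist=0.0, energy_dist=0.1,
                f0_left=120.0, f0_right=0.0))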
The use of separate labels for closure and burst in plosives is only for forced alignment. It allows the join point to be placed reliably at the midpoint of the closure (the midpoint of the entire segment would sometimes be in the closure, sometimes in the burst, leading to synthesised plosives with 0, 1, or 2 bursts).
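A tiny worked example with made-up timings shows why cutting at the midpoint of a separately-labelled closure is safer than cutting at the midpoint of the whole plosive:

# Hypothetical timings: a 40 ms closure followed by a 60 ms burst.
closure_start, closure_end = 0.300, 0.340
burst_end = 0.400

mid_of_closure = (closure_start + closure_end) / 2        # 0.320 s: always inside the closure
mid_of_whole_segment = (closure_start + burst_end) / 2    # 0.350 s: lands inside the burst

print(mid_of_closure, mid_of_whole_segment)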
You’re experiencing the downside of unit selection – instability and unpredictable behaviour.
It’s also not necessarily the case that more powerful acoustic models provide a more accurate alignment.
Instead of forcing unit choices, you might think of some evidence that you can present to back up your claim that listeners were “responding to the choice of units”. This could be informal / qualitative / small scale / based on your own listening or analysis – no need for another listening test.
The only special case where you might want to do this is when demonstrating the effect of fixing a labelling error. The instructions suggest restricting the database to only the utterances providing the desired units. (This is not guaranteed to keep the unit sequence the same, if there are multiple instances of some diphones, but usually does the trick.)
However, it’s not possible in general to impose this constraint on a full voice. Given that almost any change in a voice (e.g., different F0 estimation settings) will change either the join or target costs, you may get a different unit sequence. Constraining the unit sequence effectively means ignoring join and target costs.
Why do you want to do that, in your case?
Yes, using more components gives a more expressive probability density function that can better fit the data. As in all machine learning, having too many model parameters (here, the number of mixture components controls the number of means and variances to estimate) can lead to overfitting. That’s probably not the main concern here, since we are not trying to generalise from training data to test data.
Try limiting the number of components to 1, as a way to get potentially worse alignments. Another way to get worse alignments would be to reduce the amount of data used to train the models, whilst still aligning all the data in the final run of HVite in the script.
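To see how the number of components controls how tightly a mixture model can fit data, here is a small illustrative sketch using scikit-learn on made-up one-dimensional data (nothing to do with the actual HTK models or features):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical 1-D "acoustic feature" data drawn from two well-separated clusters
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(2.0, 0.5, 500)]).reshape(-1, 1)

for n_components in (1, 2, 16):
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(data)
    # average log-likelihood of the training data: higher means a tighter fit
    print(n_components, "components:", round(gmm.score(data), 3))

A single component fits this bimodal data poorly, two fit it well, and sixteen mainly add parameters to estimate, which is the overfitting risk mentioned above.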