Forum Replies Created
Correct – the value is a beam width (wider = less pruning) and takes values between 0 and 1. The default values are set in the file …/festival/lib/multisyn/multisyn.scm
I’ve added more information on pruning to the exercise.
I’ve realised there is indeed a run-time interface to all of the various join and target cost weights, beam widths, etc. I had originally thought that this interface was deprecated and that the values were compiled into the code, but I was wrong.
See the full list of functions – look for those that start “du_” (which means “diphone unit”)
This should be simpler than what you’re doing above.
We’ll look at this in the lecture.
Yes, there is some pruning of the candidates before search commences, then more pruning during the Viterbi search.
Some of the relevant functions within Festival are as follows:
festival> (du_voice.set_tc_rescoring_beam currentMultiSynVoice 0.5)
festival> (du_voice.set_tc_rescoring_weight currentMultiSynVoice 3.0)
festival> (du_voice.set_ob_pruning_beam currentMultiSynVoice 0.3)
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.3)
which you execute after loading a multisyn voice. Note that you use them literally as above, with the “currentMultiSynVoice” argument exactly as written (i.e., don’t replace that with the name of your voice).
See the full list of functions – look for those that start “du_” (which means “diphone unit”)
As you make the beam sizes smaller, the speech will gradually get worse. For very small numbers in some cases, you may prevent any sequence being found, and get the error message “No best candidate sequence found”.
Yes, I think that would work. Changing the normalisation of the join cost coefficients (not just MFCCs – also F0 and energy) effectively changes the relative weight between join cost and target cost.
Try making the join cost coeffs very small – you should get more joins (fewer contiguous sequences of candidates), and therefore presumably more bad joins.
Try making them rather large, and you should get more contiguous sequences of candidates, but which match the target context less well.
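To make the effect of rescaling concrete, here is a toy sketch in Python (not Festival code). Everything in it is made up for illustration: the `Cand` class, the one-dimensional edge values standing in for the MFCC, F0 and energy frames, the candidate costs, and the `scale` parameter that plays the role of the join cost normalisation. The only things taken from the discussion are that the total cost is target cost plus join cost, and that naturally contiguous units have a join cost of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cand:
    db_pos: int         # hypothetical position of the unit in the recorded database
    target_cost: float  # toy target cost for this candidate
    left: float         # toy 1-D acoustic value at the unit's left edge
    right: float        # toy 1-D acoustic value at the unit's right edge


def join_cost(a, b, scale):
    """Zero for naturally contiguous units, otherwise a scaled edge distance."""
    if b.db_pos == a.db_pos + 1:
        return 0.0
    return scale * abs(a.right - b.left)


def search(lattice, scale):
    """Viterbi search over a list of candidate lists; returns (cost, path)."""
    best = [(c.target_cost, [c]) for c in lattice[0]]
    for column in lattice[1:]:
        new_best = []
        for c in column:
            # Cheapest way of reaching candidate c from any previous candidate.
            cost, path = min(
                ((pc + join_cost(pp[-1], c, scale) + c.target_cost, pp)
                 for pc, pp in best),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [c]))
        best = new_best
    return min(best, key=lambda item: item[0])


def count_joins(path):
    """Number of places where consecutive units are not contiguous in the database."""
    return sum(1 for a, b in zip(path, path[1:]) if b.db_pos != a.db_pos + 1)


# Three targets. Units 10, 11, 12 are contiguous in the database but match the
# target context poorly; units 40, 70, 95 match well but must be joined.
LATTICE = [
    [Cand(10, 0.5, 0.0, 0.0), Cand(40, 0.1, 1.0, 2.0)],
    [Cand(11, 0.5, 0.0, 0.0), Cand(70, 0.1, 1.0, 2.0)],
    [Cand(12, 0.5, 0.0, 0.0), Cand(95, 0.1, 1.0, 2.0)],
]

for scale in (0.1, 5.0):
    cost, path = search(LATTICE, scale)
    print(scale, [c.db_pos for c in path], count_joins(path))
```

With the small `scale`, the search prefers the well-matched units and accepts two joins; with the large `scale`, it falls back to the contiguous run, even though those units match the target context less well.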
Festival doesn’t do anything to bias against joins in [r] etc – but commercial systems certainly do.
The join cost for naturally-contiguous units is simply defined to be zero and isn’t even calculated.
Festival computes join cost entirely locally, just from the frames either side of the join.
I’ve updated my previous response….
How to do detective work on the target cost?
Well, it will be forensic detective work, I think. You will need to look at the linguistic context of the target and the linguistic context of all available candidates in the database (including the one that was chosen), and then count the mismatches for each: basically, compute the target cost yourself.
I don’t recommend doing this. Looking at a single target in isolation will not tell the whole story: the candidates chosen for all the other targets have an influence on this choice, via the sequence of join costs.
The fact that units with a relatively high target cost have been chosen simply means that they are part of the lowest-overall-cost sequence. One possible reason for that is that there is only one available candidate for a given target diphone type, and so it will always be used, no matter how high the cost (e.g., even if it has “bad F0”).
The same applies for “bad duration”.
You might think that a candidate can only be an outlier if there are several other diphones of the same type. But we look at the two halves of the diphone separately. So, “outlier” is with respect to the monophone duration distribution.
The source code says:
Specifically, if the targ/cand segment type is expected to be voiced, then an f0 of zero is bad (results from poor pitch tracking).
That is, all voiced sounds should have a value for F0, as determined by the pitch tracker.
Festival’s multisyn unit selection engine uses a pure “IFF” target cost function (using Taylor’s terminology). It makes no explicit predictions of any acoustic properties.
The ToBI predictions made by the front end are not used in the target cost.
OK – I see. The basic target cost (the weighted sum of feature mismatches) is normalised to the 0-1 range. After that, penalties may be added for things like “bad F0” or “bad duration” and those penalties can have values such as 25 or 10.
So a target cost of, say, 10.375 is likely to be a basic cost of 0.375 plus a penalty of 10.
There is no need to store word position: it can be deduced easily on the fly by querying the utterance structure (segments have a syllable as parent, which in turn has a word as parent).
Multisyn does actually use “phrase position” in the target cost (this was omitted in the lecture slides – apologies). Here are the actual costs currently used in Multisyn:
(10 tc_stress)
(5 tc_syl_pos)
(5 tc_word_pos)
(6 tc_partofspeech)
(7 tc_phrase_pos)
(4 tc_left_context)
(3 tc_right_context)
(25 tc_bad_f0)
(10 tc_bad_duration)
where “tc_phrase_pos” looks at the match/mismatch in the phrase break feature of the word that the current segment belongs to.
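As a rough worked example of how these numbers could combine, here is a small Python sketch (not Festival code). It uses the weights listed above and the description elsewhere in this thread: a basic cost normalised to the 0-1 range, with penalties added afterwards. The assumption, which I have not checked against the source code, is that the normalisation simply divides by the sum of the seven mismatch weights (40). The function name `target_cost` and the way mismatches are passed in are hypothetical.

```python
# Mismatch weights and penalties as listed for Multisyn in this thread.
MISMATCH_WEIGHTS = {
    "tc_stress": 10,
    "tc_syl_pos": 5,
    "tc_word_pos": 5,
    "tc_partofspeech": 6,
    "tc_phrase_pos": 7,
    "tc_left_context": 4,
    "tc_right_context": 3,
}
PENALTIES = {"tc_bad_f0": 25, "tc_bad_duration": 10}


def target_cost(mismatches, penalties=()):
    """Toy target cost: normalised weighted mismatches plus unnormalised penalties.

    mismatches: names of the features on which target and candidate differ.
    penalties:  names of the penalties triggered by the candidate.
    ASSUMPTION: normalisation is division by the sum of the mismatch weights.
    """
    total_weight = sum(MISMATCH_WEIGHTS.values())  # 40 for the weights above
    basic = sum(MISMATCH_WEIGHTS[name] for name in mismatches) / total_weight
    return basic + sum(PENALTIES[name] for name in penalties)


# A candidate that mismatches on stress and syllable position (15/40 = 0.375)
# and is also flagged as having a bad duration (+10).
print(target_cost({"tc_stress", "tc_syl_pos"}, {"tc_bad_duration"}))  # 10.375
```

Under that assumption, a target cost of 10.375 decomposes into a basic cost of 0.375 plus a “bad duration” penalty of 10, and any cost above 1 must include at least one penalty.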
Can you post an example of negative target costs – e.g., the output of (utt.relation.print yourutt 'Unit)?
There’s no intuition of ‘high’ or ‘low’ costs – it is their value relative to the costs of alternate unit sequences that matters.
February 25, 2016 at 12:52, in reply to: Backoff (from diphones to half phones) #2637
See section 3.7.1 in this paper:
Robert A. J. Clark, Korin Richmond, and Simon King. Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330, 2007. Publisher’s version, or open access version.
Let’s go over the difference between pitch tracking and pitch marking in the lecture.
For speech, we often conflate the terms “pitch” and “fundamental frequency” even though we really should not. See also this topic.