Target and join cost
- This topic has 7 replies, 3 voices, and was last updated 8 years, 10 months ago by Simon.
-
February 23, 2016 at 12:15 #2632
I’m trying a lot of different sentences in Festival to find errors. I have found plenty, but it is not always easy to identify the cause. For example, I think I found a join error (in Praat it seems clear that it is a bad join), but the join costs in the utterance’s Unit relation don’t seem too high. What values of target and join cost are considered high?
If the target cost is high, like 10, how can we know which feature is the worst one (pitch, duration, etc.)? Can we access the target cost feature calculation for the current utterance?
I ask because, for example, if I think part of the utterance sounds bad because of pitch, and then I check the f0 labels and they look OK, I just have to assume that the front end predicted the f0 contour wrongly. It would be better to actually see the target specification that it is trying to achieve. Thanks.
-
February 24, 2016 at 20:22 #2633
Following up on this question. Looking at Lecture 2, slides 62-64:
If the sub-costs are either 0 or 1 and are then scaled by weights ranging from 4 to 25, how do we end up with some target costs that are less than zero? Is there any intuition for what constitutes a ‘high’ or ‘low’ cost, for either targets or joins?
-
February 25, 2016 at 13:02 #2638
Can you post an example of negative target costs, e.g., the output of (utt.relation.print yourutt 'Unit)?
There’s no intuition of ‘high’ or ‘low’ costs – it is their value relative to the costs of alternate unit sequences that matters.
-
February 25, 2016 at 13:52 #2640
Oops, I didn’t mean ‘less than zero’, I meant ‘less than 1’. Apologies. I meant: if a sub-cost is 1 and is then scaled by a weight greater than 1, how do we end up with values between 0 and 1? In fact, the majority of target costs appear to be in the range 0 to 1, although there are occasionally much higher costs, in the 10 to 50 range. How does this ‘weight scaling’ actually work? Is there a normalisation somewhere in the math that moves values into the 0 to 1 range? But then how do we sometimes get these values that are much larger than 1?
-
February 25, 2016 at 13:58 #2641
OK – I see. The basic target cost (the weighted sum of feature mismatches) is normalised to the 0-1 range. After that, penalties may be added for things like “bad F0” or “bad duration” and those penalties can have values such as 25 or 10.
So a target cost of, say, 10.375 is likely to be a basic cost of 0.375 plus a penalty of 10.
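The arithmetic above can be sketched in a few lines of Python. This is only an illustration and not Festival's actual code: the feature names, weights and penalty values are made up for the example, and the normalisation is assumed to be a division by the sum of the weights.

```python
# Illustrative only: feature names, weights and penalties are hypothetical,
# and dividing by the sum of the weights is an assumed normalisation.
WEIGHTS = {"stress": 10, "left_context": 5, "right_context": 4, "phrase_position": 5}
PENALTIES = {"bad_f0": 25, "bad_duration": 10}


def target_cost(mismatches, flags):
    """Normalised weighted sum of 0/1 mismatches, plus any flat penalties."""
    weighted = sum(WEIGHTS[name] * mismatches.get(name, 0) for name in WEIGHTS)
    basic = weighted / sum(WEIGHTS.values())  # always between 0 and 1
    penalty = sum(PENALTIES[name] for name in flags)
    return basic + penalty


def split_cost(cost):
    """Guess (basic, penalty) from an observed cost, assuming whole-number penalties.

    Ambiguous when the basic cost is exactly 1 (every feature mismatched).
    """
    penalty = int(cost)
    return cost - penalty, penalty


# Two context mismatches (5 + 4 out of a total weight of 24) and a duration penalty.
example = target_cost({"left_context": 1, "right_context": 1}, ["bad_duration"])
basic_part, penalty_part = split_cost(example)
```

Under these assumptions, any cost above 1 in the Unit relation can be read as "a fraction for the feature mismatches plus one or more penalties", which is what `split_cost` recovers.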
-
February 25, 2016 at 14:23 #2642
Aha!!! Now that makes sense. Thank you. So maybe this does in fact give us some intuition: numbers larger than 1 are indicating one of the ‘major penalties’, such as bad duration or bad F0, as at least part of the cost incurred. This in turn might indicate that the database was so sparse for this particular diphone that the ‘best’ selection was a durational/F0 outlier (still waiting for your response as to what constitutes a ‘bad F0’ value – see other post) – a kind of ‘last resort’ choice, which is likely to sound bad (hence the extreme penalty value, to discourage these diphones from ever being selected). Does that line of reasoning make sense? As Pilar pointed out in her original post, it is very difficult to ‘reverse engineer’ the target costs we are seeing, to determine why a particular unit was chosen over the other options. Any suggestions for how to carry out this detective work?
-
February 25, 2016 at 14:44 #2645
The fact that units with a relatively high target cost have been chosen simply means that they are part of the lowest-overall-cost sequence. One possible reason for that is that there is only one available candidate for a given target diphone type, and so it will always be used, no matter how high the cost (e.g., even if it has “bad F0”).
The same applies for “bad duration”.
You might think that a candidate can only be an outlier if there are several other diphones of the same type. But we look at the two halves of the diphone separately. So, “outlier” is with respect to the monophone duration distribution.
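As a rough sketch of that last point: even a diphone type with a single candidate can be flagged, because each half is compared against all instances of that phone in the database. The test used here (a distance from the mean in standard deviations) and the threshold are assumptions for illustration; the thread does not say how Festival actually decides what counts as an outlier.

```python
from statistics import mean, stdev


def is_duration_outlier(duration, phone_durations, threshold=3.0):
    """True if `duration` is far from the mean duration of this phone.

    Assumed test: more than `threshold` standard deviations from the mean.
    """
    if len(phone_durations) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(phone_durations)
    sigma = stdev(phone_durations)
    return abs(duration - mu) > threshold * sigma


# Hypothetical durations (seconds) of every instance of one phone in the
# database, pooled over all the diphones that contain it.
pool = [0.08, 0.09, 0.10, 0.09, 0.08, 0.10, 0.09, 0.09]

# The only candidate for some rare diphone: its half of this phone lasts 0.30 s.
rare_candidate_is_outlier = is_duration_outlier(0.30, pool)
typical_candidate_is_outlier = is_duration_outlier(0.095, pool)
```

The pool is built per phone, not per diphone, which is why a one-of-a-kind diphone still has a distribution to be compared against.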
-
February 25, 2016 at 14:47 #2646
How to do detective work on the target cost?
Well, it will be forensic detective work, I think. You will need to look at the linguistic context of the target and the linguistic context of all available candidates in the database (including the one that was chosen), and then count the mismatches for each: basically, compute the target cost yourself.
I don’t recommend doing this. Looking at a single target in isolation will not tell the whole story: the candidates chosen for all the other targets have an influence on this choice, via the sequence of join costs.
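To see why a single target in isolation is not enough, here is a toy search over two targets with two candidates each. It is a generic dynamic-programming (Viterbi-style) sketch, not Festival's implementation, and every unit name and cost in it is invented for the example.

```python
def best_sequence(target_costs, join_costs):
    """Find the candidate sequence with the lowest total cost.

    target_costs: one dict per target position, {candidate: target cost}.
    join_costs: {(previous candidate, candidate): join cost}.
    Returns (total cost, list of chosen candidates).
    """
    # best[c] = (cost of the cheapest path ending in candidate c, that path)
    best = {c: (tc, [c]) for c, tc in target_costs[0].items()}
    for candidates in target_costs[1:]:
        new_best = {}
        for c, tc in candidates.items():
            prev_cost, prev_path = min(
                ((cost + join_costs[(path[-1], c)], path) for cost, path in best.values()),
                key=lambda item: item[0],
            )
            new_best[c] = (prev_cost + tc, prev_path + [c])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Hypothetical costs: "b2" carries a penalty of 10, but it joins perfectly
# with "a1" (imagine they were adjacent in the recorded database).
TARGET_COSTS = [
    {"a1": 0.25, "a2": 0.5},
    {"b1": 0.125, "b2": 10.375},
]
JOIN_COSTS = {
    ("a1", "b1"): 15.0,
    ("a2", "b1"): 12.0,
    ("a1", "b2"): 0.0,
    ("a2", "b2"): 3.0,
}

total, path = best_sequence(TARGET_COSTS, JOIN_COSTS)
```

With these numbers the search picks "a1" followed by "b2", even though "b2" has by far the worst target cost at its position: the cheap join outweighs the penalty. Recomputing the target cost of "b2" alone would never reveal that.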
-