Fixing a unit sequence
- This topic has 3 replies, 2 voices, and was last updated 4 years, 8 months ago by Simon.
April 19, 2020 at 11:26 #11185
I’ve created two unit selection voices, and I’d like them both to use the same unit sequence when synthesising an utterance. Is there any way I can do this? Can I extract the unit sequence given by one unit selection engine and feed it into another voice?
-
April 19, 2020 at 13:59 #11186
The only special case where you might want to do this is when demonstrating the effect of fixing a labelling error. The instructions suggest restricting the database to only the utterances providing the desired units. (This is not guaranteed to keep the unit sequence the same, if there are multiple instances of some diphones, but usually does the trick.)
However, it’s not possible in general to impose this constraint on a full voice. Given that almost any change in a voice (e.g., different F0 estimation settings) will change either the join or target costs, you may get a different unit sequence. Constraining the unit sequence effectively means ignoring join and target costs.
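To illustrate why the sequence is so sensitive, here is a minimal sketch of unit selection as a Viterbi search over candidate units. This is a toy model, not Festival's implementation: the function `select_units`, the unit names, and all cost values are made up for illustration. It shows a small change to one join cost flipping the whole selected sequence.

```python
def select_units(candidates, target_cost, join_cost):
    """Toy Viterbi search: choose one candidate per position so that the
    sum of target costs and join costs is minimal."""
    # best[i][u] = (cost of cheapest path ending in unit u at position i, previous unit)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            # Cheapest way to arrive at u from any unit in the previous layer.
            prev, cost = min(
                ((p, c + join_cost(p, u)) for p, (c, _) in best[-1].items()),
                key=lambda pc: pc[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)

    # Backtrace from the cheapest final unit.
    unit = min(best[-1], key=lambda u: best[-1][u][0])
    path = [unit]
    for i in range(len(best) - 1, 0, -1):
        unit = best[i][unit][1]
        path.append(unit)
    return path[::-1]


# Hypothetical candidates: two instances of each of two diphones.
candidates = [["a1", "a2"], ["b1", "b2"]]

# Hypothetical join costs; target costs are all zero to keep the example small.
join = {
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.5, ("a2", "b2"): 0.2,
}

def flat_target_cost(position, unit):
    return 0.0

def table_join_cost(prev_unit, unit):
    return join[(prev_unit, unit)]

before = select_units(candidates, flat_target_cost, table_join_cost)

# Pretend a change to the voice (e.g. relabelling) nudges one join cost.
join[("a1", "b1")] = 0.3
after = select_units(candidates, flat_target_cost, table_join_cost)

print(before, after)  # ['a1', 'b1'] ['a2', 'b2']
```

Because the search minimises the total cost over the whole utterance, changing a single cost can change every unit chosen, not just the one whose cost moved.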
Why do you want to do that, in your case?
-
April 19, 2020 at 16:13 #11189
I want to test the effect of using different numbers of mixture components during forced alignment labelling. My hypothesis was that, because better labelling leads to better join points, listeners would hear fewer (or less noticeable) joins and rate the speech as more natural sounding.
I ran a listening test, and found no significant difference between voices labelled with different numbers of mixture components during forced alignment. I was trying to figure out why, and I found that the unit sequences were different between the two voices.
This means I might not really have been measuring the true effect of better labelling. Listeners might not be indifferent to join point quality; they might instead have been responding to the choice of units. I also found that more units were marked with ‘bad duration’ in voices with more mixture components in forced alignment. Labelling more units as outliers and imposing a high target cost on them might have meant that potentially good candidate units were never used, because the ‘bad duration’ cost was too high.
I wanted to force all the voices to use the same unit sequence so I could evaluate whether joins had generally been made better. If I could prove this, it would indicate there are some problems with the unit selection engine as is. That is, better labelling can make unit selection worse.
-
April 19, 2020 at 17:44 #11190
You’re experiencing the downside of unit selection – instability and unpredictable behaviour.
It’s also not necessarily the case that more powerful acoustic models provide a more accurate alignment.
Instead of forcing unit choices, you might think of some evidence that you can present to back up your claim that listeners were “responding to the choice of units”. This could be informal / qualitative / small scale / based on your own listening or analysis – no need for another listening test.
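One possible form of small-scale analysis is to compare the unit sequences the two voices selected for the same sentence. The sketch below is only an illustration under assumptions: it supposes you have already dumped each selected unit as a `(diphone, source_utterance, index_in_utterance)` tuple, which is a hypothetical representation rather than a Festival output format, and the example data is invented.

```python
from difflib import SequenceMatcher


def unit_overlap(seq_a, seq_b):
    """Fraction of units (aligned in order) that are identical in both sequences."""
    if not seq_a and not seq_b:
        return 1.0
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(seq_a), len(seq_b))


def count_joins(seq):
    """Count concatenation points: places where consecutive selected units
    were not adjacent in the same source utterance."""
    joins = 0
    for (_, utt_prev, idx_prev), (_, utt_next, idx_next) in zip(seq, seq[1:]):
        contiguous = utt_prev == utt_next and idx_next == idx_prev + 1
        if not contiguous:
            joins += 1
    return joins


# Invented example: the same three-diphone target synthesised by two voices.
voice_a = [("k-ae", "utt012", 3), ("ae-t", "utt012", 4), ("t-s", "utt101", 7)]
voice_b = [("k-ae", "utt012", 3), ("ae-t", "utt045", 9), ("t-s", "utt101", 7)]

print(f"overlap: {unit_overlap(voice_a, voice_b):.2f}")
print(f"joins in A: {count_joins(voice_a)}, joins in B: {count_joins(voice_b)}")
```

Low overlap, or a clear difference in the number of joins, across your test sentences would support the idea that listeners were hearing different units rather than the same units with better join points.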