- This topic has 3 replies, 3 voices, and was last updated 8 years, 10 months ago by Simon.
February 2, 2016 at 20:37 #2399
I’m confused about exactly how and where Festival determines the actual diphone boundaries, and whether and where it stores them. I’ve tried reading the manual, but the most recent one I can find (2014?) says nothing about multisyn, which it seems we are using. Is it based on the same architecture as UniSyn? If so, the manual says there is a ‘diphone index’…somewhere. I’ve looked through the various scripts we use in the assignment, and the only mention of diphones I can find is in the strip_join_cost_coefs script, which obviously needs to know where the diphone boundaries are if it’s going to keep only the ‘edge’ frames.
Is there a diphone index? If so, why is that information stored separately from the .utt file (which only seems to have the phone timing info, not the diphone boundary info, but does seem to contain the rest of the linguistic specification necessary to compute target cost)? Can you shed some light on this to help me understand?
February 3, 2016 at 08:36 #2409
Diphone boundaries are generally just the midpoint between phone boundaries. So, there is no need to store this information in the .utt files because it’s very fast to compute on the fly (e.g., as the file is loaded).
Likewise, it’s easy to construct an index of all available diphones on the fly, as the .utt files are loaded, and store it in memory.
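To make the idea concrete, here is a minimal sketch of computing diphone boundaries and an in-memory index from phone timings alone. It is not Festival's code: the function names, the tuple layout and the example utterance are all assumptions made for illustration, and it uses only the plain midpoint rule.

```python
from collections import defaultdict

def diphones_from_phones(phones):
    """phones: list of (label, start, end) tuples in utterance order.
    Returns a list of (diphone_name, left_join, right_join)."""
    # Plain midpoint rule: each join point is halfway through its phone.
    mids = [(start + end) / 2 for _, start, end in phones]
    return [
        (f"{phones[i][0]}-{phones[i + 1][0]}", mids[i], mids[i + 1])
        for i in range(len(phones) - 1)
    ]

def build_index(utterances):
    """utterances: dict mapping a (hypothetical) utterance id to its phone list.
    Returns a dict mapping each diphone name to every place it occurs."""
    index = defaultdict(list)
    for utt_id, phones in utterances.items():
        for name, left, right in diphones_from_phones(phones):
            index[name].append((utt_id, left, right))
    return index

# Made-up timings (in seconds) for a single utterance.
utts = {
    "utt_001": [
        ("sil", 0.0, 0.25),
        ("h", 0.25, 0.5),
        ("i", 0.5, 1.0),
        ("sil", 1.0, 1.5),
    ]
}
index = build_index(utts)
print(index["h-i"])
```

Because each diphone needs only two phone midpoints, rebuilding the index every time the .utt files are loaded costs very little.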
February 4, 2016 at 10:32 #2410
Is this the function in strip_join_cost_coefs that calculates the midpoint?

    def join_point_time(item):
        if item.f_present("cl_end"):
            return item.F("cl_end")
        elif item.f_present("dipth"):
            return (0.75 * item.F("start")) + (0.25 * item.F("end"))
        else:
            return (item.F("start") + item.F("end")) / 2

It apparently does something different for stops and diphthongs; otherwise it just sums the start and end times and divides by 2 to get the halfway point.
February 4, 2016 at 13:10 #2411
That function is calculating the midpoints, yes. The code you’re showing is used for stripping the join cost coefficients during voice building, but it’s performing the same calculation that is done during synthesis.
In lectures, we did indeed gloss over a couple of special cases:
Diphthongs: the 50% point is a poor choice, since the spectrum may be changing rapidly there, so we place the join a quarter of the way through the segment (0.75*start + 0.25*end in the code above), where the spectrum is generally a little more stable.
Stops: the end of the closure (stored in cl_end) will have been found during forced alignment (how?) and so we use that as the join point; picking the 50% point in a stop (=closure+burst) might sometimes be before the burst, and other times in the middle of the burst, so would be a bad place to make a join (e.g., we might end up with two bursts in the synthetic speech).
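As a quick check of the three cases, here is a small worked example. It is only a sketch: it swaps Festival's item object for a plain dict of features (an assumption made so the example runs without Festival), and the timings are made up. The feature names and weights are the ones in the function quoted above.

```python
def join_point_time(feats):
    """Dict-based stand-in for the quoted function (not the Festival API)."""
    if "cl_end" in feats:
        # Stop: join at the end of the closure found by forced alignment.
        return feats["cl_end"]
    elif "dipth" in feats:
        # Diphthong: weights from the quoted code, start + 25% of the duration.
        return 0.75 * feats["start"] + 0.25 * feats["end"]
    else:
        # Everything else: the plain midpoint.
        return (feats["start"] + feats["end"]) / 2

vowel = {"start": 0.0, "end": 0.5}
diphthong = {"start": 1.0, "end": 2.0, "dipth": 1}
stop = {"start": 2.0, "end": 2.5, "cl_end": 2.25}

print(join_point_time(vowel))      # 0.25
print(join_point_time(diphthong))  # 1.25
print(join_point_time(stop))       # 2.25
```

With these weights, the diphthong join falls a quarter of the way into the segment, and the stop join ignores the segment's start and end times entirely.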