Join cost in Multisyn
- This topic has 6 replies, 4 voices, and was last updated 4 years, 3 months ago by Simon.
-
April 5, 2018 at 15:01 #9224
In the videos on the site, it is said that only one frame either side of the join is taken into consideration for the join cost. However, the Festival Multisyn paper seems to say that two frames either side are considered. Have I misread something, or does this differ between versions?
-
April 5, 2018 at 15:19 #9225
Can you point me to the exact place in the paper that this is mentioned please?
-
April 5, 2018 at 16:37 #9226
From Section 3.9 (join cost) of Clark et al. 2007: ‘Spectral discontinuity is estimated by calculating the Euclidean distance between two vectors of 12 MFCCs from either side of a potential join point, as the MFCCs are usually mean/variance normalised first, this is effectively a Mahalanobis distance with diagonal covariance.’
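To spell out the equivalence claimed in that sentence, here is a short derivation. It assumes each MFCC coefficient $i$ is normalised with its own mean $\mu_i$ and standard deviation $\sigma_i$. These symbols, and $x$, $y$ for the two raw MFCC vectors, are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
d(\hat{x}, \hat{y})
  &= \sqrt{\sum_{i=1}^{12} \left( \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right)^{2}} \\
  &= \sqrt{\sum_{i=1}^{12} \frac{(x_i - y_i)^{2}}{\sigma_i^{2}}} \\
  &= \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)},
  \qquad \Sigma = \operatorname{diag}(\sigma_1^{2}, \dots, \sigma_{12}^{2})
\end{aligned}
```

The means cancel, so only the per-coefficient variances matter, which is exactly a Mahalanobis distance with a diagonal covariance matrix.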
-
April 5, 2018 at 17:28 #9227
Ah – poor wording in the paper. Blame the last author. This is clearer:
“Spectral discontinuity is estimated by calculating the Euclidean distance between a pair of vectors of 12 MFCCs: one from either side of a potential join point.”
So, indeed, there is one frame either side of the join.
-
April 5, 2018 at 17:51 #9229
Yes, just 1 frame either side.
(The code is the ultimate documentation, and the code definitely says 1 frame!)
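For anyone who wants to see the "one frame either side" computation concretely, here is a minimal Python sketch. It is not Festival's code: the function names `normalise` and `spectral_join_cost` are hypothetical, and it assumes the MFCC frames are plain lists of 12 floats, with the left frame being the last frame of the left candidate unit and the right frame being the first frame of the right candidate unit.

```python
import math


def normalise(frame, means, stds):
    """Mean/variance normalise one MFCC frame, coefficient by coefficient.

    `means` and `stds` are assumed to be per-coefficient statistics
    estimated over the whole database (an assumption of this sketch).
    """
    return [(c - m) / s for c, m, s in zip(frame, means, stds)]


def spectral_join_cost(left_frame, right_frame):
    """Euclidean distance between ONE frame on each side of the join."""
    if len(left_frame) != len(right_frame):
        raise ValueError("frames must have the same number of coefficients")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left_frame, right_frame)))


# Toy 12-dimensional frames (made-up values, already normalised).
left = [3.0] + [0.0] * 11
right = [0.0, 4.0] + [0.0] * 10
cost = spectral_join_cost(left, right)
```

Only a single pair of vectors enters the distance, which matches the clarified wording above: "a pair of vectors of 12 MFCCs: one from either side".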
-
April 17, 2020 at 19:00 #11180
In Taylor’s chapter 16, it’s mentioned that the phone class of different units “should play an important part in any well-designed join cost function” (p.499). Is this at all part of the join cost in Festival?
In the Multisyn paper it states that differences in voicing between two units incur different penalties, but I haven’t found anything stating that the actual phone class has an effect on the cost. If this is the case, why do we distinguish between the closure and burst of a plosive when labelling the database?
-
April 19, 2020 at 17:50 #11191
Taylor wrote that from his experience building two commercial systems, which were successors to Festival.
Festival doesn’t do anything to vary the components of the join cost, beyond the special case of one diphone being voiced at the join point and the other unvoiced (according to estimated F0).
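As a rough illustration of that special case, here is a Python sketch of an F0 component that treats a voicing mismatch separately. Everything specific in it is an assumption made for illustration: the name `f0_join_cost`, the convention that an F0 of 0.0 means "unvoiced", the penalty value, and the use of an absolute difference when both sides are voiced. It is not a transcription of Festival's code.

```python
def f0_join_cost(f0_left, f0_right, voicing_mismatch_penalty=1.0):
    """Toy F0 join cost from the estimated F0 on each side of the join.

    Assumed convention: F0 == 0.0 means the frame was estimated as unvoiced.
    The penalty value is hypothetical.
    """
    left_voiced = f0_left > 0.0
    right_voiced = f0_right > 0.0

    # Special case: one side voiced at the join point, the other unvoiced.
    if left_voiced != right_voiced:
        return voicing_mismatch_penalty

    # Both unvoiced: there is no F0 to mismatch.
    if not left_voiced:
        return 0.0

    # Both voiced: plain difference in estimated F0.
    return abs(f0_left - f0_right)
```

Note that nothing in this sketch looks at the phone class of either unit; the only category-like information is voicing, and that comes from the F0 estimate rather than from the labels.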
The use of separate labels for closure and burst in plosives is only for forced alignment. It allows the join point to be placed reliably at the midpoint of the closure (the midpoint of the entire segment would sometimes be in the closure, sometimes in the burst, leading to synthesised plosives with 0, 1, or 2 bursts).
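A small Python sketch may make the 0, 1, or 2 bursts point concrete. It uses two made-up tokens of the same plosive, one with a short closure and one with a long closure, and counts how many burst onsets survive when a left unit ending in the plosive is joined to a right unit starting in it. The timings, the function names, and the simplification that a unit "keeps" a burst if it contains the burst onset are all assumptions of this sketch.

```python
def midpoint(start, end):
    return (start + end) / 2.0


def cut_point(token, use_closure_midpoint):
    """Join point inside a plosive given as (closure_start, closure_end, burst_end)."""
    closure_start, closure_end, burst_end = token
    if use_closure_midpoint:
        return midpoint(closure_start, closure_end)
    # Without separate closure/burst labels, only the whole segment is known.
    return midpoint(closure_start, burst_end)


def burst_onsets_after_join(left_token, right_token, use_closure_midpoint):
    """Count burst onsets in the joined plosive (left unit ends, right unit starts)."""
    left_cut = cut_point(left_token, use_closure_midpoint)
    right_cut = cut_point(right_token, use_closure_midpoint)
    # The left unit keeps everything before its cut; the right keeps everything after.
    left_keeps_onset = left_cut > left_token[1]
    right_keeps_onset = right_cut <= right_token[1]
    return int(left_keeps_onset) + int(right_keeps_onset)


# Hypothetical timings in seconds: (closure_start, closure_end, burst_end).
short_closure = (0.00, 0.02, 0.10)
long_closure = (0.00, 0.08, 0.10)

whole_segment = [
    burst_onsets_after_join(short_closure, long_closure, False),
    burst_onsets_after_join(long_closure, short_closure, False),
]
closure_only = [
    burst_onsets_after_join(short_closure, long_closure, True),
    burst_onsets_after_join(long_closure, short_closure, True),
]
```

With these toy numbers, cutting at the midpoint of the whole segment gives two bursts for one pairing and none for the other, whereas cutting at the midpoint of the closure gives exactly one burst in both cases.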