Diphone continuity from word to word
- This topic has 3 replies, 2 voices, and was last updated 8 years, 10 months ago by Simon.
October 13, 2015 at 21:07 #298
In any diphone TTS system (either a simple system or a unit selection system that uses diphones), is it correct that UNLESS the front end tags a phrase break in the text, the system treats the sequence of words or tokens as uninterrupted, and therefore searches for diphones that go ACROSS the word/token boundary? Or are word/token boundaries treated as very small gaps, not as large as a phrase break, but still big enough to require a diphone that goes from the middle of the last phone of a word to silence? (Leaving aside for the moment post-lexical liaisons and r-insertions, which would obviously imply complete continuity.)
-
October 14, 2015 at 11:06 #308
There is nothing special about cross-word diphones compared to within-word diphones. Speech does not have “gaps” between words unless there is a phrase break. We can use diphones recorded within a word to synthesise across a word boundary.
You correctly state that phrase breaks will only occur in places that the front-end predicts. All other word boundaries are just continuous diphone sequences, no different to within the words.
Of course, the number of possible diphones across word boundaries is higher than within words (where phonology constrains the possible combinations). So, we are much more likely to encounter low-frequency (i.e., rare) diphones across word boundaries.
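To make the point about word boundaries concrete, here is a minimal sketch (not Festival's code) of how a phone sequence might be turned into a diphone sequence. The function name `phones_to_diphones`, the silence symbol `sil`, and the phone symbols in the example are all assumptions chosen for illustration. Silence is inserted only at the utterance edges and after words where a phrase break was predicted; every other word boundary simply yields an ordinary diphone.

```python
from typing import AbstractSet, List

SIL = "sil"  # assumed name for the silence phone; real phone sets may differ


def phones_to_diphones(
    words: List[List[str]],
    breaks_after: AbstractSet[int] = frozenset(),
) -> List[str]:
    """Convert words (each a list of phones) into a diphone sequence.

    breaks_after holds the indices of words followed by a predicted
    phrase break. No silence is added at any other word boundary.
    """
    phones = [SIL]  # utterance-initial silence
    for i, word in enumerate(words):
        phones.extend(word)
        # Add silence only where the front end predicted a phrase break
        # (the utterance-final silence is added once, below).
        if i in breaks_after and i != len(words) - 1:
            phones.append(SIL)
    phones.append(SIL)  # utterance-final silence
    # Each adjacent pair of phones becomes one diphone, regardless of
    # whether the pair falls inside a word or across a word boundary.
    return [f"{a}_{b}" for a, b in zip(phones, phones[1:])]


# Illustrative transcription of "to the white" (phone symbols assumed).
words = [["t", "@"], ["dh", "@"], ["hw", "ai", "t"]]
print(phones_to_diphones(words))
print(phones_to_diphones(words, breaks_after={0}))
```

Without a phrase break, the sequence contains the cross-word diphones `@_dh` and `@_hw`. With a break predicted after the first word, `@_dh` is replaced by `@_sil` followed by `sil_dh`.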
-
October 19, 2015 at 20:39 #361
Following up on this: In Festival, it appears that when it can’t find a diphone, it will back off to the next-best diphone, EXCEPT when the missing diphone is an ‘Interword’, in which case it inserts silence. As one would guess, that sounds bad when the utterance is played. Here is an example from Festival after issuing the Wave_Synth command on my utterance:
Missing diphone: @_dh
Interword so inseting silence.
Missing diphone: @_hw
Interword so inseting silence.
Missing diphone: jh_iii
diphone still missing, backing off: jh_iii
backed off: jh_iii -> jh_ii
Missing diphone: ch_z
Interword so inseting silence.
The first 2 specific diphones that it can’t find don’t strike me as particularly rare. The actual word sequence of @_dh is “to the” and @_hw is from “the white”.
Why would such (seemingly) common diphones not be in the database? Does the diphone set for this voice ONLY contain diphones that were recorded within words, or does it have SOME interword-only (diphones NOT derived from within words) diphones, just not all possible ones?
-
October 19, 2015 at 20:52 #362
You have correctly found that this voice does indeed have many missing diphones. A larger or more carefully-designed recording script would not have this problem.
The reason this happens so frequently for this voice is that the diphone coverage was determined using one dictionary (CMUlex) but the voice has been built with a different dictionary (Unisyn). Normally, we wouldn’t do that, but it’s useful for the purposes of this assignment to show what happens when diphones are missing.
The database comprises sentences of connected speech, so does have both within- and across-word diphones.
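The effect of designing the script with one dictionary and building the voice with another can be illustrated with a toy example. This is only a sketch: the two lexicons below (`LEX_A`, `LEX_B`) are invented for illustration and are not real CMUlex or Unisyn entries, and silence is ignored for simplicity. The script covers every diphone needed for the target text under the first lexicon, but not under the second.

```python
from typing import Dict, List, Set

Lexicon = Dict[str, List[str]]

# Two invented toy lexicons that transcribe the same words differently.
LEX_A: Lexicon = {
    "the": ["dh", "ax"],
    "wine": ["w", "ay", "n"],
    "white": ["w", "ay", "t"],   # same onset as "wine"
}
LEX_B: Lexicon = {
    "the": ["dh", "@"],
    "wine": ["w", "ai", "n"],
    "white": ["hw", "ai", "t"],  # different onset from "wine"
}


def diphones_in(sentences: List[List[str]], lexicon: Lexicon) -> Set[str]:
    """Collect every diphone (within-word and cross-word) in the sentences."""
    found: Set[str] = set()
    for sentence in sentences:
        # Concatenate the phones of all words: no gap at word boundaries.
        phones = [p for word in sentence for p in lexicon[word]]
        found.update(f"{a}_{b}" for a, b in zip(phones, phones[1:]))
    return found


script = [["the", "wine"], ["white"]]  # what was recorded
target = [["the", "white"]]            # what we want to synthesise

missing_a = diphones_in(target, LEX_A) - diphones_in(script, LEX_A)
missing_b = diphones_in(target, LEX_B) - diphones_in(script, LEX_B)
print("missing under lexicon A:", sorted(missing_a))
print("missing under lexicon B:", sorted(missing_b))
```

Under the design lexicon nothing is missing, because "the wine" supplies the cross-word diphone needed for "the white". Under the build lexicon the cross-word diphone `@_hw` never occurs in the recordings, so it is missing.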
The database is the awb speaker (i.e., Alan Black himself) from the ARCTIC set of corpora.
Footnote: the voice is called voice_cstr_edi_awb_arctic_multisyn, which means “built in CSTR / Edinburgh accent / speaker ‘awb’ / ARCTIC corpus / multisyn unit selection engine”.
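For reference, the behaviour seen in the log earlier in this thread (insert silence for a missing cross-word diphone, otherwise back off to a substitute) could be sketched roughly as follows. This only mimics what the log messages suggest and is not Festival's actual implementation; the function `resolve_diphone`, the `substitutes` table, and the order in which substitutions are tried are all assumptions.

```python
from typing import Dict, List, Set

SIL = "sil"  # assumed name for the silence phone


def resolve_diphone(
    left: str,
    right: str,
    interword: bool,
    inventory: Set[str],
    substitutes: Dict[str, str],
) -> List[str]:
    """Return the diphone(s) to use in place of left_right.

    interword is True when the diphone spans a word boundary.
    substitutes maps a phone to a hypothetical back-off phone.
    """
    name = f"{left}_{right}"
    if name in inventory:
        return [name]
    if interword:
        # Missing cross-word diphone: split it around a silence.
        # (This sketch does not check that the silence diphones exist.)
        return [f"{left}_{SIL}", f"{SIL}_{right}"]
    # Missing within-word diphone: try substituting the right phone,
    # then the left phone, then both.
    alt_left = substitutes.get(left, left)
    alt_right = substitutes.get(right, right)
    for candidate in (
        f"{left}_{alt_right}",
        f"{alt_left}_{right}",
        f"{alt_left}_{alt_right}",
    ):
        if candidate in inventory:
            return [candidate]
    raise LookupError(f"no back-off found for diphone {name}")


inventory = {"jh_ii", "@_sil", "sil_dh"}
substitutes = {"iii": "ii"}  # mirrors the jh_iii -> jh_ii line in the log
print(resolve_diphone("jh", "iii", False, inventory, substitutes))
print(resolve_diphone("@", "dh", True, inventory, substitutes))
```

The inserted silence is why the cross-word cases sound bad: a pause appears in the middle of what should be continuous speech, even though no phrase break was predicted there.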