Module 6 – unit selection and diphone concatenation
- This topic has 1 reply, 2 voices, and was last updated 2 years, 2 months ago by Simon.
October 31, 2022 at 09:27 #16177
The most basic diphone database would consist of all combinations of the phones in a given language (plus combinations with silence). Selecting units from this database is a trivial task, as there is only one possible match for every target diphone. That is why we don’t talk about unit selection as a specific step in diphone concatenation. However, it seemed to be hinted at as a step in both the videos and the reading, which makes me think that the diphone concatenation model and the unit selection model exist on a continuum: some databases will have the basic inventory described above plus some additional combinations or a few longer utterances, and the unit selection process would then be applied, even though the database is almost as simple as the basic one.
Could we think of diphone concatenation and unit selection synthesis not as a binary, but as one model of waveform generation in which the relative importance of each step depends on what’s in the database? The unit selection step would be trivial for simple databases and central for databases with long utterances, whereas signal processing would be trivial for databases with long utterances but the main step for simple databases. All steps would technically be present in both diphone concatenation and unit selection synthesis.
November 3, 2022 at 11:05 #16224
Yes, it’s possible to think of diphone synthesis and unit selection along a continuum.
At one end is diphone synthesis, in which we have exactly one copy of each diphone type, so there is no need for a selection algorithm, but we will need lots of signal processing to manipulate the recordings.
At the other end is unit selection with an infinitely large database containing all conceivable variants of every possible diphone type. Now, selection from that database becomes the critical step. With perfect selection criteria, we will find such a perfect sequence of units that no signal processing will be required: the units will already have exactly the right acoustic properties for the utterance being synthesised.
Real systems, of course, live somewhere in between. We can’t have an infinite database; even if we could, there are no perfect selection criteria (target cost and join cost). A real system will select a pretty good sequence of units most of the time. A little signal processing might be employed, for example to smooth the joins.
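To make the continuum concrete, here is a minimal sketch in Python of selection as a search that minimises target cost plus join cost. It is a toy under stated assumptions, not how any particular synthesiser implements it: the `Unit` class, the use of a single F0 value per unit as the only acoustic feature, the absolute-difference cost functions, and the `join_weight` parameter are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded candidate in the database (hypothetical toy representation)."""
    diphone: str  # diphone type, e.g. "k-ae"
    f0: float     # assumed single pitch value in Hz, standing in for all acoustics


def target_cost(desired_f0, unit):
    # Assumption: mismatch between what we want and what the unit has.
    return abs(desired_f0 - unit.f0)


def join_cost(left, right):
    # Assumption: mismatch across the join between two consecutive units.
    return abs(left.f0 - right.f0)


def select_units(targets, database, join_weight=1.0):
    """Dynamic-programming (Viterbi-style) search for the lowest-cost sequence.

    targets:  list of (diphone, desired_f0) pairs
    database: dict mapping diphone type -> list of candidate Units
    Returns (chosen_units, total_cost).
    """
    if not targets:
        return [], 0.0
    candidates = [database[diphone] for diphone, _ in targets]
    # Best accumulated cost of any path ending in each candidate at position 0.
    costs = [target_cost(targets[0][1], u) for u in candidates[0]]
    backpointers = []
    for i in range(1, len(targets)):
        new_costs, pointers = [], []
        for unit in candidates[i]:
            # Cheapest predecessor for this candidate, including the join.
            j = min(
                range(len(candidates[i - 1])),
                key=lambda k: costs[k] + join_weight * join_cost(candidates[i - 1][k], unit),
            )
            best_prev = costs[j] + join_weight * join_cost(candidates[i - 1][j], unit)
            new_costs.append(best_prev + target_cost(targets[i][1], unit))
            pointers.append(j)
        costs = new_costs
        backpointers.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(costs)), key=costs.__getitem__)
    total = costs[k]
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)], total
```

With exactly one candidate per diphone type, the search has nothing to choose between: it returns the only possible sequence, and whatever cost remains is the mismatch that signal processing would have to remove. As more candidates are added per type, the search does more of the work and the residual cost (and hence the need for signal processing) tends to shrink.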