How to make a "better" system with unit selection?
- This topic has 1 reply, 2 voices, and was last updated 9 years ago by Simon.
January 16, 2016 at 10:14 #2088
I’m trying to summarize the ways in which unit selection systems differ from each other, to get a general picture of the problem, and therefore the aspects we could modify to build a better, higher-quality system (I have added some questions along the way). The aspects we can modify are:
– As you say in one of the videos:
*Database: “The quality of a unit selection system depends very much on the speech database, both the quality of the recorded speech and the accuracy of the labels”. Is the database labelled by hand?
According to Taylor, though, the choice of unit type (diphones or others) would not change the quality. Taylor adds that the size of the database is also important (because it becomes more likely that we find “natural” joins between units that match the specification), as is its “coverage”.
– Taylor mentions other aspects too:
*Features: the feature system we use to describe the units and the target specification (s), the number of features (which affects data sparsity), and the weights we give to each of them (W).
*Target cost: this follows from the choice of features, so a different feature set gives a different target cost. It also depends on the distance metric we use to compare features and assign a cost (acoustic or perceptual); alternatively, we could define a target function.
*Join cost: how do we specify it: with an acoustic or categorical distance, a probabilistic function, or a join classifier?
*Search: how do we search for units in the database and, especially, what do we do when the database has no unit with the feature combination that the specification requires? One solution is to back off to lower units: in that case, does the database have to be labelled with all of these different units?
Is there any other aspect of the system we can modify to make it better?
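To make the pieces in the list above concrete, here is a minimal sketch of how features, weights, target cost, join cost and search fit together. Everything in it is an assumption made for illustration: the `Unit` fields, the `WEIGHTS` values, the F0-only join cost and the toy database are hypothetical, not taken from Taylor or from any real system. The search is a standard Viterbi pass over the candidate lattice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A toy database unit (all fields are illustrative assumptions)."""
    name: str          # unit type, e.g. the diphone "h-e"
    stress: int        # categorical linguistic feature
    phrase_final: int  # categorical linguistic feature
    f0_start: float    # F0 (Hz) at the left edge of the unit
    f0_end: float      # F0 (Hz) at the right edge of the unit


# Hypothetical feature weights (the "W" in the list above).
WEIGHTS = {"stress": 1.0, "phrase_final": 0.5}


def target_cost(spec, cand):
    """Weighted count of mismatched categorical features."""
    return sum(w for f, w in WEIGHTS.items() if spec[f] != getattr(cand, f))


def join_cost(left, right):
    """Toy acoustic join cost: F0 discontinuity at the concatenation point."""
    return abs(left.f0_end - right.f0_start) / 100.0


def select_units(specs, database):
    """Viterbi search for the candidate sequence with the lowest total cost."""
    lattice = [[u for u in database if u.name == s["name"]] for s in specs]
    if any(not column for column in lattice):
        raise ValueError("no candidates for a target; a real system would back off")
    # cost[i][j]: cheapest path ending in candidate j at position i
    cost = [[target_cost(specs[0], c) for c in lattice[0]]]
    back = [[None] * len(lattice[0])]
    for i in range(1, len(specs)):
        row_cost, row_back = [], []
        for cand in lattice[i]:
            options = [cost[i - 1][k] + join_cost(prev, cand)
                       for k, prev in enumerate(lattice[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[k_best] + target_cost(specs[i], cand))
            row_back.append(k_best)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(specs) - 1, -1, -1):
        path.append(lattice[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy database and target specification (invented values).
DATABASE = [
    Unit("h-e", 1, 0, 120.0, 130.0),
    Unit("h-e", 0, 0, 118.0, 180.0),
    Unit("e-l", 0, 0, 132.0, 125.0),
    Unit("e-l", 0, 0, 200.0, 110.0),
]
SPECS = [
    {"name": "h-e", "stress": 1, "phrase_final": 0},
    {"name": "e-l", "stress": 0, "phrase_final": 0},
]

path, total = select_units(SPECS, DATABASE)
```

Changing `WEIGHTS`, swapping the mismatch count for an acoustic distance, or replacing `join_cost` with a classifier would each correspond to one of the design choices listed above, while the search itself stays the same.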
Finally, I want to add a personal impression of this synthesis method: it does not seem much like “synthesis” to me. It looks like a clever way to use canned speech without it really being raw canned speech, because we hope the database contains as much as possible that matches the specification exactly. So, more than a synthesis problem, it is really a data storage and search problem: when the units are actually concatenated, speech processing (which to me seems more like “synthesis”) is avoided as much as possible… but it is just an opinion!
January 17, 2016 at 10:44 #2161
Your list of design choices is pretty comprehensive, I think. We’ll recap that in the lecture.
Yes, unit selection really is “synthesis” because it can create an appropriate waveform for any given input text (via the linguistic specification). The data storage and search are how it is implemented but the clever part is deciding what to do when the database doesn’t contain the exact units that we need (and that will happen almost all the time).
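As a rough illustration of what “deciding what to do” can look like, here is a minimal sketch of one possible fallback: if no unit matches the full context, progressively drop the least important features until some candidates are found. This is only an assumption for illustration. The feature names, their ordering in `FEATURES` and the helper functions are hypothetical, and it backs off over context features rather than to smaller unit types.

```python
from collections import defaultdict

# Hypothetical context features, most important first.
FEATURES = ("name", "stress", "phrase_final")


def build_index(units):
    """Index every unit under each prefix of its feature tuple."""
    index = defaultdict(list)
    for unit in units:
        for n in range(1, len(FEATURES) + 1):
            key = tuple(unit[f] for f in FEATURES[:n])
            index[key].append(unit)
    return index


def candidates_with_backoff(spec, index):
    """Return (candidates, n) where n is the number of features matched.

    Tries the full context first, then drops the least important
    feature one at a time. Returns ([], 0) if even the unit name is
    missing from the database.
    """
    for n in range(len(FEATURES), 0, -1):
        key = tuple(spec[f] for f in FEATURES[:n])
        if key in index:
            return index[key], n
    return [], 0
```

The candidates returned this way would still be ranked by the target and join costs during the search; the back-off only ensures that the candidate list is not empty when the exact context is missing.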