› Forums › Speech Synthesis › HMM synthesis › Eigenvoice
This topic has 3 replies, 2 voices, and was last updated 8 years, 5 months ago by Simon.
March 13, 2016 at 20:16 #2806
Could you offer a clear and simple explanation of this term/concept? All the definitions I’ve found are a little beyond my grasp.
March 15, 2016 at 09:48 #2810
Informally, think of eigenvoices as being a set of “axes” in some abstract “speaker space”. We can create a voice for any new speaker (i.e., we can do speaker adaptation) as a weighted combination of these eigenvoices. The only parameters we need to learn are the weights. Because the number of weights will be very small (compared to the number of model parameters), we can learn them from a very small amount of data.
When you first try to understand this concept, it’s OK to imagine that the eigenvoices correspond to the actual real speakers in the training set.
In fact we can do better than that, by finding a set of basis vectors that is as small as possible (smaller than the number of training speakers) whilst still being able to represent all the different “axes” of variation across speakers.
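To make the idea concrete, here is a toy numerical sketch (not the actual HMM machinery, and all the data here is synthetic): each speaker's model parameters are flattened into a "supervector", the eigenvoices are principal directions of the training speakers' supervectors, and adapting to a new speaker means estimating only a handful of weights rather than all the model parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 training speakers, each represented as a
# 1000-dimensional supervector of model mean parameters.
train = rng.normal(size=(20, 1000))

# Build eigenvoices via PCA: a mean voice plus the top-k principal
# directions of variation across the training speakers.
mean_voice = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean_voice, full_matrices=False)
k = 5
eigenvoices = vt[:k]                     # (k, 1000) basis vectors

# Adaptation: for a new speaker we estimate only k weights (here by
# simple projection), then reconstruct a full model as a weighted
# combination of the eigenvoices.
new_speaker = rng.normal(size=1000)
weights = eigenvoices @ (new_speaker - mean_voice)
adapted = mean_voice + weights @ eigenvoices
```

The point of the sketch is the parameter count: `weights` has only 5 entries, while the full model has 1000, which is why very little adaptation data is needed.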
(To get into more depth, this topic would need more than just this written forum answer. I can consider including it in lecture 10 if you wish.)
March 15, 2016 at 10:33 #2811
Most references to eigenvoices mention that the concept was derived from ‘eigenfaces’, of which the wikipedia entry says:
“Informally, eigenfaces can be considered a set of “standardized face ingredients”, derived from statistical analysis of many pictures of faces. Any human face can be considered to be a combination of these standard faces. For example, one’s face might be composed of the average face plus 10% from eigenface 1, 55% from eigenface 2, and even -3% from eigenface 3. Remarkably, it does not take many eigenfaces combined together to achieve a fair approximation of most faces.”

That corresponds very closely to what you wrote above about mixing the basis vectors with different weights. Can we take the analogy further and say that these eigenvoices can be thought of as a set of ‘standardized voice ingredients’?
And as with eigenfaces, which we can look at and which resemble blurry approximations, or perhaps templates, of different kinds of actual faces, could we listen to eigenvoices, and would they sound like fuzzy approximations, or some kind of aural foundation, of different kinds of voices?
March 15, 2016 at 11:05 #2812
Yes, thinking of eigenvoices as “standardized voice ingredients” is reasonable.
One problem with trying to listen to eigenvoices is that the models are constructed in a normalised space, so it doesn’t actually make sense to synthesise directly from the underlying models. The same problem applies to the eigenvoices themselves: they may not make sense when heard on their own.
Here are some slides and examples from Mark Gales that give an overview of the main ideas of “Controllable and Adaptable Speech Synthesis”.