'Perceptual/Acoustic Space' in ASF
January 16, 2016 at 02:00 #2087
Regarding how and why different feature combinations are placed within a given ‘acoustic space’, Taylor doesn’t go much further than saying that “the positions of the feature combinations are not determined by feature values, but rather by the acoustic definitions of each feature combination. Hence, these can lie at any arbitrary point in space” (Taylor: 494), which I don’t find particularly helpful. I understand that HMMs can be used, modeled with multivariate Gaussians, but I would like to know more concretely how the placement of feature combinations in relation to each other is determined. A vague question, maybe, but not being able to visualize the process really bugs me.
January 17, 2016 at 10:30 #2159
We’ll look at this in the lecture.
By “arbitrary point” Taylor means that the acoustic features take continuous values (e.g., F0 in Hz) rather than discrete values (e.g., “stressed?”).
Those acoustic values have been predicted, given the linguistic features. We actually already know how to build a model to make such predictions – see Speech Processing…
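To make that concrete, here is a minimal sketch (in Python) of predicting a continuous acoustic value from discrete linguistic features with a CART-style regression tree, the kind of model covered in Speech Processing. The features and numbers are invented for illustration; this is not Taylor's actual system:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy example: three binary linguistic features per unit
# [stressed?, phrase-final?, syllable-onset?] -- values invented for illustration
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])

# The continuous acoustic feature we want to predict: mean F0 in Hz
y = np.array([180.0, 150.0, 120.0, 100.0, 170.0, 110.0])

# A regression tree asks questions about the discrete linguistic features
# and stores a continuous prediction (the mean of its training examples)
# in each leaf
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Predict F0 for an unseen feature combination: the output is a continuous
# value, i.e. an "arbitrary point" on the F0 axis, not one of a fixed set
# of classes
print(tree.predict(np.array([[1, 1, 1]])))
```

In a real system the acoustic space has many more dimensions (F0, duration, spectral features, …) and far richer linguistic features, but the principle is the same: discrete linguistic features in, continuous acoustic values out.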
January 22, 2016 at 19:24 #2210
A related question: it seems to me that the biggest difference between IFF and ASF is that IFF measures the distance between target and candidate unit using weighted linguistic features, whereas ASF measures it using acoustic features. Is that right? Since these acoustic features are determined by the linguistic features, there should be some direct mappings from linguistic features to acoustic features, so the two approaches seem very similar to me. I do not understand why Taylor says they are two distinct approaches.
January 22, 2016 at 21:04 #2216
Your descriptions of IFF and ASF are correct. You are also right to say that the acoustic features in ASF are predicted from the same linguistic features used in IFF.
The key point to understand is that different combinations of linguistic features can all map to the same (or very similar) acoustic features. So, sparsity might be less of a problem in the ASF case.
In other words, we don’t really need to find a candidate unit that has the same linguistic features as the target, we just need it to sound like it has the same linguistic features.
However, for an ASF target cost to work well, we need to
- predict the acoustic features accurately from the linguistic features
- measure distances in acoustic space in a way that correlates with perception
Neither of those is trivial. Your phrase “direct mappings” suggests these mappings are easy to learn: they are not.
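To illustrate the second point, here is a sketch of one plausible ASF-style target cost: assume each target's linguistic feature combination has been mapped (e.g., via HMM training, as Taylor describes) to a multivariate Gaussian over acoustic features, and take the Mahalanobis distance from a candidate unit's measured acoustics to that Gaussian. All numbers are invented for illustration, and whether such a distance actually correlates with perception is precisely the non-trivial part:

```python
import numpy as np

def asf_target_cost(candidate_acoustics, target_mean, target_cov):
    """Mahalanobis distance from a candidate unit's acoustic features
    to the Gaussian predicted for the target's linguistic features."""
    diff = candidate_acoustics - target_mean
    return float(np.sqrt(diff @ np.linalg.inv(target_cov) @ diff))

# Hypothetical 2-D acoustic space: [F0 (Hz), duration (ms)]
target_mean = np.array([160.0, 90.0])   # predicted from the linguistic features
target_cov = np.array([[400.0,   0.0],  # wide tolerance for F0 deviations...
                       [  0.0, 100.0]]) # ...tighter tolerance for duration

candidate_a = np.array([150.0,  92.0])  # close to the target distribution
candidate_b = np.array([150.0, 120.0])  # same F0 deviation, duration way off

print(asf_target_cost(candidate_a, target_mean, target_cov))  # ~0.54 (low cost)
print(asf_target_cost(candidate_b, target_mean, target_cov))  # ~3.04 (high cost)
```

Getting target_mean and target_cov right for unseen feature combinations is the first point: exactly the prediction problem sketched earlier in the thread.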