ASF – translating linguistic features to acoustic representation
In ASF, do we typically view, for example, cepstral coefficient vectors as the acoustic representation of the linguistic features?
Does this mean that, as described in Taylor Chapter 16.4.1 and 16.4.2, we use these vectors in either the HMM method or decision-tree clustering to learn the distribution of observed feature combinations (e.g. stress and phrase finality)?
Basically, the question I am asking is: how do we first translate the linguistic features into acoustic feature values?
Predicting acoustic features from linguistic features is a regression problem. We already have the necessary labelled training data: the speech database that will be used for unit selection.
One way to do the regression would be to train a regression tree (a CART). This is the method used in so-called “HMM-based speech synthesis” that we will cover in the second half of the course. But in HMM synthesis, the predicted acoustic features are used as input to a vocoder to create a waveform, rather than in an ASF target cost function.
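To make that more concrete, here is a minimal sketch in Python using scikit-learn. It is not code from any real synthesiser: the feature encoding, the data shapes (20 linguistic features, 13 cepstral coefficients) and the tree settings are all placeholder assumptions, and the data is random just so the snippet runs.

# Sketch: a regression tree that maps encoded linguistic features
# (e.g. one-hot phone identity, stress flag, phrase-finality flag)
# to a vector of cepstral coefficients for each unit.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_linguistic = np.random.rand(1000, 20)   # placeholder linguistic features
Y_cepstral = np.random.rand(1000, 13)     # placeholder acoustic features

tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20)
tree.fit(X_linguistic, Y_cepstral)

# Predict the acoustic representation for one target unit's
# linguistic feature vector.
x_target = np.random.rand(1, 20)
y_predicted = tree.predict(x_target)      # shape (1, 13)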
We might then replace the tree with a better regression model: a neural network. We’ll cover this method after HMM synthesis.
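The same mapping with a neural network in place of the tree might look like this sketch, again with placeholder data and made-up layer sizes rather than any recommended recipe:

# Sketch: the same linguistic-to-acoustic regression with a small
# feed-forward network instead of a tree.
import numpy as np
from sklearn.neural_network import MLPRegressor

X_linguistic = np.random.rand(1000, 20)   # placeholder linguistic features
Y_cepstral = np.random.rand(1000, 13)     # placeholder acoustic features

net = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=300)
net.fit(X_linguistic, Y_cepstral)
y_predicted = net.predict(np.random.rand(1, 20))   # shape (1, 13)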
Once we know about HMM and neural network speech synthesis (both using vocoders rather than unit selection + waveform concatenation), we can then come back to the ASF formulation of unit selection. We will find that this is usually called “hybrid speech synthesis” and is covered towards the end of the course.
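Looking ahead, here is one hedged illustration of how such predictions could feed an ASF-style target cost: the cost of a candidate unit is its distance, in acoustic space, from the acoustic features predicted for the target. A plain Euclidean distance is used purely for illustration; hybrid systems typically use a probabilistic (likelihood-based) cost instead.

# Sketch: score candidate units by distance from the predicted
# acoustic features of the target position.
import numpy as np

def asf_target_cost(predicted, candidate):
    # Euclidean distance between predicted and candidate acoustic vectors.
    return float(np.linalg.norm(predicted - candidate))

y_predicted = np.random.rand(13)            # placeholder predicted cepstra
candidate_units = np.random.rand(50, 13)    # placeholder candidate cepstra
costs = [asf_target_cost(y_predicted, c) for c in candidate_units]
best = int(np.argmin(costs))                # index of the cheapest candidate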