Decision Tree Clustering
- This topic has 1 reply, 2 voices, and was last updated 8 years, 9 months ago by Simon.
April 23, 2016 at 20:40 #3166
I have a couple of small things I think I need cleared up about decision-tree clustering.
Would I be correct in saying the following? Data sparsity arises because the database only provides speech parameters for some of the context-dependent HMMs we would want to build. To address this, we use phonetic features such as voicing, place and manner of articulation to quantify how acoustically similar the possible context-dependent phones will be, with the somewhat obvious assumption that, for example, phones sharing the same place of articulation are more acoustically similar than those that do not.
We implement this clustering using decision trees: the phonetic features act as constraints on the decision tree nodes, and the acoustically similar models become clustered somewhat “organically” through this process. We do not need to explicitly determine which models should be clustered together, but rather just let the constraints do the work for us. I’ve attached an image that states that nasalisation is a rather important phonetic difference, and that the vowel/consonant specification of the preceding phone is much less important.
At this stage my question is: once we have the clustered models, how do we represent this in a generated waveform? Do we take an average of the parameters of the clustered models and use that as our new model?
April 26, 2016 at 16:01 #3168
Your reasoning behind why we need to cluster (also called “tie”) models is correct, yes.
The nodes in the tree each contain a question about a phonetic feature (e.g., “is the previous phone nasal?”). The tree is simply a CART. The phonetic features are the predictors. The predictee is the current model state’s parameters (mean and variance of its Gaussian).
The tree is learned in very much the same way as a classification or regression tree.
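To make the tree-learning step concrete, here is a minimal Python sketch of how one question might be chosen at a single node. It is an illustration under simplifying assumptions, not the procedure of any particular toolkit: the observations are one-dimensional, every frame is assumed to belong wholly to this state, and the toy data, the context format and the names `gaussian_log_likelihood` and `best_question` are all made up for this example. Each candidate question splits the data in two, and the question that gives the largest gain in log likelihood (when each side is modelled by its own Gaussian) is selected.

```python
import math


def gaussian_log_likelihood(values):
    """Log likelihood of `values` under a 1-D Gaussian fitted to those same values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-6)  # variance floor, so a tiny cluster cannot give log(0)
    # Closed form for the maximum-likelihood fit (approximate if the floor applies).
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def best_question(data, questions):
    """Pick the question whose yes/no split gives the largest likelihood gain.

    data:      list of (context, value) pairs for one model state
    questions: dict mapping a question name to a predicate on the context
    """
    base = gaussian_log_likelihood([v for _, v in data])
    best_name, best_gain = None, 0.0
    for name, ask in questions.items():
        yes = [v for c, v in data if ask(c)]
        no = [v for c, v in data if not ask(c)]
        if not yes or not no:
            continue  # this question does not actually split the data
        gain = gaussian_log_likelihood(yes) + gaussian_log_likelihood(no) - base
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Hypothetical toy data: one acoustic value per frame, keyed by the previous phone.
data = [
    ({"prev": "m"}, 1.0), ({"prev": "n"}, 1.2), ({"prev": "n"}, 0.9),
    ({"prev": "a"}, 5.0), ({"prev": "t"}, 5.1), ({"prev": "i"}, 4.8),
    ({"prev": "s"}, 5.2),
]
questions = {
    "prev_is_nasal": lambda c: c["prev"] in {"m", "n", "ng"},
    "prev_is_vowel": lambda c: c["prev"] in {"a", "i", "u"},
}

name, gain = best_question(data, questions)
print(name, round(gain, 2))
```

With this toy data the nasal question separates the values far better than the vowel question, so it is the one chosen. In a full tree the same step is applied recursively to each child node until some stopping criterion is met.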
Your question about how this eventually affects the generated waveform can be restated in two parts:
1. how does this affect the models’ parameters?
2. how do model parameters affect the waveform that they generate?
The answer to 1. you have already figured out: the models share parameters, that’s all. We don’t need to average the group of models (actually, model states) that end up at a leaf. We simply have only one shared (= tied) state there, and it is trained on all the corresponding data. So, if you like, you might instead think of the tree as finding all the suitable data that this shared state should be trained on, pooled across a group of sufficiently similar contexts.
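A small numerical sketch may help show why training one tied state on pooled data is not the same as averaging separately trained models. The context labels and frame values below are invented for illustration, the observations are one-dimensional, and each frame is assumed to be assigned to exactly one state (real training uses soft state occupancies).

```python
from statistics import fmean, pvariance

# Hypothetical frames that reached the same leaf, grouped by their original context.
frames_by_context = {
    "m-a+t": [1.0, 1.2, 0.8, 1.0],  # a well-represented context
    "n-a+t": [2.0],                 # a sparse context with a single frame
}

# Averaging per-context models treats the sparse context as equally reliable.
per_context_means = [fmean(frames) for frames in frames_by_context.values()]
average_of_means = fmean(per_context_means)

# The tied state is instead estimated once, from all the pooled frames.
pooled = [x for frames in frames_by_context.values() for x in frames]
tied_mean = fmean(pooled)
tied_variance = pvariance(pooled)

print("average of per-context means:", average_of_means)
print("tied state mean:", tied_mean)
print("tied state variance:", tied_variance)
```

The pooled estimate weights each context by how much data it contributes, and its variance reflects the spread across contexts as well as within them. Note too that the single-frame context could not have supported a variance estimate on its own, which is the sparsity problem that tying addresses.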
The answer to 2. is via the usual generation process of statistical parametric speech synthesis: the models generate trajectories of vocoder parameters, and those are then vocoded into a waveform.