Forum Replies Created
In the video, the question Does letter n=”r”? is just one of many possible questions that we will try when splitting the root node. One of those questions will reduce entropy more than the others; that one is placed in the tree, and the data are permanently partitioned down its “Yes” and “No” branches.
Then we recurse. That means that we simply apply precisely the same procedure separately to each of the two child nodes that we have just created; then we do the same for their child nodes, and so on until we decide to stop.
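Here is a minimal sketch of that recursion in Python. The helper functions should_stop, best_question, partition and make_leaf are hypothetical, just to show the shape of the algorithm:

def build_tree(data, questions):
    # Stop and make a leaf when no further splitting is worthwhile
    if should_stop(data, questions):           # hypothetical helper
        return make_leaf(data)                 # hypothetical helper
    # Choose the question that reduces entropy the most
    question = best_question(data, questions)  # hypothetical helper
    yes_data, no_data = partition(data, question)
    # Recurse: exactly the same procedure, separately for each child
    return {"question": question,
            "yes": build_tree(yes_data, questions),
            "no": build_tree(no_data, questions)}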
Why is entropy used to choose the best question?
Think about the goal of the tree: it is to make predictions. In other words, we want to partition the data in a way that makes the value of the predictee less random (i.e., less unpredictable) and more predictable.
Entropy is a measure of how unpredictable a random variable is. The random variable here is the predictee. We partition the data in the way that makes the distribution of the predictee as non-uniform as possible.
Ideally, we want all data points within each partition to have the same value for the predictee. That would mean zero entropy.
If we can’t achieve that, then we choose the split that has the lowest possible entropy.
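To make entropy concrete, here is a small Python illustration (my own sketch, not code from Festival) of the entropy, in bits, of the label distribution within a partition:

import math
from collections import Counter

def entropy(labels):
    # Entropy, in bits, of the distribution of labels in a partition
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

entropy(["a", "a", "a", "a"])   # 0 bits: a pure partition
entropy(["a", "b", "c", "d"])   # 2 bits: maximally unpredictable over 4 values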
A weighted sum gives a weight (or “importance”) to each of the items being added together. Items with larger weights have more effect on the result, and vice versa.
In the CART training algorithm, a weighted sum is used to compute the total entropy of a possible partition of the data. The weighting is needed to correct for the fact that each side of the partition (the “Yes” and “No” branches) might have differing numbers of data points, and to make the result comparable to the entropy at the parent node. We set the weights in the weighted sum to reflect the fraction of data points on each side.
Imagine this example:
We have 1000 data points at a particular node in the tree, and the entropy here is 3.4 bits.
We try a question, and the result is that 500 data points go down the “No” branch and 500 data points go down the “Yes” branch.
This question turns out to be pretty useless, because the distribution of predictee values in each branch remains about the same as at the parent node. So, the entropy in each side is going to be about 3.4 bits.
Simply adding these two values (an unweighted sum) would give the wrong answer of 6.8 bits. We need a weighted sum instead:
(0.5 x 3.4) + (0.5 x 3.4) = 3.4 bits
The same argument holds whatever the entropy of the two branches, and whatever proportion of data points goes down each branch.
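To make the arithmetic explicit, here is the same calculation as a short, self-contained Python sketch:

# Example from above: 500 data points in each branch, each with entropy 3.4 bits
branch_sizes = [500, 500]
branch_entropies = [3.4, 3.4]

total = sum(branch_sizes)
weighted_entropy = sum((n / total) * h
                       for n, h in zip(branch_sizes, branch_entropies))
print(weighted_entropy)   # 3.4 bits, the same as at the parent node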
This is a little beyond the Speech Processing course, but is covered fully in the more advanced Speech Synthesis course.
The short answer is that most statistical parametric speech synthesisers (whether HMM or DNN) use a source-filter model to generate the waveform. The HMM or DNN predicts the parameters of the source (e.g., F0) and of the filter (e.g., its frequency response).
Siyu, your understanding is exactly right: CART partitions the feature (predictor) space in a binary fashion. Each node lower down the tree subdivides the partition created by the answer (“Yes” or “No”) of its parent node.
See Figure 1 on this page, for a regression tree example.
Enno correctly points out that you can transform the features in any way you wish. But, it’s important to recognise that this would be “feature engineering” and would be done before starting to build the CART. The CART training algorithm can only select amongst the questions provided about the predictors; it cannot invent new predictors, or new questions, or learn feature transforms.
You could just pick one at random. Or, you might have some secondary criterion such as choosing the one with the most balanced split (i.e., the one closest to a 50/50 split) on the grounds that small partitions are a bad thing.
Why is a small partition a bad thing? What consequences might it have when we try to split it?
On the system here in Edinburgh, you can see a dictionary that marks word sense at
/Volumes/ss/festival/festival_mac/festival/lib/dicts/unilex/unilex-edi.out
You can count the number of entries that include a word sense – it’s very small: 342 out of 116740 lexical baseforms. Here’s an extract:
("repress" (vb keep-down) (((t^ i) 0) ((p r e s) 1))) ("repress" (vb press-again) (((t^ ii) 3) ((p r e s) 1))) ("repress" (vbp keep-down) (((t^ i) 0) ((p r e s) 1))) ("repress" (vbp press-again) (((t^ ii) 3) ((p r e s) 1))) ("repressed" (jj keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (jj press-again) (((t^ ii) 3) ((p r e s t) 1))) ("repressed" (vbd keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (vbd press-again) (((t^ ii) 3) ((p r e s t) 1))) ("repressed" (vbn keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (vbn press-again) (((t^ ii) 3) ((p r e s t) 1))) ("represses" (vbz keep-down) (((t^ i) 0) ((p r e s) 1) ((i z) 0))) ("represses" (vbz press-again) (((t^ ii) 3) ((p r e s) 1) ((i z) 0))) ("repressing" (jj keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (jj press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0))) ("repressing" (nn keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (nn press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0))) ("repressing" (vbg keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (vbg press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0)))
To get a better idea of how often this matters in practice, we would need to take a large corpus of text that typifies the type of input text we expect, and count how often one of those 342 words occurs.
To refine that, we should only count the times where its pronunciation would have been incorrect based on POS alone. However, that would be expensive, because we would have to know the correct pronunciation – for example, by manually annotating the text.
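As a rough sketch of that first count (the corpus file, the word list and the crude tokenisation here are hypothetical, just for illustration):

from collections import Counter

# Hypothetical: the 342 sense-marked headwords extracted from unilex-edi.out
sense_marked_words = {"repress", "repressed", "represses", "repressing"}  # ...and so on

counts = Counter()
with open("corpus.txt") as f:                  # hypothetical corpus of typical input text
    for line in f:
        for word in line.lower().split():
            word = word.strip(".,;:!?\"'()")   # crude tokenisation
            if word in sense_marked_words:
                counts[word] += 1

print(sum(counts.values()), "occurrences of sense-marked words")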
ToBI associates accents with words, but in fact intonation events align with syllables. Accents align with a particular syllable in the word (usually one with lexical stress) and their precise timing (earlier or later) can also matter.
In the accent ratio model, the ≤0.05 is saying “no more than 5%” and is a statistical significance test. It is there so that the model only makes predictions in cases where it is confident (e.g., because it has seen enough examples in the training data).
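Just to illustrate the idea (this is my own sketch, not the published model’s exact formulation), using a binomial test on a word’s accented/unaccented counts:

from scipy.stats import binomtest

def confident_accent_ratio(n_accented, n_total):
    # Sketch: only make a prediction when the evidence is significant at the 5% level
    if binomtest(n_accented, n_total, p=0.5).pvalue <= 0.05:
        return n_accented / n_total   # confident: predict using the accent ratio
    return None                       # not enough evidence: make no prediction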
(in future, please can you split each question into a separate post – it makes the forums easier to read and search)
ToBI is a description of the shape of intonation events (i.e., small fragments of F0 contour). We could make a syllable sound more prominent using one of several different shapes of F0 contour; the most obvious choice is a simple rise-fall (H*) but other shapes can also add prominence.
ToBI does also attempt to associate a function with some accent types (e.g., L* for “surprise” in Figure 8.10). But, many people (including me) are sceptical about this functional aspect of ToBI, because there really isn’t a simple mapping between shapes of F0 contours and the underlying meaning.
“IP” means “intonation phrase”, as described in 8.3.1. So an “IP-initial accent” is the first accent in an intonation phrase.
There is a note on the page
http://www.speech.zone/courses/speech-processing/synthesis/front-end/cart/
that says
The videos for this part of the course are incomplete. We’ll cover CART in detail in the lectures. But make sure to watch the video in the related post below.
This material is certainly still part of the syllabus.
In the Entropy: understanding the equation video, I used a few carefully chosen example distributions to help you understand the general formula for entropy.
At 5:30 in the video, you will see that I used this code for sending messages about three values:
Code 1
- green = 0
- blue = 10
- red = 11
but I didn’t go into the precise reason for choosing that code rather than, say
Code 2
- green = 0
- blue = 1
- red = 01
So, let’s clear that detail up now. When we transmit a variable length code, we also have to make it possible for the receiver to know when each item in the message starts and finishes. In other words, for any string of bits, there has to be a single unambiguous decoding of the message.
Consider sending the message “green green blue red”
Using Code 1: 001011
Using Code 2: 00101
At this point, it looks like code 2 is better – it can send the message with fewer bits. But now let’s try to decode them. Using Code 1, the message is unambiguous
001011 = green green blue red, and there is no other possible way to decode it
But using Code 2 we have more than one possible decoding
00101 = green green blue red
00101 = green red red
So, that code is not allowed!
Your code has the same problem:
- EH = 0
- AA = 1
- AO = 01
because the message “EH AA” is coded as 01, and this cannot be decoded unambiguously (it might mean “EH AA” or it might mean “AO”).
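One way to guarantee unambiguous decoding is to use a prefix code, in which no codeword is a prefix of any other. Here is a small sketch of that check:

def is_prefix_free(code):
    # True if no codeword is a prefix of another, so decoding is unambiguous
    words = list(code.values())
    return not any(i != j and b.startswith(a)
                   for i, a in enumerate(words)
                   for j, b in enumerate(words))

is_prefix_free({"green": "0", "blue": "10", "red": "11"})   # True  (Code 1)
is_prefix_free({"green": "0", "blue": "1", "red": "01"})    # False (Code 2)
is_prefix_free({"EH": "0", "AA": "1", "AO": "01"})          # False (your code)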
Yes – the log (base 2) is converting to bits.
Log (base 2) comes up quite often in this context, and related ones. For example, the depth of a binary decision tree is of the order of log (base 2) of the number of leaves.
Jonas is correct.
An astute question!
A pragmatic reason that CARTs are used extensively in Festival is because the toolkit behind Festival (called “Edinburgh Speech Tools”) has a CART-building program (called “Wagon”).
There are of course many alternative, and sometimes better, forms of machine learning, such as Neural Networks. If Festival were rewritten today, it would not use CARTs quite so extensively.
In the Speech Processing course, you should see prosody prediction using CARTs mainly as an example of how CARTs can be applied to both classification and regression problems. Don’t see it as representing the state-of-the-art in prosody prediction for TTS.
I think you are asking about the distribution of labels at a leaf of the tree – is that what you mean?
In general, with real data, we will not get pure leaves (i.e., all data points have a single label). So, we can say that there is always a distribution of labels at every leaf.
The question then becomes: how do we make use of that, when making predictions for unseen test data points? There are two possibilities:
- give the test data point the majority label of the leaf that it reaches
- give the test data point a probabilistic label (i.e., a distribution across all possible label values) of the leaf that it reaches
In the second case, some subsequent process will have to resolve the uncertainty about the label – perhaps by using additional information such as the sequence of labels assigned to preceding and following points (in a sequence).
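For illustration only, given the list of training labels stored at the leaf that a test point reaches, the two options might look like this:

from collections import Counter

def majority_label(leaf_labels):
    # Option 1: the single most frequent label at the leaf
    return Counter(leaf_labels).most_common(1)[0][0]

def label_distribution(leaf_labels):
    # Option 2: a probability distribution over all labels seen at the leaf
    counts = Counter(leaf_labels)
    total = len(leaf_labels)
    return {label: count / total for label, count in counts.items()}

leaf = ["H*", "H*", "L*", "H*", "none"]
majority_label(leaf)        # 'H*'
label_distribution(leaf)    # {'H*': 0.6, 'L*': 0.2, 'none': 0.2}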
The video is a little informal, and you spotted that I didn’t actually specify what stopping criterion I used.
What’s the stopping criterion?
Several different stopping criteria might have prevented me splitting any further at that node. For example:
- the number of data points at this node fell below a fixed threshold
- none of the remaining questions gave a useful split of the data (i.e., none of them could decrease the entropy by much)
- any remaining questions that did actually reduce entropy also resulted in a split in which one of the partitions had a very small number of data points
Can you explain the reasoning behind each of these different stopping criteria? Can you think of any other stopping criteria?
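For concreteness, the checks listed above might be sketched in code like this (the thresholds are invented for illustration, not the ones used in the video or in Wagon):

def should_stop(num_data_points, best_entropy_reduction, smallest_partition_size,
                min_data=50, min_reduction=0.01, min_partition=5):
    # All thresholds here are made up, purely to illustrate the three criteria
    if num_data_points < min_data:                # too few data points at this node
        return True
    if best_entropy_reduction < min_reduction:    # no question reduces entropy much
        return True
    if smallest_partition_size < min_partition:   # best split leaves a tiny partition
        return True
    return False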
How would having a leaf with an evenly split distribution help sort all the unlabelled samples?
The reasoning here is that unbalanced splits mean that the selected question only applies to a small number of data points, and so is less likely to also apply to the test data. Balanced splits result from questions that apply to “around half” of the data points, and so we are much more confident that these questions will generalise to unseen test data points.
Let’s also change your terminology: instead of “sort all the unlabelled samples” you should say “makes accurate predictions of the labels of unseen test samples”.
Do you need the criterion in cases with huge sets of examples, where having no specified stopping point could result in an unmanageable number of splits?
It’s not really a problem to have a deep tree, so the number of splits can’t really become “unmanageable”.
Can you write down a formula that calculates the depth of the tree (average number of questions/splits/nodes between root and leaves), as a function of the number of leaves?