Forum Replies Created
Yes, a pulse train is an approximation to the sound produced by vocal fold vibration.
Although it might not seem like a particularly good approximation, it is simple and mathematically convenient. The principal difference between a pulse train and the true signal is that the pulse train has a flat spectral envelope.
That’s not a problem though: we can include the modelling of the actual spectral envelope of the vocal fold signal in the vocal tract filter.
In other respects, the pulse train has the correct properties: specifically, it has energy at every multiple of F0 (a “comb-like” or “line” spectrum).
So, we can say that a source-filter model is really a model of the signal, and not a literal model of the physics of speech production.
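Here is a quick sketch you can run to see this for yourself (not part of the course materials; the sampling rate, F0 and duration are just values chosen for illustration):

# A minimal sketch showing that a pulse train has energy at every multiple of F0,
# with a flat spectral envelope (fs, f0 and duration are assumptions for this example).
import numpy as np

fs = 16000                      # sampling rate in Hz
f0 = 200                        # fundamental frequency in Hz
x = np.zeros(fs)                # 1 second of signal
x[::int(fs / f0)] = 1.0         # one impulse at the start of every pitch period (T0 samples)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
print(freqs[spectrum > 0.5 * spectrum.max()][:5])   # approximately 0, 200, 400, 600, 800 Hz
# every peak has the same height: a flat spectral envelope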
Let’s follow your working:
I label the predictors as their Parts-Of-Speech
Correct – but you should say that you annotate the training data samples with values for each of the three predictors. In this example, we are using the POS of the preceding, current and following word as the predictors 1, 2 and 3 respectively.
use the question Is the label after “BREAK” PUNC?
Let’s word that more carefully. Questions must be about predictors of the current data point. So you should say:
Ask the question Is predictor 3 = PUNC?
Now partition the data accordingly.
everything which is punctuation comes after a BREAK, and everything which isn’t punctuation is a conjunction
This is where you’ve made the mistake. For question Is predictor 3 = PUNC?, 8 data points have the answer “Yes” and all of them have the value “NO-BREAK” for the predictee, which indeed is a distribution with zero entropy. So far, so good.
Now look at the 26 data points for which the answer to Is predictor 3 = PUNC? was “No”. The distribution of predictee values is 4 BREAKs and 22 NO-BREAKs. That distribution does not have zero entropy.
Your reasoning about “everything which isn’t punctuation is a conjunction” is incorrect. You are looking at the distribution of values of a predictor. When measuring entropy, we look only at the value of the predictee. That is the thing we are trying to predict. The reduction in entropy measures how much more predictable the predictee has become after a particular split of the data.
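If you want to check the numbers, here is a minimal sketch using the counts from this example:

# Entropy of the predictee distribution on each side of the split "Is predictor 3 = PUNC?"
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

yes_branch = [8]        # 8 x NO-BREAK              -> zero entropy
no_branch = [4, 22]     # 4 x BREAK, 22 x NO-BREAK  -> non-zero entropy

print(entropy(yes_branch))   # 0.0 bits
print(entropy(no_branch))    # about 0.62 bits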
I’ve tried in the past setting questions as homework or lecture preparation, but only a few students took part. We can try again, perhaps in the speech recognition part of the Speech Processing course.
Pitch marking means finding the instants of glottal closure. Pitch marks are moments in time. The interval of time between two pitch marks is the pitch period, denoted as T0, which of course is equal to 1 / F0.
You might think that pitch marking would be the best way to find F0. However, it’s actually not, because pitch marking is hard to do accurately and will give a lot of local error in the estimate for F0.
Pitch marking is useful for signal processing, such as TD-PSOLA.
Pitch tracking is a procedure to find the value of F0, as it varies over time.
Pitch tracking is done over longer windows (i.e., multiple pitch periods) to get better accuracy, and can take advantage of the continuity of F0 in order to get a more robust and error-free estimate of its value.
Pitch tracking is useful for visualising F0, analysing intonation, and building models of it.
Exactly how pitch marking and pitch tracking work is beyond the scope of the Speech Processing course, but is covered in the more advanced Speech Synthesis course.
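Purely for illustration (this is beyond what you need for Speech Processing), a very crude autocorrelation-based estimate of F0 for a single analysis window might look like the sketch below. Real pitch trackers are far more sophisticated and also exploit the continuity of F0 across frames; the sampling rate and test signal here are made up.

# Crude single-frame F0 estimate by autocorrelation (illustration only).
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation, lags >= 0
    lo, hi = int(fs / f0_max), int(fs / f0_min)                     # plausible range of T0 in samples
    lag = lo + np.argmax(ac[lo:hi])                                 # lag of the strongest peak = estimated T0
    return fs / lag                                                 # F0 = 1 / T0

# usage: a synthetic 220 Hz "voiced" frame of 40 ms (i.e., several pitch periods)
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
print(estimate_f0(frame, fs))                                       # close to 220 Hz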
In the video, the question Does letter n=”r”? is just one of many possible questions that we try when splitting the root node. The question that reduces entropy more than any other is placed in the tree, and the data are permanently partitioned down its “Yes” and “No” branches.
Then we recurse. That means that we simply apply precisely the same procedure separately to each of the two child nodes that we have just created; then we do the same for their child nodes, and so on until we decide to stop.
Why is entropy used to choose the best question?
Think about the goal of the tree: it is to make predictions. In other words, we want to partition the data in a way that makes the value of the predictee less random (i.e., less unpredictable) and more predictable.
Entropy is a measure of how unpredictable a random variable is. The random variable here is the predictee. We partition the data in the way that makes the distribution of the predictee as non-uniform as possible.
Ideally, we want all data points within each partition to have the same value for the predictee. That would mean zero entropy.
If we can’t achieve that, then we choose the split that has the lowest possible entropy.
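To make that concrete, here is a rough sketch of the greedy selection step. The data and question representations are just assumptions for illustration, not Festival’s actual implementation.

# Try every candidate question, compute the weighted entropy of the resulting
# partition of the data, and keep the question that gives the lowest value.
import math

def entropy(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / len(values)) * math.log2(c / len(values)) for c in counts.values())

def best_question(data, questions):
    # data: list of (predictors, predictee); questions: list of (predictor_index, value)
    best, best_entropy = None, float("inf")
    for index, value in questions:
        yes = [d for d in data if d[0][index] == value]
        no = [d for d in data if d[0][index] != value]
        if not yes or not no:
            continue
        w = len(yes) / len(data)
        total = w * entropy([d[1] for d in yes]) + (1 - w) * entropy([d[1] for d in no])
        if total < best_entropy:
            best, best_entropy = (index, value), total
    return best, best_entropy

A full implementation would then partition the data using the chosen question and call the same procedure recursively on each child node, stopping according to some criterion (for example, a minimum number of data points).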
A weighted sum gives a weight (or “importance”) to each of the items being added together. Items with larger weights have more effect on the result, and vice versa.
In the CART training algorithm, a weighted sum is used to compute the total entropy of a possible partition of the data. The weighting is needed to correct for the fact that each side of the partition (the “Yes” and “No” branches) might have differing numbers of data points, and to make the result comparable to the entropy at the parent node. We set the weights in the weighted sum to reflect the fraction of data points on each side.
Imagine this example:
We have 1000 data points at a particular node in the tree, and the entropy here is 3.4 bits.
We try a question, and the result is that 500 data points go down the “No” branch and 500 data points go down the “Yes” branch.
This question turns out to be pretty useless, because the distribution of predictee values in each branch remains about the same as at the parent node. So, the entropy in each side is going to be about 3.4 bits.
Simply adding these two values (an unweighted sum) would give the wrong answer of 6.8 bits. We need to do a weighted sum:
(0.5 x 3.4) + (0.5 x 3.4) = 3.4 bits
The same argument holds whatever the entropy of the two branches, and whatever proportion of data points goes down each branch.
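The same worked example in code, using the made-up numbers above:

# Weighted sum of branch entropies, weighted by the fraction of data points in each branch.
n_yes, n_no = 500, 500      # data points down each branch
h_yes, h_no = 3.4, 3.4      # entropy (in bits) of the predictee in each branch

total = n_yes + n_no
print((n_yes / total) * h_yes + (n_no / total) * h_no)   # 3.4 bits: no better than the parent node
print(h_yes + h_no)                                      # 6.8 bits: not comparable to the parent's entropy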
This is a little beyond the Speech Processing course, but is covered fully in the more advanced Speech Synthesis course.
The short answer is that most statistical parametric speech synthesisers (whether HMM or DNN) use a source-filter model to generate the waveform. The HMM or DNN predicts the parameters of the source (e.g., F0) and of the filter (e.g., its frequency response).
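Purely as an illustration of that last step (the filter coefficients and frame settings below are made up, and a real vocoder predicts a much richer parameterisation), generating one frame might be sketched like this:

# Reduced sketch of source-filter waveform generation for one frame:
# excite a per-frame all-pole filter with a pulse train (voiced) or noise (unvoiced).
import numpy as np
from scipy.signal import lfilter

fs = 16000
frame_len = 400                            # 25 ms frames (an assumption for the sketch)

def synthesise_frame(f0, a):
    # a: all-pole (denominator) filter coefficients, with a[0] == 1
    if f0 > 0:                             # voiced: impulse train at the predicted F0
        source = np.zeros(frame_len)
        source[::int(fs / f0)] = 1.0
    else:                                  # unvoiced: white noise
        source = 0.1 * np.random.randn(frame_len)
    return lfilter([1.0], a, source)       # vocal tract modelled as an all-pole filter

frame = synthesise_frame(120.0, a=[1.0, -0.9])   # made-up filter: a gentle low-pass
print(frame.shape)                               # (400,)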
Siyu, your understanding is exactly right: CART partitions the feature (predictor) space in a binary fashion. Each node lower down the tree subdivides the partition created by the answer (“Yes” or “No”) of its parent node.
See Figure 1 on this page for a regression tree example.
Enno correctly points out that you can transform the features in any way you wish. But, it’s important to recognise that this would be “feature engineering” and would be done before starting to build the CART. The CART training algorithm can only select amongst the questions provided about the predictors; it cannot invent new predictors, or new questions, or learn feature transforms.
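If you want to experiment, scikit-learn’s decision trees are CART-style and make the binary partitioning easy to see. The data below are made up (this is not the tree from Figure 1), and note that regression trees split on a variance-based criterion rather than entropy, although the partitioning of the predictor space works in the same way.

# Axis-aligned binary partitioning of a two-dimensional predictor space.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                                   # two predictors
y = np.where(X[:, 0] > 5, 2.0, 0.5) + 0.1 * rng.standard_normal(200)    # predictee

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["predictor_1", "predictor_2"]))
# Each internal node asks a yes/no question about one predictor, and each child
# node subdivides the partition created by its parent.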
You could just pick one at random. Or, you might have some secondary criterion such as choosing the one with the most balanced split (i.e., the one closest to a 50/50 split) on the grounds that small partitions are a bad thing.
Why is a small partition a bad thing? What consequences might it have when we try to split it?
On the system here in Edinburgh, you can see a dictionary that marks word sense at
/Volumes/ss/festival/festival_mac/festival/lib/dicts/unilex/unilex-edi.out
You can count the number of entries that include a word sense – it’s very small: 342 out of 116740 lexical baseforms. Here’s an extract:
("repress" (vb keep-down) (((t^ i) 0) ((p r e s) 1))) ("repress" (vb press-again) (((t^ ii) 3) ((p r e s) 1))) ("repress" (vbp keep-down) (((t^ i) 0) ((p r e s) 1))) ("repress" (vbp press-again) (((t^ ii) 3) ((p r e s) 1))) ("repressed" (jj keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (jj press-again) (((t^ ii) 3) ((p r e s t) 1))) ("repressed" (vbd keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (vbd press-again) (((t^ ii) 3) ((p r e s t) 1))) ("repressed" (vbn keep-down) (((t^ i) 0) ((p r e s t) 1))) ("repressed" (vbn press-again) (((t^ ii) 3) ((p r e s t) 1))) ("represses" (vbz keep-down) (((t^ i) 0) ((p r e s) 1) ((i z) 0))) ("represses" (vbz press-again) (((t^ ii) 3) ((p r e s) 1) ((i z) 0))) ("repressing" (jj keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (jj press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0))) ("repressing" (nn keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (nn press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0))) ("repressing" (vbg keep-down) (((t^ i) 0) ((p r e s) 1) ((i n) 0))) ("repressing" (vbg press-again) (((t^ ii) 3) ((p r e s) 1) ((i n) 0)))
To get a better idea of how often this matters in practice, we would need to take a large corpus of text that typifies the type of input text we expect, and count how often one of those 342 words occurs.
To refine that, we should only count the times where its pronunciation would have been incorrect based on POS alone. However, that would be expensive, because we would have to know the correct pronunciation – for example, by manually annotating the text.
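If you have access to the file, a rough count can be scripted. This sketch assumes one entry per line in the format shown above, with the word-sense tag as the second item inside the part-of-speech field:

# Rough count of Unilex entries whose POS field carries a word-sense tag,
# e.g. ("repress" (vb keep-down) ...). The format is assumed from the extract above.
import re

has_sense = re.compile(r'^\("\S+"\s+\([^()\s]+\s+[^()\s]+\)')

total = with_sense = 0
with open("unilex-edi.out", encoding="utf-8", errors="replace") as f:   # use the full path given above
    for line in f:
        if not line.startswith('("'):
            continue
        total += 1
        if has_sense.match(line):
            with_sense += 1

print(with_sense, "of", total, "entries carry a word sense")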
ToBI associates accents with words, but in fact intonation events align with syllables. Accents align with a particular syllable in the word (usually one with lexical stress) and their precise timing (earlier or later) can also matter.
In the accent ratio model, the ≤0.05 is saying “no more than 5%” and is a statistical significance test. It is there so that the model only makes predictions in cases where it is confident (e.g., because it has seen enough examples in the training data).
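As a sketch of the general idea (not necessarily the exact formulation used in the reading), you could use a binomial test to decide whether a word has been seen accented, or not accented, consistently enough to trust its ratio. The counts below are made up.

# Only trust a word's accent ratio (k accented out of n occurrences in the training
# data) if a binomial test against chance (p = 0.5) is significant at the 0.05 level.
from scipy.stats import binomtest

def accent_ratio(k, n, default=None):
    if n == 0 or binomtest(k, n, p=0.5).pvalue > 0.05:
        return default       # not confident enough: fall back to some other predictor
    return k / n             # confident: use the ratio observed in the training data

print(accent_ratio(2, 40))   # rarely accented, plenty of evidence -> 0.05
print(accent_ratio(3, 5))    # too little evidence -> None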
(in future, please can you split each question into a separate post – it makes the forums easier to read and search)
ToBI is a description of the shape of intonation events (i.e., small fragments of F0 contour). We could make a syllable sound more prominent using one of several different shapes of F0 contour; the most obvious choice is a simple rise-fall (H*) but other shapes can also add prominence.
ToBI does also attempt to associate a function with some accent types (e.g., L* for “surprise” in Figure 8.10). But, many people (including me) are sceptical about this functional aspect of ToBI, because there really isn’t a simple mapping between shapes of F0 contours and the underlying meaning.
“IP” means “intonation phrase”, as described in 8.3.1. So an “IP-initial accent” is the first accent in an intonation phrase.
There is a note on the page
http://www.speech.zone/courses/speech-processing/synthesis/front-end/cart/
that says
The videos for this part of the course are incomplete. We’ll cover CART in detail in the lectures. But make sure to watch the video in the related post below.
This material is certainly still part of the syllabus.
In the Entropy: understanding the equation video, I used a few carefully chosen example distributions to help you understand the general formula for entropy.
At 5:30 in the video, you will see that I used this code for sending messages about three values:
Code 1
- green = 0
- blue = 10
- red = 11
but I didn’t go into the precise reason for choosing that code rather than, say
Code 2
- green = 0
- blue = 1
- red = 01
So, let’s clear that detail up now. When we transmit a variable length code, we also have to make it possible for the receiver to know when each item in the message starts and finishes. In other words, for any string of bits, there has to be a single unambiguous decoding of the message.
Consider sending the message “green green blue red”
Using Code 1: 001011
Using Code 2: 00101
At this point, it looks like Code 2 is better – it can send the message with fewer bits. But now let’s try to decode them. Using Code 1, the message is unambiguous:
001011 = green green blue red, and there are no other possible ways to decode it.
But using Code 2 we have more than one possible decoding
00101 = green green blue red
00101 = green red green blue
So, that code is not allowed!
Your code has the same problem:
- EH = 0
- AA = 1
- AO = 01
because the message “EH AA” is coded as 01, and this cannot be decoded unambiguously (it might mean “EH AA” or it might mean “AO”).
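You can check this property directly: a code is instantaneously decodable only if no codeword is a prefix of another codeword. A small sketch:

# The prefix condition: if one codeword is a prefix of another, some bit strings
# have more than one possible decoding.
def is_prefix_free(code):
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

code_1 = {"green": "0", "blue": "10", "red": "11"}
code_2 = {"green": "0", "blue": "1", "red": "01"}
your_code = {"EH": "0", "AA": "1", "AO": "01"}

print(is_prefix_free(code_1))     # True  - every message decodes unambiguously
print(is_prefix_free(code_2))     # False - "0" is a prefix of "01"
print(is_prefix_free(your_code))  # False - "0" is a prefix of "01"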
Yes – the log (base 2) is converting to bits.
Log (base 2) comes up quite often in this context, and related ones. For example, the depth of a binary decision tree is of the order of log (base 2) of the number of leaves.
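Two small illustrations (the probabilities in the first are an assumption chosen to make the numbers come out tidily, not necessarily the distribution used in the video):

import math

# Entropy in bits of a distribution over three colours with probabilities 0.5, 0.25, 0.25
p = [0.5, 0.25, 0.25]
print(-sum(pi * math.log2(pi) for pi in p))   # 1.5 bits, which Code 1 achieves on average

# Minimum depth of a binary decision tree with a given number of leaves
leaves = 1000
print(math.ceil(math.log2(leaves)))           # 10 yes/no questions are enough for 1000 leaves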