Forum Replies Created
Jonas is correct.
An astute question!
A pragmatic reason that CARTs are used so extensively in Festival is that the toolkit behind Festival (called “Edinburgh Speech Tools”) includes a CART-building program (called “Wagon”).
There are of course many alternative, and sometimes better, forms of machine learning, such as Neural Networks. If Festival was re-written today, it would not use CARTs quite so extensively.
In the Speech Processing course, you should see prosody prediction using CARTs mainly as an example of how CARTs can be applied to both classification and regression problems. Don’t see it as representing the state-of-the-art in prosody prediction for TTS.
I think you are asking about the distribution of labels at a leaf of the tree – is that what you mean?
In general, with real data, we will not get pure leaves (i.e., all data points have a single label). So, we can say that there is always a distribution of labels at every leaf.
The question then becomes: how do we make use of that, when making predictions for unseen test data points? There are two possibilities:
- give the test data point the majority label of the leaf that it reaches
- give the test data point a probabilistic label, i.e., the distribution across all possible label values found at the leaf that it reaches
In the second case, some subsequent process will have to resolve the uncertainty about the label – perhaps by using additional information such as the sequence of labels assigned to preceding and following points (in a sequence).
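To make those two options concrete, here is a minimal Python sketch; the function name and the example labels are my own, invented for illustration:

```python
from collections import Counter

def leaf_predictions(leaf_labels):
    """Given the training labels that ended up at one leaf,
    return both kinds of prediction described above."""
    counts = Counter(leaf_labels)  # e.g. {'H*': 7, 'L*': 3}
    total = sum(counts.values())

    majority_label = counts.most_common(1)[0][0]  # hard decision
    distribution = {label: n / total for label, n in counts.items()}  # soft decision

    return majority_label, distribution

# Example: a leaf containing 7 'H*' and 3 'L*' training points
hard, soft = leaf_predictions(['H*'] * 7 + ['L*'] * 3)
print(hard)  # H*
print(soft)  # {'H*': 0.7, 'L*': 0.3}
```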
The video is a little informal, and you spotted that I didn’t actually specify what stopping criterion I used.
What’s the stopping criterion?
Several different stopping criteria might have prevented me splitting any further at that node. For example:
- the number of data points at this node fell below a fixed threshold
- none of the remaining questions gave a useful split of the data (i.e., none of them could decrease the entropy by much)
- any remaining questions that did actually reduce entropy also resulted in a split in which one of the partitions had a very small number of data points
Can you explain the reasoning behind each of these different stopping criteria? Can you think of any other stopping criteria?
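For concreteness, here is a hedged Python sketch of how the three criteria above might be checked during tree building. The threshold values (min_points, min_gain, min_partition) are arbitrary placeholders, not values used by Wagon:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def should_stop(labels, best_split, min_points=10, min_gain=0.01, min_partition=3):
    """Return True if any stopping criterion is met.
    best_split is (left_labels, right_labels) for the best remaining
    question, or None if no question splits the data at all."""
    if len(labels) < min_points:        # too few data points at this node
        return True
    if best_split is None:              # no question splits the data
        return True
    left, right = best_split
    n = len(labels)
    gain = entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
    if gain < min_gain:                 # no question reduces entropy by much
        return True
    if min(len(left), len(right)) < min_partition:  # one partition nearly empty
        return True
    return False
```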
How would having a leaf with an evenly split distribution help sort all the unlabelled samples?
The reasoning here is that unbalanced splits mean that the selected question only applies to a small number of data points, and so it is less likely to also apply to the test data. Balanced splits result from questions that apply to “around half” of the data points, and so we can be much more confident that these questions will generalise to unseen test data points.
Let’s also change your terminology: instead of “sort all the unlabelled samples” you should say “makes accurate predictions of the labels of unseen test samples”.
Do you need the criterion in cases with huge sets of examples, where having no specified stopping point could result in an unmanageable number of splits?
It’s not really a problem to have a deep tree, so the number of splits can’t really become “unmanageable”.
Can you write down a formula that calculates the depth of the tree (average number of questions/splits/nodes between root and leaves), as a function of the number of leaves?
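As a hint, here is a sketch assuming a perfectly balanced binary tree with $L$ leaves and depth $d$:

$$L = 2^{d} \quad\Leftrightarrow\quad d = \log_2 L$$

So even a million leaves implies a depth of only about 20 questions between root and leaf, which is one reason why a deep tree is rarely a practical problem.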
Features must be specified by the system designer, using knowledge of the problem. If in doubt, every possible feature that the designer can think of is extracted – the tree building algorithm will select the most useful ones.
Questions about features can be hand-designed, but can often be automatically created from the features (e.g., enumerate all possible yes/no questions about every possible value of every feature).
At each iteration of tree building (i.e., splitting a node), all available questions are tried, and the one that most reduces the entropy (in the case of a classification task) is placed into the tree at that node.
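A minimal Python sketch of that greedy selection step; it reuses the entropy() helper from the stopping-criterion sketch above, and the representation of questions as boolean functions is my own simplification:

```python
def best_question(data, labels, questions):
    """Try every candidate yes/no question and return the one giving
    the largest reduction in entropy (information gain).
    Each question is a function mapping a data point to True/False."""
    parent = entropy(labels)
    best, best_gain = None, 0.0
    for q in questions:
        yes = [l for x, l in zip(data, labels) if q(x)]
        no  = [l for x, l in zip(data, labels) if not q(x)]
        if not yes or not no:
            continue  # this question does not split the data
        n = len(labels)
        gain = parent - (len(yes) / n) * entropy(yes) \
                      - (len(no) / n) * entropy(no)
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain

# Candidate questions can be enumerated automatically from the features, e.g.
# questions = [lambda x, f=f, v=v: x[f] == v for f in features for v in values[f]]
```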
ToBI is symbolic – all prosodic events (accents or boundary tones) are categorised into a set of classes (the most common accent being H*, which is a rise-fall shape).
Tilt is parametric and represents intonation events as a small set of continuous parameters that describe the shape.
ToBI is suitable for hand-labelling prosody (although this is a highly skilled task). Tilt is not designed for hand-labelling: it is for converting the F0 contours of events (which might themselves be found manually or automatically) into numbers that we can then model.
The material on predicting prosody shows how prediction can be done in several stages. We might use ToBI for the categories in the first stages (placement of events, classification of their types) and perhaps then use Tilt as the parametric representation of shape for realisation as a fragment of F0 contour.
It’s important to understand that prosody is very much an unsolved problem in TTS, and so everything you read on prosody should be seen in that light. There is no single method that “just works” and each system will probably take a different approach.
Try searching for ToBI in the forums to find related topics.
It’s also worth noting that certain users need exceptionally high speaking rates. Blind computer programmers are one example.
Some people may still use formant synthesis, depending on personal preference.
In this topic I noted that we generally avoid performing such signal modifications in unit selection speech synthesis because they degrade the quality. However, speeding up a synthetic voice is sometimes necessary (e.g., some blind users require it), and the paper referenced in the above thread compares several ways to do that.
In older diphone synthesis, it was necessary to manipulate duration and F0 independently, and there are several ways to do that. You can try TD-PSOLA for yourself in Praat.
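If you would rather script the experiment than use the Praat GUI, the parselmouth library exposes Praat’s commands from Python. A hedged sketch of F0 modification via overlap-add resynthesis follows; the filename and scaling factor are placeholders, and duration can be manipulated analogously via a duration tier:

```python
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("diphone.wav")  # placeholder filename

# Build a Manipulation object (time step 0.01 s, pitch range 75-600 Hz)
manipulation = call(sound, "To Manipulation", 0.01, 75, 600)

# Scale F0 by a factor of 1.5 without changing duration
pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Multiply frequencies", sound.xmin, sound.xmax, 1.5)
call([pitch_tier, manipulation], "Replace pitch tier")

# Resynthesise using overlap-add (TD-PSOLA)
modified = call(manipulation, "Get resynthesis (overlap-add)")
modified.save("diphone_f0_scaled.wav", "WAV")
```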
In unit selection (i.e., concatenative) speech synthesis, we generally avoid making any modifications to the recorded speech, because they will introduce artefacts and so degrade naturalness. It’s much easier to vary speaking rate in statistical parametric synthesis.
To read our recent research in this area, try
Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, and Junichi Yamagishi. “Intelligibility of time-compressed synthetic speech: Compression method and speaking style” in Speech Communication, Volume 74, Nov 2015, pp 52–64. DOI: http://dx.doi.org/10.1016/j.specom.2015.09.002
One finding is that linear time compression of normal speech is the best strategy. This is not what happens in natural speech though. As you point out, natural fast speech does indeed involve more and more deletions, but these seem to harm intelligibility and so we must conclude they are done to benefit the speaker (less effort) and not the listener.
Yes, there is – it’s described here, as part of the coursework for the Speech Synthesis course.
That will tell you the utterance number, and then you can look that up in the list of ARCTIC sentences that was used as the recording script for the cstr_edi_awb_arctic_multisyn voice. To listen to the original source sentence, find the appropriate wav file here.
It’s a common misconception that a language with fixed inventory of phonological units (whether Consonant-Vowel units, syllables, or whatever) can be perfectly synthesised from an inventory containing a single recording of each such unit.
All languages have a fixed inventory of phonemes (it’s not possible to invent new ones!) and also of syllables (due to phonotactic constraints), or whatever the equivalent is in that language (e.g., the mora in Japanese).
The key point is that the acoustic realisation of each unit is influenced by its context, and so having multiple recordings of each (from many different contexts) will give better results.
Tone languages still have intonation. Tone by itself does not entirely determine F0. Tone is typically realised as the shapes of an F0 contour, not absolute values. In tone languages, F0 is carrying both segmental and supra-segmental information.
Anything that is not in the pronunciation dictionary will have to be dealt with by the Letter-To-Sound (LTS) model.
[I’m merging this into “Jurafsky & Martin – Chapter 8” where a related question about pronunciation of names has been asked previously]
Nothing will go wrong when using any number of bits.
However, the choices of 8 and 16 are the most convenient because of the way computers store numbers. 8 bits is one byte and corresponds to a char in software (in C family languages). 16 bits corresponds to either an int or a short. Using 9 bits, for example, would be very inconvenient when writing software, since there is no built-in type that is of exactly that size.
Deep down in the operating system (in fact, in the hardware), everything is stored with a fixed number of bits. In modern operating systems, this is now usually 64 bits (older computers used 32 bits). The operating system can very neatly pack 8 or 16 bit numbers into memory. It would be messy to pack 9 bit numbers into memory, and also wasteful since we couldn’t use all 64 bits.
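A small Python illustration of why whole-byte sizes are convenient: the standard struct module can pack 8-bit and 16-bit values directly, but offers nothing for 9 bits (the example values are arbitrary):

```python
import struct

# 8-bit and 16-bit samples pack exactly into whole bytes:
one_byte  = struct.pack('<b', -100)    # 1 byte  (signed 8-bit)
two_bytes = struct.pack('<h', -30000)  # 2 bytes (signed 16-bit)
print(len(one_byte), len(two_bytes))   # 1 2

# There is no 9-bit format: struct offers b/h/i (8/16/32 bits) but nothing
# in between, so 9-bit samples would need manual bit-shifting to pack.
```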
[please ask one question per post – see the topic bit depth for your second question]
- dB is always 10 log10(ratio), because the 10 is converting from Bels to decibels.
- The 20 comes from 10 log10(ratio^2), where the raising to the power of 2 is converting magnitude to power.
- Remember that log(x · x) = log(x) + log(x) = 2 log(x).
- The key point to remember is that dB is a logarithmic scale, which is generally much more useful, when plotting a spectrum for example.
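As a worked example: doubling a signal’s amplitude quadruples its power, so the level rises by

$$20\log_{10}(2) = 10\log_{10}(2^2) \approx 6\ \mathrm{dB}$$

whereas doubling the power alone gives only $10\log_{10}(2) \approx 3$ dB.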
It was a typo on the page – the video really is just 42 seconds long; nothing is missing.