Forum Replies Created
This algorithm is for preparing the training set for a letter-to-sound model (e.g., a classification tree). The end result of the algorithm is a single alignment between letters and phonemes, for each word in the training set (i.e., a pre-existing pronunciation dictionary).
It’s important to realise that, across the whole training set, a particular letter (e.g., “c”) might align with different phonemes (sometimes /k/, sometimes /ch/, etc.) in different words. It won’t necessarily align with the same phoneme every time.
So, how do we get to that single alignment? We use a simple unigram model of the probability of each letter aligning with each phoneme. Most of the probabilities in this model will be zero, and the only non-zero probabilities are for those letter-phoneme pairs given in the allowables list.
The key machine learning concept to understand in this algorithm is that of first initialising this unigram model and then iteratively improving the model.
To initialise, and then to improve the model, we need an alignment for all words in the training set, so that we can count how many times each phoneme aligns with each letter. The allowables lists are used to find the first alignment. The model is then updated, and then this improved model is used to find a better alignment.
If the allowables list for a particular letter only contained a single phoneme, then that letter would always have to align with that phoneme. But in general, the allowables lists will have many phonemes for each letter.
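Here is a minimal Python sketch of that initialise-then-iterate loop. The names are my own illustrative inventions, not the actual implementation, and the align() helper is an assumption: it should return the best alignment for one word that the allowables permit (e.g., found by dynamic programming), falling back to any allowable alignment when probs is None.

```python
from collections import defaultdict

def estimate_probs(alignments):
    """Unigram model: P(phoneme | letter), counted over all aligned pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for word_alignment in alignments:
        for letter, phoneme in word_alignment:
            counts[letter][phoneme] += 1
    return {letter: {ph: c / sum(phs.values()) for ph, c in phs.items()}
            for letter, phs in counts.items()}

def train(dictionary, align, iterations=5):
    """dictionary: list of (spelling, phoneme sequence) pairs."""
    # first alignment: no model yet, so the allowables alone constrain it
    alignments = [align(w, p, probs=None) for w, p in dictionary]
    for _ in range(iterations):
        probs = estimate_probs(alignments)   # update the model ...
        alignments = [align(w, p, probs)     # ... then re-align with it
                      for w, p in dictionary]
    return alignments, probs
```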
You are correct – we measure the entropy of each half of the split, and sum these values (weighted by occupancy) to get the total entropy value for that question.
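In code, scoring one yes/no question might look like this minimal sketch (the function names and the example counts are illustrative, not from any particular implementation):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as {label: count}."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def split_score(yes_counts, no_counts):
    """Entropy of each half of the split, weighted by occupancy, summed."""
    n_yes = sum(yes_counts.values())
    n_no = sum(no_counts.values())
    n = n_yes + n_no
    return (n_yes / n) * entropy(yes_counts) + (n_no / n) * entropy(no_counts)

# e.g. a question splitting 10 items into a pure half and a mixed half
print(split_score({"k": 4}, {"k": 1, "ch": 5}))  # lower is better
```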
This is well beyond the scope of what we need to know about POS tagging for speech synthesis! We just need to know that
– POS tagging is very accurate for languages where a large corpus of hand-tagged data is available to train the tagger
– a typical method is HMMs + n-grams, with the model being trained on that corpus

The numbers 0, 1, 2 appended to phoneme names refer to lexical stress.
0 = unstressed
1 = primary stress
2 = secondary stress

For phoneme sequences predicted by letter-to-sound, sometimes only 0 and 1 are used, so you might find more than one syllable with stress of “1”.
Syllables are indicated by the bracketing: in the example above, the first syllable is (k ae t) and it has a stress of “1”.
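As a purely hypothetical illustration, that bracketed structure could be rendered in Python like this (the word and its phonemes are my own example, not from the post above):

```python
# Each syllable is (list of phones, stress), mirroring the bracketed
# form ((k ae t) 1). "catfish" is an invented illustrative entry.
pronunciation = [
    (["k", "ae", "t"], 1),    # first syllable, primary stress
    (["f", "ih", "sh"], 0),   # second syllable, unstressed
]
for phones, stress in pronunciation:
    print(" ".join(phones), "-> stress", stress)
```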
Festival has its own algorithm for syllabifying phoneme sequences that have been predicted by the letter-to-sound module. This algorithm does not follow the “maximum onset” principle, and exactly what the method does is a bit obscure to me (I didn’t write that part of Festival).
The method for old diphones voices is different to that for newer unit selection voices (such as the one you will be using for the assessed practical).
A tube that is closed at both ends will have resonances at all (both odd and even) multiples of the lowest resonant frequency. The sound wave travels as follows:
- starts at the closed end (vocal folds)
- travels to the other closed end
- reflects back
- travels back to the first closed end
- meets the next pulse from the vocal folds, is in phase with it, and so gets bigger
The wave has to travel 2 times the length of the tube before meeting the next pulse, in order to be “in step” with it.
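To attach some numbers, here is a minimal Python sketch; the tube length and speed of sound are illustrative values of my own choosing, not from the original post:

```python
# Tube closed at both ends: the wave travels 2L per round trip, so the
# lowest resonance is c / (2L), with resonances at all integer multiples.
SPEED_OF_SOUND = 350.0   # m/s, approximate value in warm, moist air
TUBE_LENGTH = 0.175      # m, a typical adult vocal tract length (assumed)

f1 = SPEED_OF_SOUND / (2 * TUBE_LENGTH)   # 1000 Hz with these values
for n in range(1, 4):
    print(f"resonance {n}: {n * f1:.0f} Hz")   # 1000, 2000, 3000 Hz
```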
For a tube that is closed at one end and open at the other, it’s a little different. The sound wave reflects from the closed end, just as in the other case, but when a wave is reflected from the open end, it is inverted. Therefore, the wave needs to travel as follows before it will be “in step” (i.e., in phase) with the next wave:
- starts at the closed end (vocal folds)
- travels to the open end
- reflects back, but gets inverted in the process
- inverted wave travels to closed end
- is reflected and remains inverted
- travels to open end
- reflects back, and is inverted again (so is back to ‘normal’)
- travels back to closed end
- meets next pulse from vocal folds, is in phase with it, and so gets bigger
You see that the wave needs to traverse 4 times the length of the tube this time.
The higher frequency resonances of these tubes occur by putting pulses in more frequently. Try to work that out using the same reasoning as above.
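If you follow that reasoning through, you should find that the closed-open tube resonates only at odd multiples of its lowest resonance. Here is a companion sketch to the one above, with the same assumed values:

```python
# Tube closed at one end, open at the other: the wave travels 4L before
# it is back in phase, so the lowest resonance is c / (4L), and only
# odd multiples (1, 3, 5, ...) resonate.
SPEED_OF_SOUND = 350.0   # m/s
TUBE_LENGTH = 0.175      # m (assumed vocal tract length)

f1 = SPEED_OF_SOUND / (4 * TUBE_LENGTH)   # 500 Hz with these values
for k in range(3):
    n = 2 * k + 1                          # odd multiples only
    print(f"resonance {k + 1}: {n * f1:.0f} Hz")   # 500, 1500, 2500 Hz
```

Notice that these values are close to the classic formant frequencies of a neutral vowel.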
In principle, yes, this is possible in Festival. But it would require advanced Scheme programming skills and a deep understanding of how Festival is implemented.
Instead, try using Praat to modify F0 (and optionally also duration) of either a natural sentence, or a synthetic waveform saved from Festival. What do you need to modify to move the perceived stress from one syllable to another?
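If you would rather script this than use the Praat GUI, here is a sketch using parselmouth, a Python interface to Praat (my suggestion, not part of the original post; the file name, time range, and scaling factor are placeholders):

```python
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("utterance.wav")   # placeholder file name
# time step 0.01 s, pitch floor 75 Hz, pitch ceiling 600 Hz
manipulation = call(sound, "To Manipulation", 0.01, 75, 600)

pitch_tier = call(manipulation, "Extract pitch tier")
# raise F0 by 30% between 0.5 s and 0.8 s (the syllable to "stress");
# durations could be manipulated in a similar way via the duration tier
call(pitch_tier, "Multiply frequencies", 0.5, 0.8, 1.3)

call([pitch_tier, manipulation], "Replace pitch tier")
modified = call(manipulation, "Get resynthesis (overlap-add)")
modified.save("utterance_stressed.wav", "WAV")
```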
Radiation is what happens at the lips as sound waves within the vocal tract are propagated out into the free air.
A detailed understanding of the physics is a little beyond our needs, but here’s a simple way to understand what is happening at the lips:
The sound pressure variations inside the vocal tract are due to waves propagating up and down the tube and being reflected back at both the ends. The air within the vocal tract is approximately, on average, stationary (forget about the flow caused by breathing – it’s very slow compared to the speed of sound).
The radiation effect is what happens when this trapped “piston” of air in the vocal tract causes the air in the free field outside the lips to move, creating sound waves that propagate out from the lips.
The effect is to differentiate the signal, which is equivalent to applying a filter that boosts higher frequencies, as in Figure 7.7 of “Elements of Acoustic Phonetics” by Ladefoged.
Because this is a constant effect (independent of the settings of the articulators and of F0), it is common to omit this filter and include the effect in the source spectrum. Or, if the source is a simple pulse train with a flat spectrum, the vocal tract filter will include the lip radiation effect.
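Here is a small numpy sketch of that differentiation effect; the first difference used here is my own crude stand-in for true differentiation:

```python
import numpy as np

# Two sine components 4 octaves apart. After differencing, the higher
# one comes out about 24 dB stronger relative to the lower one
# (roughly +6 dB per octave).
fs = 16000
t = np.arange(fs) / fs                     # 1 second of signal
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1600 * t)

y = np.diff(x)                             # the "radiated" signal

X = np.abs(np.fft.rfft(x))
Y = np.abs(np.fft.rfft(y, n=len(x)))
for f in (100, 1600):
    # with a 1-second frame, FFT bin index equals frequency in Hz
    print(f"{f} Hz gain: {20 * np.log10(Y[f] / X[f]):.1f} dB")
```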
Don’t think of the instantaneous amplitude (i.e., the value of one sample of the waveform) as how loud the sound will be. That is not the case. The cochlea detects variations in pressure, not absolute pressure. So, it’s the movement of the waveform “up and down” that is important, and not the actual value of individual samples.
The pulse train could just as well be written as having the value -1 almost all the time and then going to 0 for a single sample at each pulse. It would sound the same.
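A quick numpy check of this claim (the pulse rate and signal length are arbitrary choices of mine):

```python
import numpy as np

# A pulse train sitting at 0 with spikes to 1, and the same train
# shifted down to sit at -1 with spikes to 0, differ only by a constant
# (DC) offset, which the ear cannot hear.
fs, f0 = 16000, 100
pulses = np.zeros(fs)
pulses[::fs // f0] = 1.0          # a spike of height 1 every 10 ms

shifted = pulses - 1.0            # -1 almost everywhere, 0 at each pulse

# Their spectra are identical except at 0 Hz (the DC component).
diff = np.abs(np.fft.rfft(pulses)) - np.abs(np.fft.rfft(shifted))
print(np.max(np.abs(diff[1:])))   # ~0: all non-DC components match
```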
Yes, what we are plotting on a waveform are deviations from the average pressure. These deviations can be positive (compression = air molecules are closer together than average) or negative (rarefaction = air molecules are further apart than average).
I should follow my own rule: Always label both axes!
The pulse train is just a waveform, so it’s in the time domain. You are correct that the horizontal axis is time. The vertical axis should be labelled “amplitude” (which we can think of as sound pressure).
The units of amplitude are arbitrary, and in this example the scale goes from 0 to 1 (all these pulses are positive). We could just as well have labelled it with the sample value (which would be from -32768 to +32767 for a 16-bit waveform, and so the pulses would each have an amplitude of 32767).
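A trivial illustration of that arbitrary scale (my own example): the same pulse train expressed on a 0-to-1 scale and as 16-bit integer samples.

```python
import numpy as np

pulses = np.zeros(1600)
pulses[::160] = 1.0                                # amplitude 1.0

pulses_int16 = (pulses * 32767).astype(np.int16)   # amplitude 32767
print(pulses.max(), pulses_int16.max())            # 1.0 32767
```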
In reply to: Simple Synthetic Vowel: how to make it sound more natural

Yes, one way would be to use a more complex source than the pulse train. This is what is done in Festival (in diphone and unit selection voices). The source waveform is something called the “residual” and is calculated so that the speech is almost perfectly reconstructed after that source signal is passed through the filter. In other words, the residual compensates for the fact that the filter is an oversimplification of the vocal tract.
We will touch on this at the end of the synthesis section of the course.
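To make the idea of a residual concrete, here is a sketch of inverse filtering with a single LPC analysis over a whole file. This is an oversimplification of my own (a real system does this frame-by-frame), and the file name is a placeholder:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder file name
a = librosa.lpc(y, order=16)          # all-pole model 1/A(z); a[0] == 1.0
residual = lfilter(a, [1.0], y)       # inverse filter: apply A(z) to y

# Passing the residual back through 1/A(z) reconstructs the speech.
reconstructed = lfilter([1.0], a, residual)
print(np.max(np.abs(reconstructed - y)))   # ~0, up to numerical error
```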
In reply to: The spectrum of a pure tone is not a perfect vertical line

That’s a good question, but one with a rather technical answer.
First, it’s worth remembering that we usually view the spectrum on a log scale, which exaggerates the effect.
The short answer is that this is a consequence of analysing a short region of the signal that – in general – will not contain a perfect integer number of complete cycles of the waveform. Therefore, we have to multiply the waveform by a tapered window to avoid discontinuities at the start and end (see my blog post about what happens without a tapered window).
Fading the signal in and out with the tapered window effectively changes its frequency content: for example, our pure sine wave is no longer precisely a pure sine wave (i.e., it now contains some other frequencies, introduced by the application of the window function).
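Here is a small numpy demonstration of that spreading; the frame length and frequency are arbitrary choices of mine:

```python
import numpy as np

# A sine that fits the analysis frame exactly gives a single-bin "line";
# after a Hann taper, the energy spreads into neighbouring bins.
fs, n = 16000, 512
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1000 * t)   # 1000 Hz = exactly 32 cycles in 512 samples

plain = np.abs(np.fft.rfft(x))
tapered = np.abs(np.fft.rfft(x * np.hanning(n)))

peak = np.argmax(plain)                                       # bin 32
print("no taper:", np.round(plain[peak - 2 : peak + 3], 1))   # one sharp line
print("Hann:    ", np.round(tapered[peak - 2 : peak + 3], 1)) # spread to neighbours
```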
This article gives a good, longer answer. Scroll down to “Windowing” and Figure 10, then read onwards to Figure 13. After that, it becomes a “my window function is better than your window function” competition.
The Wikipedia entry “Window function” has a long shopping list of slightly different window functions. Otherwise, I think that article is long but not very illuminating.