When we look at a waveform, what we are seeing is amplitude on the vertical axis. Intensity, which is a measure of the energy the waveform is carrying, is proportional to amplitude squared.
We don’t need to get hung up on this. Although amplitude does have units (it is the sound pressure, which has units of Newtons per square metre), we don’t usually write these units on the vertical axis of a waveform. That’s because our microphone and soundcard are not calibrated.
It’s also important to remember that neither amplitude nor intensity is the same thing as loudness, which is a perceptual phenomenon and varies with the frequency of the sound.
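Because intensity is proportional to amplitude squared, a ratio of amplitudes maps directly to a ratio of intensities. Here is a minimal sketch of that relationship (the helper name `relative_intensity_db` is made up for illustration):

```python
import math

def relative_intensity_db(a1, a2):
    """Relative intensity, in decibels, between two amplitudes.

    Intensity is proportional to amplitude squared, so the intensity
    ratio is (a2/a1)**2, which in decibels is 20 * log10(a2/a1).
    """
    return 20.0 * math.log10(a2 / a1)

# Doubling the amplitude quadruples the intensity: about +6 dB
print(round(relative_intensity_db(1.0, 2.0), 2))  # 6.02
```

Note that only the *ratio* matters here, which is consistent with the point above: since the microphone and soundcard are not calibrated, absolute values are not meaningful anyway.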
You are right that some vowels have what is called “intrinsic pitch” (which should really be “intrinsic F0”). The effect is small.
This article by Ohala & Eukel lays out some explanations for this in terms of vocal tract physiology.
I’m not sure how perceptually relevant this effect is. In a unit selection system, the effect will implicitly be taken care of because the system uses natural recordings of speech.
For words not in the dictionary, the letter-to-sound model (a classification tree) is used to predict the pronunciation. For each letter in the word, the classification tree predicts the phoneme (or epsilon, or two phonemes).
The predictors are the letter currently being considered and some context around that (e.g., +/- 3 letters); the predictee is the phoneme.
Let’s assume that your example word “fine” is not in the lexicon. When predicting the sound for the letter “i”, the predictors will be:
null null f i n e null
so we can see that the word-final “e” is one of the predictors, and so is available to the classification tree when predicting the sound of the letter “i”. For your other example word “fin”, the predictors will be
null null f i n null null
and since the predictors are different, the classification tree is able to separate the two cases using the question
Is the next-next letter = “e”
which has the answer YES for “fine” and NO for “fin”.
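The window extraction above can be sketched in a few lines. This is not Festival’s actual code, just an illustration of how a +/- 3 letter window padded with “null” is built (`letter_context` is a made-up helper name):

```python
def letter_context(word, i, width=3):
    """Predictors for the letter at position i: the letter itself plus
    'width' letters of context on each side, padded with 'null' beyond
    the word boundaries."""
    padded = ["null"] * width + list(word) + ["null"] * width
    # position i in the word is position i + width in the padded list;
    # take the window centred on it
    return padded[i : i + 2 * width + 1]

print(letter_context("fine", 1))  # ['null', 'null', 'f', 'i', 'n', 'e', 'null']
print(letter_context("fin", 1))   # ['null', 'null', 'f', 'i', 'n', 'null', 'null']
```

The two windows differ at the “next-next letter” position, which is exactly what the classification tree’s question inspects.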
Yes, the words in the training set are hand-labelled with the pronunciation: this is just a dictionary. See this topic.
At synthesis time, the dictionary will be used in preference to the letter-to-sound model for all words in the dictionary. The letter-to-sound model will only be used for words not in the dictionary.
A1: Pitch is often described on a musical scale. This is a relative scale in which 1 octave corresponds to a doubling in fundamental frequency, and an octave is divided into 12 semitones. This musical scale is effectively log F0. It is therefore common to use log F0 instead of actual F0 when modelling it (e.g., for speech synthesis). Another unit that is widely used to describe frequencies on a perceptual scale is the Mel scale. For more about this, see Automatic Speech Recognition.
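The relationship between the musical scale and F0 can be made concrete with a one-line formula: the interval in semitones between two F0 values is 12 times the log (base 2) of their ratio. A small sketch (the helper name `semitones` is made up for illustration):

```python
import math

def semitones(f0_a, f0_b):
    """Musical interval, in semitones, from f0_a up to f0_b.

    One octave (a doubling of F0) is 12 semitones, so the interval
    is 12 * log2(f0_b / f0_a). This is why log F0 is the natural
    domain for modelling pitch."""
    return 12.0 * math.log2(f0_b / f0_a)

print(round(semitones(100.0, 200.0), 1))  # 12.0 (one octave)
print(round(semitones(100.0, 150.0), 1))  # 7.0 (approximately a perfect fifth)
```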
A2: Pitch is the perceptual consequence of F0. Pitch is qualitative (i.e., we need human listeners to describe their perceptions) and F0 is quantitative (i.e., we can measure it objectively from a signal). In speech, they are directly related, and for our purposes it is fine to state that our perception of pitch depends only on F0.
A3: F0 mainly depends on suprasegmental properties of an utterance and not the individual phones in it. Any vowel can be spoken at any F0 (within reason) and still be perceived as that same vowel.
A4: Correct: the quality of a vowel is determined by its formant frequencies and not its F0. See A3 above.
This algorithm is for preparing the training set for a letter-to-sound model (e.g., a classification tree). The end result of the algorithm is a single alignment between letters and phonemes, for each word in the training set (i.e., a pre-existing pronunciation dictionary).
It’s important to realise that, across the whole training set, a particular letter (e.g., “c”) might align with different phonemes (sometimes /k/, sometimes /ch/, etc.) in different words. It won’t necessarily align with the same phoneme every time.
So, how do we get to that single alignment? We use a simple unigram model of the probability of each letter aligning with each phoneme. Most of the probabilities in this model will be zero, and the only non-zero probabilities are for those letter-phoneme pairs given in the allowables list.
The key machine learning concept to understand in this algorithm is that of first initialising this unigram model and then iteratively improving the model.
To initialise, and then to improve the model, we need an alignment for all words in the training set, so that we can count how many times each phoneme aligns with each letter. The allowables lists are used to find the first alignment. The model is then updated, and then this improved model is used to find a better alignment.
If the allowables list for a particular letter only contained a single phoneme, then that letter would always have to align with that phoneme. But in general, the allowables lists will have many phonemes for each letter.
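The model-update step described above can be sketched as follows. This is a simplified illustration, not Festival’s actual alignment code: `update_model` and the toy alignments are made up, and “_” stands for epsilon (a letter aligning with no phoneme).

```python
from collections import defaultdict

def update_model(alignments):
    """One model-update step: count how often each letter aligns with
    each phoneme across the whole training set, then normalise the
    counts into unigram probabilities P(phoneme | letter)."""
    counts = defaultdict(lambda: defaultdict(float))
    for pairs in alignments:  # one aligned word = list of (letter, phoneme)
        for letter, phoneme in pairs:
            counts[letter][phoneme] += 1.0
    model = {}
    for letter, ph_counts in counts.items():
        total = sum(ph_counts.values())
        model[letter] = {ph: c / total for ph, c in ph_counts.items()}
    return model

# Toy training set: "cat" -> /k ae t/ and "cell" -> /s eh l/ ("_" = epsilon)
alignments = [
    [("c", "k"), ("a", "ae"), ("t", "t")],
    [("c", "s"), ("e", "eh"), ("l", "l"), ("l", "_")],
]
model = update_model(alignments)
print(model["c"])  # {'k': 0.5, 's': 0.5}
```

In the real algorithm, this update alternates with a re-alignment step: the improved probabilities are used to score candidate alignments and pick the best one for each word, and the cycle repeats.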
You are correct – we measure the entropy of each half of the split, and sum these values (weighted by occupancy) to get the total entropy value for that question.
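As a sketch (with hypothetical helper names, not any particular toolkit’s code), the occupancy-weighted total entropy for a question can be computed like this:

```python
import math

def entropy(labels):
    """Entropy, in bits, of the label distribution in one partition."""
    total = len(labels)
    counts = {}
    for x in labels:
        counts[x] = counts.get(x, 0) + 1
    e = 0.0
    for c in counts.values():
        p = c / total
        e -= p * math.log2(p)
    return e

def split_entropy(left, right):
    """Total entropy for a question: the entropy of each half of the
    split, weighted by occupancy (the fraction of data points that
    fall into that half)."""
    n = len(left) + len(right)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

# A question that separates the data perfectly has total entropy 0
print(split_entropy(["k", "k"], ["s", "s"]))  # 0.0
```

The tree-building algorithm picks the question with the lowest total entropy (equivalently, the biggest reduction in entropy relative to the parent node).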
This is well beyond the scope of what we need to know about POS tagging for speech synthesis! We just need to know that
– POS tagging is very accurate for languages where a large corpus of hand-tagged data is available to train the tagger
– a typical method is HMMs + n-grams, with the model being trained on that corpus

The numbers 0, 1, 2 appended to phoneme names refer to lexical stress:
0 = unstressed
1 = primary stress
2 = secondary stress

For phoneme sequences predicted by letter-to-sound, sometimes only 0 and 1 are used, so you might find more than one syllable with stress of “1”.
Syllables are indicated by the bracketing: in the example above, the first syllable is (k ae t) and it has a stress of “1”
Festival has its own algorithm for syllabifying phoneme sequences that have been predicted by the letter-to-sound module. This algorithm does not follow the “maximum onset” principle, and exactly what the method does is a bit obscure to me (I didn’t write that part of Festival).
The method for old diphones voices is different to that for newer unit selection voices (such as the one you will be using for the assessed practical).
A tube that is closed at both ends will have resonances at all (both odd and even) multiples of the lowest resonant frequency. The sound wave travels as follows:
- starts at the closed end (vocal folds)
- travels to the other closed end
- reflects back
- travels back to the first closed end
- meets the next pulse from the vocal folds, is in phase with it, and so gets bigger
The wave has to travel 2 times the length of the tube before meeting the next pulse, in order to be “in step” with it.
For a tube that is closed at one end and open at the other, it’s a little different. The sound wave reflects from the closed end, just as in the other case, but when a wave is reflected from the open end, it is inverted. Therefore, the wave needs to travel as follows before it will be “in step” (i.e., in phase) with the next wave:
- starts at the closed end (vocal folds)
- travels to the open end
- reflects back, but is inverted in the process
- the inverted wave travels to the closed end
- is reflected and remains inverted
- travels to the open end
- reflects back, and is inverted again (so is back to ‘normal’)
- travels back to the closed end
- meets the next pulse from the vocal folds, is in phase with it, and so gets bigger
You see that the wave needs to traverse 4 times the length of the tube this time.
The higher frequency resonances of these tubes occur by putting pulses in more frequently. Try to work that out using the same reasoning as above.
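To make the arithmetic concrete, here is a sketch that turns the travel distances described above into resonant frequencies. It assumes a speed of sound of roughly 350 m/s (a common approximation in acoustic phonetics); the function names are made up for illustration.

```python
def resonances_closed_closed(length_m, n=3, c=350.0):
    """First n resonances of a tube closed at both ends: the wave must
    travel 2L to be in step with the next pulse, so the lowest resonance
    is F1 = c / (2L), and ALL integer multiples of F1 resonate."""
    f1 = c / (2.0 * length_m)
    return [k * f1 for k in range(1, n + 1)]

def resonances_closed_open(length_m, n=3, c=350.0):
    """First n resonances of a tube closed at one end and open at the
    other: the wave must travel 4L, so F1 = c / (4L), and only ODD
    multiples of F1 resonate."""
    f1 = c / (4.0 * length_m)
    return [(2 * k - 1) * f1 for k in range(1, n + 1)]

# A 17.5 cm tube, roughly an adult male vocal tract:
print([round(f) for f in resonances_closed_open(0.175)])    # [500, 1500, 2500]
print([round(f) for f in resonances_closed_closed(0.175)])  # [1000, 2000, 3000]
```

The closed-open case gives the familiar approximate formant frequencies of a uniform vocal tract (schwa-like vowel): 500, 1500, 2500 Hz.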
In principle, yes, this is possible in Festival. But it would require advanced Scheme programming skills and a deep understanding of how Festival is implemented.
Instead, try using Praat to modify F0 (and optionally also duration) of either a natural sentence, or a synthetic waveform saved from Festival. What do you need to modify to move the perceived stress from one syllable to another?
Radiation is what happens at the lips as sound waves within the vocal tract are propagated out into the free air.
A detailed understanding of the physics is a little beyond our needs, but here’s a simple way to understand what is happening at the lips:
The sound pressure variations inside the vocal tract are due to waves propagating up and down the tube and being reflected back at both the ends. The air within the vocal tract is approximately, on average, stationary (forget about the flow caused by breathing – it’s very slow compared to the speed of sound).
The radiation effect is what happens when this trapped “piston” of air in the vocal tract causes the air in the free field outside the lips to move, creating sound waves that propagate out from the lips.
The effect is to differentiate the signal, which has the same effect as imposing a filter that boosts higher frequencies, as in Figure 7.7 of “Elements of Acoustic Phonetics” by Ladefoged.
Because this is a constant effect (independent of the settings of the articulators and of F0), it is common to omit this filter and include the effect in the source spectrum. Or, if the source is a simple pulse train with a flat spectrum, the vocal tract filter will include the lip radiation effect.
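In discrete time, differentiation can be approximated by a first-difference filter, y[n] = x[n] − x[n−1], whose gain grows with frequency (roughly +6 dB per octave at low frequencies). A small sketch, not taken from any particular textbook or toolkit:

```python
import math

def diff_gain(freq_hz, fs=16000.0):
    """Magnitude response of the first-difference filter
    y[n] = x[n] - x[n-1], which approximates differentiation:
    |H(f)| = 2 * sin(pi * f / fs)."""
    return 2.0 * math.sin(math.pi * freq_hz / fs)

# The gain roughly doubles with each octave at low frequencies,
# i.e., higher frequencies are boosted by about +6 dB/octave
for f in (125.0, 250.0, 500.0, 1000.0):
    print(round(diff_gain(f), 4))
```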
Don’t think of the instantaneous amplitude (i.e., the value of one sample of the waveform) as how loud the sound will be. That is not the case. The cochlea detects variations in pressure, not absolute pressure. So, it’s the movement of the waveform “up and down” that is important, and not the actual value of individual samples.
The pulse train could just as well be written as having the value -1 almost all the time and then going to 0 for a single sample at each pulse. It would sound the same.
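This equivalence is easy to check numerically: a pulse train going from 0 up to 1 and one going from −1 up to 0 differ only by a constant offset (DC), so their sample-to-sample variations, which are what the ear detects, are identical. A toy sketch:

```python
# Two versions of the same pulse train over two periods of 8 samples:
# train_a pulses from 0 up to 1; train_b pulses from -1 up to 0.
period = 8
train_a = [1.0 if n % period == 0 else 0.0 for n in range(16)]
train_b = [0.0 if n % period == 0 else -1.0 for n in range(16)]

# The "up and down" movement: successive sample-to-sample differences
diffs_a = [train_a[n] - train_a[n - 1] for n in range(1, 16)]
diffs_b = [train_b[n] - train_b[n - 1] for n in range(1, 16)]
print(diffs_a == diffs_b)  # True: the variations are identical
```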
Yes, what we are plotting on a waveform are deviations from the average pressure. These deviations can be positive (compression = air molecules are closer together than average) or negative (rarefaction = air molecules are further apart than average).