Forum Replies Created
Think about the order of processes in the front end pipeline. You’ll see that Part Of Speech (POS) tagging is done before predicting intonation. We can describe POS as very shallow (i.e., without deep structure) syntactic information.
So, although we do not explicitly know the relationships between words, we do have some information about their role in the sentence, in the form of the sequence of POS tags for the words.
The sequence of POS tags is one of the main predictors for intonation. For example, if we wanted to use a CART to predict whether a particular word in a sentence should receive a pitch accent, then the POS of that word and the POS of the surrounding words would be good predictors.
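To make that concrete, here is a toy sketch of the idea (my own illustration, not Festival's implementation): a decision tree trained on a handful of invented examples, where the predictors are the POS tags of the previous, current and next words, and the thing being predicted is whether the current word carries a pitch accent. The data, tag set and feature names below are made up purely for illustration.

# A toy sketch (not Festival's implementation) of CART-style pitch accent
# prediction from POS context. The data, tag set and feature names are
# invented purely for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each example: POS of the previous, current and next word, plus whether
# the current word carries a pitch accent (1) or not (0).
examples = [
    ({"prev_pos": "dt", "pos": "nn",  "next_pos": "vbd"}, 1),   # "the CAT sat"
    ({"prev_pos": "nn", "pos": "vbd", "next_pos": "in"},  0),
    ({"prev_pos": "vbd", "pos": "in", "next_pos": "dt"},  0),
    ({"prev_pos": "in", "pos": "dt",  "next_pos": "nn"},  0),
    ({"prev_pos": "dt", "pos": "nn",  "next_pos": "."},   1),   # "... the MAT."
]

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform([features for features, _ in examples])
y = [accented for _, accented in examples]
tree = DecisionTreeClassifier().fit(X, y)

# Query: is a noun preceded by a determiner likely to be accented?
query = vectoriser.transform([{"prev_pos": "dt", "pos": "nn", "next_pos": "in"}])
print(tree.predict(query))   # [1] on this toy data

A real system would of course train such a tree on a speech corpus labelled with pitch accents, and would use richer features, but the principle is the same as in the toy example.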
See the last two videos in Module 4 of Speech Processing: Prosody prediction 1 and Prosody prediction 2.
If you’re interested in how sonority is calculated from broad phonetic class in Festival, this is hard-coded as follows:
if (p->val(f_vc) == "+")          // vowel-or-consonant == vowel
    return 5;
else if (p->val(f_ctype) == "l")  // consonant-type == liquid
    return 4;
else if (p->val(f_ctype) == "n")  // consonant-type == nasal
    return 3;
else if (p->val(f_cvox) == "+")   // consonant-voicing == voiced
    return 2;
else
    return 1;
and the phoneme set used by the lexicon will have those features specified in a table (manually created by a phonetician) looking something like this example (which happens to be for Spanish):
(#  - 0 - - - 0 0 -)
(a  + l 3 1 - 0 0 -)
(e  + l 2 1 - 0 0 -)
(i  + l 1 1 - 0 0 -)
(o  + l 3 3 - 0 0 -)
(u  + l 1 3 + 0 0 -)
(b  - 0 - - + s l +)
(ch - 0 - - + a a -)
(d  - 0 - - + s a +)
(f  - 0 - - + f b -)
(g  - 0 - - + s p +)
(j  - 0 - - + l a +)
(k  - 0 - - + s p -)
(l  - 0 - - + l d +)
(ll - 0 - - + l d +)
(m  - 0 - - + n l +)
(n  - 0 - - + n d +)
(ny - 0 - - + n v +)
(p  - 0 - - + s l -)
(r  - 0 - - + l p +)
(rr - 0 - - + l p +)
(s  - 0 - - + f a +)
(t  - 0 - - + s t +)
(th - 0 - - + f d +)
(x  - 0 - - + a a -)
Syllabification of out-of-dictionary words is rule-based, using sonority. Every vowel is assumed to be the nucleus of a syllable. The boundaries between syllables are placed at positions of minimum sonority.
This requires knowing the sonority of every phoneme in the set used by the current lexicon. In Festival, sonority is calculated from the broad phonetic class.
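As a rough sketch of that procedure (my own simplification in Python, not Festival's code), using a five-point sonority scale like the one in the hard-coded snippet above and an invented phone-to-sonority mapping:

# A simplified sketch of sonority-based syllabification (not Festival's
# implementation). Sonority values: 5 vowel, 4 liquid, 3 nasal,
# 2 voiced obstruent, 1 everything else. The mapping and the example
# word are invented for illustration.
SONORITY = {
    "a": 5, "e": 5, "i": 5, "o": 5, "u": 5,
    "l": 4, "r": 4,
    "m": 3, "n": 3,
    "b": 2, "d": 2, "g": 2,
    "p": 1, "t": 1, "k": 1, "s": 1,
}

def syllabify(phones):
    """Place a syllable boundary at the sonority minimum between
    each pair of vowel nuclei."""
    nuclei = [i for i, p in enumerate(phones) if SONORITY[p] == 5]
    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        # position of minimum sonority between the two nuclei
        between = range(left + 1, right + 1)
        cut = min(between, key=lambda i: SONORITY[phones[i]])
        boundaries.append(cut)
    # split the phone string at those boundaries
    syllables, start = [], 0
    for cut in boundaries + [len(phones)]:
        syllables.append(phones[start:cut])
        start = cut
    return syllables

print(syllabify(list("mantra")))   # [['m', 'a', 'n'], ['t', 'r', 'a']]

The example word and the tie-breaking behaviour are my own choices; the core idea is just the one described above – boundaries placed at positions of minimum sonority between the vowel nuclei.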
A good reference for sonority would be this classic textbook
Giegerich, H. J. (1992) “English Phonology: an Introduction” Cambridge University Press, Cambridge, UK.
(Heinz Giegerich is the Professor of English Linguistics at Edinburgh University)
The Nyquist frequency only depends on the sampling rate of the waveform you are analysing (it is half the sampling rate), and not on the software package you are using.
Wavesurfer always shows the spectrogram up to the Nyquist frequency.
Praat, by default, only shows up to 5kHz (even when the Nyquist frequency is higher than this value) because this band is of most interest for speech analysis. You can configure this in the spectrogram settings.
So, let me rephrase your question as
Why is 16kHz the most common sampling rate used for speech waveforms?
and see if you can answer that…
Try the Handbook of Digital Signal Processing, Chapter 1 – Transforms and Transform Properties, and only read the material on the Fourier transform (the other sorts of transforms are interesting too, but not needed for the Speech Processing course)
DOI: 10.1016/B978-0-08-050780-4.50006-0 – ebook freely available if accessed from the University network, or via this alternative link (EASE authenticated, should work from anywhere)
and here are some slides about formant bandwidth, from an old Phonetics course that I used to teach
Sure – it’s a very simple task, but it does require you to learn how to use Wavesurfer, which is something you’ll need later in the course.
Ladefoged is being a little sloppy with his use of the term “repetitive” because there are several sorts of repetition going on in this figure, including one at 100Hz in 4.13(a) and another at about 700Hz (in all subfigures).
Let’s imagine that 4.13(d) is the impulse response of a vocal tract which has a single resonant frequency at 700Hz – it’s the “ringing” of that vocal tract after a single impulse excitation.
The waveform in Figure 4.13(d) is very similar to a single pitch period of the waveform in 4.13(a). In that case, the waveform in Figure 4.13(a) must be the output of the same vocal tract, but this time excited with a sequence of impulses (i.e., an “impulse train”) at 100Hz.
The reason that the spectrum in 4.13(a) has a line structure (i.e., with harmonics at multiples of a fundamental frequency) is because of the repetitive pattern at 100Hz: it’s the evidence of the excitation signal.
The reason that the spectrum in 4.13(d) does not have a line structure is because only a single pitch period is being analysed, and so there is no periodic excitation – just a single impulse.
Damping
The waveform in Figure 4.13(d) is quite like a sine wave at a frequency of 700Hz, except that it has decaying amplitude. If it were simply a 700Hz sine wave of constant amplitude, then its spectrum would be a single vertical line at 700Hz with “no width”. But the decay means that it’s not quite a sine wave. And “not quite” means something very specific: it must contain other frequencies. That’s why the spectrum isn’t just a single line, but has a width. This is called the bandwidth and is related to the rate of decay.
The decay is a consequence of a physical process called damping: the vocal tract gradually absorbs energy from the signal and so the signal’s amplitude decays over time. More damping (due to softer, fleshier vocal tract walls!) would make the decay faster.
The take-home message
What we are seeing in 4.13(d) is an important property of a linear system, which we’ll be mentioning in the lecture about TD-PSOLA. The Fourier transform (i.e., spectrum) of the impulse response is exactly the same thing as the frequency response of the system (e.g., the filter).
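If you want to see this for yourself, here is a small sketch (my own illustration, not taken from Ladefoged): it builds a decaying 700Hz sinusoid as the impulse response of a single resonance, then passes a 100Hz impulse train through the same system by convolution. The sampling rate, decay constant and durations are just illustrative values. Plotting the two magnitude spectra shows a single broad peak (with some bandwidth) for the impulse response, and harmonic lines at multiples of 100Hz under the same envelope for the impulse-train output.

# Sketch: impulse response of a single 700Hz resonance vs. the same
# system excited by a 100Hz impulse train. All values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

fs = 16000                       # sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)    # 100 ms of signal

# Impulse response: a 700Hz sinusoid with exponentially decaying
# amplitude (the decay rate sets the bandwidth of the resonance).
impulse_response = np.exp(-60 * t) * np.sin(2 * np.pi * 700 * t)

# Excitation: an impulse train at 100Hz (one impulse every 10 ms).
excitation = np.zeros_like(t)
excitation[:: fs // 100] = 1.0

# Output of the linear system = excitation convolved with impulse response.
output = np.convolve(excitation, impulse_response)[: len(t)]

freqs = np.fft.rfftfreq(len(t), 1 / fs)
for signal, label in [(impulse_response, "impulse response"),
                      (output, "100Hz impulse train through the same system")]:
    spectrum = 20 * np.log10(np.abs(np.fft.rfft(signal)) + 1e-9)
    plt.plot(freqs, spectrum, label=label)

plt.xlim(0, 2000)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude (dB)")
plt.legend()
plt.show()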
Yes – that’s correct.
Measuring the fundamental period for yourself from the waveform is a learning exercise, and so is calculating the fundamental frequency from the fundamental period. Real waveforms don’t tell you their fundamental frequency in their filename!
This task is very easy for the sine wave, pulse train, and square wave. But it’s not always quite so easy for the speech waveform, as you will discover when you try it for yourself.
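As a reminder of the relationship you need (the numbers here are just an illustration, not values from the exercise): the fundamental frequency is the reciprocal of the fundamental period,
F0 = 1 / T0
so a fundamental period of 10 ms (0.010 s) corresponds to F0 = 1 / 0.010 = 100 Hz.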
The Nyquist frequency is defined and discussed in this blog post
which is listed in the “Start” tab of the “Foundations -> Signals” material here
which was the material you needed to cover before today’s 9.00 lecture. I perhaps didn’t make it clear enough how to prepare for the foundation lectures – I will clarify that in a message to the class now.
Recordings (sound + screen capture) of the current year’s lectures can be found on Learn. Left-hand menu “Recordings & slides”, then “Media Hopper Replay”. There’s a small bug at the moment, with recordings stopping at 10.49 and losing the last minute of the lecture. I am trying to get that fixed.
Ladefoged’s point is simple, but maybe his explanation is not. He is saying that the amplitude of a sound and its frequency are independent properties. We can change one, without changing the other.
Let’s see if another example makes it clearer. Let’s pick a nice simple source of sound – a violin string.
A violin player can independently control the note being played (i.e., the frequency) and how loud that note is (i.e., the amplitude).
Now, given that a player can independently control these, and that a listener will perceive them, the physical signal that is transmitted through the air must have two separate physical properties: one that corresponds to the note’s frequency and another that corresponds to its amplitude.
The physical manifestation of amplitude is the amount by which air particles move back-and-forth. The physical manifestation of frequency is how many times per second they move back-and-forth.
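In terms of a simple formula (my notation, not Ladefoged’s), a pure tone can be written as
x(t) = A sin(2π f t)
where the amplitude A and the frequency f are separate parameters: either one can be changed without affecting the other, which is exactly the independence described above.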
Changing the frequency is shown in Ladefoged’s Fig. 2.3. Changing the amplitude is in Fig. 2.1.
(The perceptual correlate of amplitude is loudness. The perceptual correlate of frequency is pitch. In both cases, the relationship between the physical property and the percept is non-linear.)
The male/female flag provided to the make_f0 script just sets some reasonable values for a set of parameters passed to the pda program. Read the make_f0 script to see what these are, and then read the manual for pda to understand which ones refer to pre- or post-processing.
-L means low-pass filtering, which is pre-processing
-d 1 means decimation (downsampling), but the value 1 means that no decimation is actually used – this is pre-processing
-P means peak tracking, in other words dynamic programming – this is post-processing
When building voices for Festival, we could use any pitch tracker we liked. In the practical exercise, a tool from the Edinburgh Speech Tools library called pda (pitch determination algorithm) is used, which implements the “super resolution pitch determination algorithm”. We could just as well have used Talkin’s “RAPT” method, which is available in a program called get_f0.
The small differences between methods are not important for your understanding of the general principles behind pitch tracking.
I recommend only trying to understand RAPT; section 3.1 of the paper will tell you exactly which version of the autocorrelation function is used in that method.
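To make the general principle concrete, here is a minimal sketch of autocorrelation-based F0 estimation for a single analysis frame (my own simplification – it is neither pda nor RAPT, and the search range, frame length and voicing threshold are illustrative):

# Minimal sketch of autocorrelation-based F0 estimation for one frame.
# A simplification for illustration only – not pda and not RAPT.
import numpy as np

def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Return an F0 estimate (Hz) for one analysis frame, or None if
    no clear peak is found in the autocorrelation function."""
    frame = frame - np.mean(frame)
    # Autocorrelation for all non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = acf / acf[0]   # normalise so that lag 0 has value 1

    # Only search lags corresponding to plausible fundamental periods.
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max]))

    # A weak peak suggests the frame is unvoiced.
    if acf[best_lag] < 0.3:
        return None
    return fs / best_lag

# Example: a synthetic 120 Hz "voiced" frame of 40 ms at 16 kHz.
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(estimate_f0(frame, fs))   # should be close to 120 Hz

Real pitch trackers add pre-processing (e.g., low-pass filtering) and post-processing (e.g., dynamic programming over candidate peaks), as discussed above for pda.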