Forum Replies Created
Think about the order of processes in the front end pipeline. You’ll see that Part Of Speech (POS) tagging is done before predicting intonation. We can describe POS as very shallow (i.e., without deep structure) syntactic information.
So, although we do not explicitly know the relationships between words, we do have some information about their role in the sentence, in the form of the sequence of POS tags for the words.
The sequence of POS tags is one of the main predictors for intonation. For example, if we wanted to use a CART to predict whether a particular word in a sentence should receive a pitch accent, then the POS of that word and the POS of the surrounding words would be good predictors.
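To make that concrete, here is a toy sketch of the idea (my own illustration, not Festival's implementation): a decision tree trained on a handful of invented examples, where the predictors are the POS tags of the previous, current and next words, and the thing being predicted is whether the current word carries a pitch accent. The data, tag set and feature names below are made up purely for illustration.

# A toy sketch (not Festival's implementation) of CART-style pitch accent
# prediction from POS context. The data, tag set and feature names are
# invented purely for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each example: POS of the previous, current and next word, plus whether
# the current word carries a pitch accent (1) or not (0).
examples = [
    ({"prev_pos": "dt", "pos": "nn",  "next_pos": "vbd"}, 1),   # "the CAT sat"
    ({"prev_pos": "nn", "pos": "vbd", "next_pos": "in"},  0),
    ({"prev_pos": "vbd", "pos": "in", "next_pos": "dt"},  0),
    ({"prev_pos": "in", "pos": "dt",  "next_pos": "nn"},  0),
    ({"prev_pos": "dt", "pos": "nn",  "next_pos": "."},   1),   # "... the MAT."
]

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform([features for features, _ in examples])
y = [accented for _, accented in examples]
tree = DecisionTreeClassifier().fit(X, y)

# Query: is a noun preceded by a determiner likely to be accented?
query = vectoriser.transform([{"prev_pos": "dt", "pos": "nn", "next_pos": "in"}])
print(tree.predict(query))   # [1] on this toy data

A real system would of course train such a tree on a speech corpus labelled with pitch accents, and would use richer features, but the principle is the same as in the toy example.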
See the last two videos in Module 4 of Speech Processing: Prosody prediction 1 and Prosody prediction 2.
If you’re interested in how sonority is calculated from broad phonetic class in Festival, this is hard-coded as follows:
if (p->val(f_vc) == "+")          // vowel-or-consonant == vowel
    return 5;
else if (p->val(f_ctype) == "l")  // consonant-type == liquid
    return 4;
else if (p->val(f_ctype) == "n")  // consonant-type == nasal
    return 3;
else if (p->val(f_cvox) == "+")   // consonant-voicing == voiced
    return 2;
else
    return 1;
and the phoneme set used by the lexicon will have those features specified in a table (manually created by a phonetician) looking something like this example (which happens to be for Spanish):
(#  - 0 - - - 0 0 -)
(a  + l 3 1 - 0 0 -)
(e  + l 2 1 - 0 0 -)
(i  + l 1 1 - 0 0 -)
(o  + l 3 3 - 0 0 -)
(u  + l 1 3 + 0 0 -)
(b  - 0 - - + s l +)
(ch - 0 - - + a a -)
(d  - 0 - - + s a +)
(f  - 0 - - + f b -)
(g  - 0 - - + s p +)
(j  - 0 - - + l a +)
(k  - 0 - - + s p -)
(l  - 0 - - + l d +)
(ll - 0 - - + l d +)
(m  - 0 - - + n l +)
(n  - 0 - - + n d +)
(ny - 0 - - + n v +)
(p  - 0 - - + s l -)
(r  - 0 - - + l p +)
(rr - 0 - - + l p +)
(s  - 0 - - + f a +)
(t  - 0 - - + s t +)
(th - 0 - - + f d +)
(x  - 0 - - + a a -)
Syllabification of out-of-dictionary words is rule-based, using sonority. Every vowel is assumed to be the nucleus of a syllable. The boundaries between syllables are placed at positions of minimum sonority.
This requires knowing the sonority of every phoneme in the set used by the current lexicon. In Festival, sonority is calculated from the broad phonetic class.
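As a rough sketch of that procedure (my own simplification in Python, not Festival's code), using a five-point sonority scale like the one in the hard-coded snippet above and an invented phone-to-sonority mapping:

# A simplified sketch of sonority-based syllabification (not Festival's
# implementation). Sonority values: 5 vowel, 4 liquid, 3 nasal,
# 2 voiced obstruent, 1 everything else. The mapping and the example
# word are invented for illustration.
SONORITY = {
    "a": 5, "e": 5, "i": 5, "o": 5, "u": 5,
    "l": 4, "r": 4,
    "m": 3, "n": 3,
    "b": 2, "d": 2, "g": 2,
    "p": 1, "t": 1, "k": 1, "s": 1,
}

def syllabify(phones):
    """Place a syllable boundary at the sonority minimum between
    each pair of vowel nuclei."""
    nuclei = [i for i, p in enumerate(phones) if SONORITY[p] == 5]
    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        # position of minimum sonority between the two nuclei
        between = range(left + 1, right + 1)
        cut = min(between, key=lambda i: SONORITY[phones[i]])
        boundaries.append(cut)
    # split the phone string at those boundaries
    syllables, start = [], 0
    for cut in boundaries + [len(phones)]:
        syllables.append(phones[start:cut])
        start = cut
    return syllables

print(syllabify(list("mantra")))   # [['m', 'a', 'n'], ['t', 'r', 'a']]

The example word and the tie-breaking behaviour are my own choices; the core idea is just the one described above – boundaries placed at positions of minimum sonority between the vowel nuclei.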
A good reference for sonority would be this classic textbook
Giegerich, H. J. (1992) “English Phonology: an Introduction” Cambridge University Press, Cambridge, UK.
(Heinz Giegerich is the Professor of English Linguistics at Edinburgh University)
The Nyquist frequency only depends on the sampling rate of the waveform you are analysing (it is half the sampling rate), and not on the software package you are using.
Wavesurfer always shows the spectrogram up to the Nyquist frequency.
Praat, by default, only shows up to 5kHz (even when the Nyquist frequency is higher than this value) because this band is of most interest for speech analysis. You can configure this in the spectrogram settings.
So, let me rephrase your question as
Why is 16kHz the most common sampling rate used for speech waveforms?
and see if you can answer that…
Try the Handbook of Digital Signal Processing, Chapter 1 – Transforms and Transform Properties, and only read the material on the Fourier transform (the other sorts of transforms are interesting too, but not needed for the Speech Processing course)
DOI: 10.1016/B978-0-08-050780-4.50006-0 – ebook freely available if accessed from the University network, or via this alternative link (EASE authenticated, should work from anywhere)
and here are some slides about formant bandwidth, from an old Phonetics course that I used to teach
Sure – it’s a very simple task, but it does require you to learn how to use Wavesurfer, which is something you’ll need later in the course.
Ladefoged is being a little sloppy with his use of the term “repetitive” because there are several sorts of repetition going on in this figure, including one at 100Hz in 4.13(a) and another at about 700Hz (in all subfigures).
Let’s imagine that 4.13(d) is the impulse response of a vocal tract which has a single resonant frequency at 700Hz – it’s the “ringing” of that vocal tract after a single impulse excitation.
The waveform in Figure 4.13(d) is very similar to a single pitch period of the waveform in 4.13(a). In that case, the waveform in Figure 4.13(a) must be the output of the same vocal tract, but this time excited with a sequence of impulses (i.e., an “impulse train”) at 100Hz.
The reason that the spectrum in 4.13(a) has a line structure (i.e., with harmonics at multiples of a fundamental frequency) is because of the repetitive pattern at 100Hz: it’s the evidence of the excitation signal.
The reason that the spectrum in 4.13(d) does not have a line structure is because only a single pitch period is being analysed, and so there is no periodic excitation – just a single impulse.
Damping
The waveform in Figure 4.13(d) is quite like a sine wave at a frequency of 700Hz, except that it has decaying amplitude. If it were simply a 700Hz sine wave of constant amplitude, then its spectrum would be a single vertical line at 700Hz with “no width”. But the decay means that it’s not quite a sine wave. And “not quite” means something very specific: it must contain other frequencies. That’s why the spectrum isn’t just a single line, but has a width. This is called the bandwidth and is related to the rate of decay.
The decay is a consequence of a physical process called damping: the vocal tract gradually absorbs energy from the signal and so the signal’s amplitude decays over time. More damping (due to softer, fleshier vocal tract walls!) would make the decay faster.
The take-home message
What we are seeing in 4.13(d) is an important property of a linear system, which we’ll be mentioning in the lecture about TD-PSOLA. The Fourier transform (i.e., spectrum) of the impulse response is exactly the same thing as the frequency response of the system (e.g., the filter).
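If you want to see this for yourself, here is a small sketch (my own illustration, not taken from Ladefoged): it builds a decaying 700Hz sinusoid as the impulse response of a single resonance, then passes a 100Hz impulse train through the same system by convolution. The sampling rate, decay constant and durations are just illustrative values. Plotting the two magnitude spectra shows a single broad peak (with some bandwidth) for the impulse response, and harmonic lines at multiples of 100Hz under the same envelope for the impulse-train output.

# Sketch: impulse response of a single 700Hz resonance vs. the same
# system excited by a 100Hz impulse train. All values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

fs = 16000                       # sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)    # 100 ms of signal

# Impulse response: a 700Hz sinusoid with exponentially decaying
# amplitude (the decay rate sets the bandwidth of the resonance).
impulse_response = np.exp(-60 * t) * np.sin(2 * np.pi * 700 * t)

# Excitation: an impulse train at 100Hz (one impulse every 10 ms).
excitation = np.zeros_like(t)
excitation[:: fs // 100] = 1.0

# Output of the linear system = excitation convolved with impulse response.
output = np.convolve(excitation, impulse_response)[: len(t)]

freqs = np.fft.rfftfreq(len(t), 1 / fs)
for signal, label in [(impulse_response, "impulse response"),
                      (output, "100Hz impulse train through the same system")]:
    spectrum = 20 * np.log10(np.abs(np.fft.rfft(signal)) + 1e-9)
    plt.plot(freqs, spectrum, label=label)

plt.xlim(0, 2000)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude (dB)")
plt.legend()
plt.show()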
Yes – that’s correct.
Measuring the fundamental period for yourself from the waveform is a learning exercise, and so is calculating the fundamental frequency from the fundamental period. Real waveforms don’t tell you their fundamental frequency in their filename!
This task is very easy for the sine wave, pulse train, and square wave. But it’s not always quite so easy for the speech waveform, as you will discover when you try it for yourself.
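As a reminder of the relationship you need (the numbers here are just an illustration, not values from the exercise): the fundamental frequency is the reciprocal of the fundamental period,
F0 = 1 / T0
so a fundamental period of 10 ms (0.010 s) corresponds to F0 = 1 / 0.010 = 100 Hz.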
The Nyquist frequency is defined and discussed in this blog post
which is listed in the “Start” tab of the “Foundations -> Signals” material here
which was the material you needed to cover before today’s 9.00 lecture. I perhaps didn’t make it clear enough how to prepare for the foundation lectures – I will clarify that in a message to the class now.
Recordings (sound + screen capture) of the current year’s lectures can be found on Learn. Left-hand menu “Recordings & slides”, then “Media Hopper Replay”. There’s a small bug at the moment, with recordings stopping at 10.49 and losing the last minute of the lecture. I am trying to get that fixed.
Ladefoged’s point is simple, but maybe his explanation is not. He is saying that the amplitude of a sound and its frequency are independent properties. We can change one, without changing the other.
Let’s see if another example makes it clearer. Let’s pick a nice simple source of sound – a violin string.
A violin player can independently control the note being played (i.e., the frequency) and how loud that note is (i.e., the amplitude).
Now, given that a player can independently control these, and that a listener will perceive them, the physical signal that is transmitted through the air must have two separate physical properties: one that corresponds to the note’s frequency and another that corresponds to its amplitude.
The physical manifestation of amplitude is the amount by which air particles move back-and-forth. The physical manifestation of frequency is how many times per second they move back-and-forth.
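In terms of a simple formula (my notation, not Ladefoged’s), a pure tone can be written as
x(t) = A sin(2π f t)
where the amplitude A and the frequency f are separate parameters: either one can be changed without affecting the other, which is exactly the independence described above.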
Changing the frequency is shown in Ladefoged’s Fig. 2.3. Changing the amplitude is in Fig. 2.1.
(The perceptual correlate of amplitude is loudness. The perceptual correlate of frequency is pitch. In both cases, the relationship between the physical property and the percept is non-linear.)
The male/female flag provided to the make_f0 script just sets some reasonable values for a set of parameters passed to the pda program. Read the make_f0 script to see what these are, and then read the manual for pda to understand which ones refer to pre- or post-processing.
-L means low-pass filtering, which is pre-processing
-d 1 means decimation (downsampling), but the value 1 means that no decimation is actually used – this is pre-processing
-P means peak tracking, in other words dynamic programming – this is post-processing
When building voices for Festival, we could use any pitch tracker we liked. In the practical exercise, a tool from the Edinburgh Speech Tools library called pda (pitch determination algorithm) is used, which implements the “super resolution pitch determination algorithm”. We could just as well have used Talkin’s “RAPT” method, which is available in a program called get_f0.
The small differences between methods are not important for your understanding of the general principles behind pitch tracking.
I recommend only trying to understand RAPT; section 3.1 of the paper will tell you exactly which version of the autocorrelation function is used in that method.
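To make the general principle concrete, here is a minimal sketch of autocorrelation-based F0 estimation for a single analysis frame (my own simplification – it is neither pda nor RAPT, and the search range, frame length and voicing threshold are illustrative):

# Minimal sketch of autocorrelation-based F0 estimation for one frame.
# A simplification for illustration only – not pda and not RAPT.
import numpy as np

def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Return an F0 estimate (Hz) for one analysis frame, or None if
    no clear peak is found in the autocorrelation function."""
    frame = frame - np.mean(frame)
    # Autocorrelation for all non-negative lags.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = acf / acf[0]   # normalise so that lag 0 has value 1

    # Only search lags corresponding to plausible fundamental periods.
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max]))

    # A weak peak suggests the frame is unvoiced.
    if acf[best_lag] < 0.3:
        return None
    return fs / best_lag

# Example: a synthetic 120 Hz "voiced" frame of 40 ms at 16 kHz.
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(estimate_f0(frame, fs))   # should be close to 120 Hz

Real pitch trackers add pre-processing (e.g., low-pass filtering) and post-processing (e.g., dynamic programming over candidate peaks), as discussed above for pda.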