› Forums › Speech Processing – Live Q&A Sessions › Module 3 – Digital Speech Signals
- This topic has 4 replies, 3 voices, and was last updated 6 months, 2 weeks ago by
Catherine Lai.
October 4, 2025 at 12:22 #18438
When doing Fourier analysis, my understanding is that we are breaking down a complex wave into multiple simple sine waves, giving the harmonics (multiples) of F0. The part I’m confused about is how formants are related to harmonics: are they the frequencies at which harmonics have the largest amplitudes? If so, how are formants independent of F0, as described in the quiz for module 4?
October 6, 2025 at 17:58 #18440
Hi Shahed,
There are a few different things going on here, which we will actually talk about more in the module 4 lecture (the Source-Filter model). I’ll lay out a brief summary here which will hopefully make the relationships clearer (though maybe it is more than you are asking for!).
1. Voice source, F0 and its harmonics: You can think of vocal fold vibrations as generating sound at a particular F0. That F0 corresponds to the rate of vocal fold vibration (opening and closing). Each opening results in a short burst of air pressure (i.e., an impulse). Because this is happening in a tube, with air bouncing back and forth, multiples of F0 are also amplified (i.e., harmonics). If F0 is 100 Hz, then you are going to get harmonics at 100 Hz, 200 Hz, 300 Hz, etc.
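To make this concrete, here is a small numpy sketch of my own (not part of the course materials) that treats the voice source as an idealised impulse train, one impulse per vocal fold opening, and checks that its energy lands at exact multiples of F0:

```python
import numpy as np

fs = 8000          # sampling rate (Hz)
f0 = 100           # fundamental frequency (Hz)
n = fs             # one second of signal

# Idealised glottal source: one impulse every 1/f0 seconds
# (one "burst" of air pressure per vocal fold opening).
source = np.zeros(n)
source[::fs // f0] = 1.0

# Magnitude spectrum of the source.
spectrum = np.abs(np.fft.rfft(source))
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Energy sits only at exact multiples of f0: 0, 100, 200, 300, ... Hz.
peaks = freqs[spectrum > 0.5 * spectrum.max()]
print(peaks[:4])   # → [  0. 100. 200. 300.]
```

Change `f0` to 150 and the comb of peaks moves to multiples of 150 Hz instead.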
2. Vocal tract as filter: If you just had the voice box and no vocal tract above it, you’d just get F0 and its harmonics. This is equivalent to the sound source “duck call” here: https://annex.exploratorium.edu/exhibits/vocal_vowels/vocal_vowels.html
Thankfully we do have vocal tracts! The shape that we create with our mouths and tongue position creates a filter that boosts some of the harmonics (the ones that come from the voice source) and dampens others. The boosted frequencies correspond to the resonances of the vocal tract (as opposed to the vocal source), and are what we call formants in a spectrogram. We usually see these as dark bands spanning several harmonics. When we have our tongue in a roughly central position (the “schwa” vowel, ə), the formants are usually around 500 Hz (F1), 1500 Hz (F2), and 2500 Hz (F3) (we’ll talk a bit about tube models this week). These formant frequencies vary with tongue position.
So formants are independent of F0 in the sense that you can change your F0 but keep your vocal tract the same shape (same vowel, different pitch), and vice versa. The frequencies we see as boosted/dampened in a spectrogram will be harmonics (as that is what comes from the sound source), but the fact that they are boosted/dampened is independently controlled by the vocal tract shape (the filter).
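Here is a schematic numpy sketch of that independence (my own toy illustration: the “vocal tract” is just made-up Gaussian bumps at the schwa formant frequencies, not a real tube model). Changing F0 moves the harmonic grid, but the strongest harmonic in the first-formant region stays near the fixed 500 Hz resonance:

```python
import numpy as np

fs = 8000
n = fs
freqs = np.fft.rfftfreq(n, d=1 / fs)

def harmonic_spectrum(f0):
    """Flat line spectrum at multiples of f0 (an idealised voice source)."""
    spec = np.zeros_like(freqs)
    spec[(freqs % f0 == 0) & (freqs > 0)] = 1.0
    return spec

def formant_filter(centres=(500, 1500, 2500), bw=120):
    """Toy vocal-tract response: Gaussian bumps at the schwa formants."""
    return sum(np.exp(-0.5 * ((freqs - fc) / bw) ** 2) for fc in centres)

filt = formant_filter()
low = freqs < 1000                    # look around the first formant only
for f0 in (100, 150):                 # change the pitch, keep the "vowel"
    out = harmonic_spectrum(f0) * filt
    strongest = freqs[low][np.argmax(out[low])]
    print(f0, strongest)              # prints: 100 500.0, then 150 450.0
```

With F0 = 100 Hz the 500 Hz harmonic wins; with F0 = 150 Hz it is the 450 Hz harmonic, the one closest to the unchanged 500 Hz formant. The filter stayed put; only the source grid moved.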
3. The Discrete Fourier Transform: The DFT picks out what “pure tone” frequencies are in an input waveform. The spectrogram shading is basically a representation of the magnitude of specific frequencies from a set of frequencies determined by the input window size and sampling rate. This is independent of the F0 and harmonics of the input speech signal. The default window in Praat is small enough that you generally don’t see the fine detail of the harmonics in the spectrogram, just the formants as a kind of blurring over them (see module 2 lab).
The frequencies that the DFT can pick out faithfully depend on the number of input samples you give it. So for a fixed sampling rate, the longer the time window, the more samples. If my sampling rate were 8000 Hz and my input window 20 samples, I would be able to pick out 20 frequencies equally spaced between 0 and 8000 Hz: 0 Hz, 400 Hz, 800 Hz, and so on. However, because of the Nyquist limit (frequencies above half the sampling rate alias), only the first half of these, up to half the sampling rate, are actually meaningful.
If we were to use this window size, but our F0 was 100 Hz, we wouldn’t be able to see the harmonics (multiples of 100 Hz) at all. Those frequencies would get picked up but they would be more or less aggregated over those 400 Hz bins, so blurred out on the frequency axis.
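You can check this bin spacing yourself with numpy (a sketch of mine using the plain DFT grid, not Praat’s actual analysis settings):

```python
import numpy as np

fs = 8000   # sampling rate (Hz)
n = 20      # window length in samples

# A 20-sample DFT can only distinguish frequencies on a grid
# spaced fs / n = 400 Hz apart: 0, 400, 800, ..., 4000 Hz.
bins = np.fft.rfftfreq(n, d=1 / fs)
print(bins)

# A 100 Hz component does not fall on this grid, so its energy
# is smeared ("leaked") across the nearby bins rather than
# showing up as a clean harmonic line.
t = np.arange(n) / fs
spec = np.abs(np.fft.rfft(np.sin(2 * np.pi * 100 * t)))
```

Make the window longer (say `n = 800`, i.e. 100 ms) and the grid spacing drops to 10 Hz, fine enough to separate 100 Hz harmonics, which is exactly the narrowband/wideband spectrogram trade-off.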
So, you can think of the source, filter, and the DFT as 3 separate things. The settings of the DFT (i.e., in terms of input size, sampling rate) determine the time and frequency resolution of a spectrogram and hence what exactly you can see of the source and the filter in a spectrogram.
cheers,
Catherine
October 8, 2025 at 18:23 #18442
Hi guys, as I was going through the answers for Lab 3 (the first signal lab), I was confused by this part of the answer for the tasks on phase shifts: “With a negative phase shift you move the sinusoid “back” in time, so at with a phase shift of -90 degrees, we start with an amplitude of -1.” I get the first part, which is on moving the sinusoid back in time, but why does it make the amplitude -1? Thanks! Also, what is the answer for this question “What happens to the combined wave if you change the second sine wave to be very close but not quite the same as the first sine wave (e.g. 300 vs 310 Hz)?”, I didn’t find the answer in the answer sheet. Thank you!
October 9, 2025 at 12:16 #18444
Hi Zhujun,
We can think of a sine wave as the vertical projection of a “clock hand” vector rotating around a circle, starting at the coordinate (1, 0) as in the attached figure. At each step forward in time we move counter-clockwise; moving backward in time is then equivalent to rotating clockwise. In this way, positive phase shifts move the rotating vector counter-clockwise (forward), while negative phase shifts move it clockwise (backward).
So, a phase shift of -90 degrees (-pi/2) moves the starting point (i.e., the point associated with time 0) clockwise (backwards) to the coordinate (0, -1), as in the attached image, and the vertical projection of that point is -1. Once it starts rotating it will keep going counter-clockwise (see the animation in the lab notes). So it’s as if you push the default sine wave to the right and start at the -1 amplitude point.
Since we’re essentially going endlessly around the circle on the left hand side of the attached figure, we can also see that a -90 degree phase shift gets us to the same starting point as a +270 degree (3pi/2) phase shift. So they are actually equivalent.
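If it helps, you can verify both points numerically with a quick numpy sketch (mine, not from the lab notebook):

```python
import numpy as np

t = np.linspace(0, 0.02, 1000)   # 20 ms of time
f = 100                          # any frequency works here

default = np.sin(2 * np.pi * f * t)              # starts at amplitude 0
shifted = np.sin(2 * np.pi * f * t - np.pi / 2)  # -90 degree phase shift

# The shifted wave starts at amplitude -1 ...
print(shifted[0])                                # → -1.0

# ... and a -90 degree shift is the same wave as a +270 degree shift,
# since both land on the same point of the circle.
plus270 = np.sin(2 * np.pi * f * t + 3 * np.pi / 2)
print(np.allclose(shifted, plus270))             # → True
```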
October 9, 2025 at 12:50 #18448
For the question about 310 Hz and 300 Hz sine waves being added together, you need to zoom out a bit to actually see what’s happening. In the attached image I’ve zoomed out so that the x-axis (time) goes from 0 to 0.5 seconds.
You’ll see from the combined wave (3rd plot) that we now have a low-frequency component visible. There are in fact 5 cycles of this in the 0.5 seconds plotted, so this is a new 10 Hz component. If you do this yourself in the notebook and listen to the sound, you’ll be able to hear it as a pulsing sound. This is often termed a ‘beat’ frequency.
What’s happening here is that the two sine waves are interfering with each other. Adding them together sometimes boosts the signal (‘constructive’ interference) but sometimes dampens it (‘destructive’ interference). This produces the regular pattern of amplitude changes we see in that figure.
If you’ve ever played a string instrument (e.g., the violin) you would probably have learned to listen for these sort of beats when tuning the instrument (presence of beat frequencies means your strings are out of tune relative to each other).
Here the beat frequency (10 Hz) is the difference in Hz between the two input frequencies (310 − 300 Hz); in this case it also happens to equal their greatest common divisor.
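You can verify the sum-to-product identity behind the beats in a few lines of numpy (my sketch, not from the lab materials):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                    # one second of time
f1, f2 = 300, 310

combined = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# Sum-to-product identity: sin(a) + sin(b) = 2 cos((a-b)/2) sin((a+b)/2).
# So 300 Hz + 310 Hz is a 305 Hz tone whose amplitude swells and fades
# with a 2*cos(2*pi*5*t) envelope -- |cos| peaks twice per cycle,
# which is what you hear as 10 beats per second.
envelope = 2 * np.cos(2 * np.pi * (f2 - f1) / 2 * t)
carrier = np.sin(2 * np.pi * (f1 + f2) / 2 * t)
print(np.allclose(combined, envelope * carrier))   # → True
```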
It’s not in scope for this course, but you can see more details here:
https://www.animations.physics.unsw.edu.au/waves-sound/interference/index.html#7.4 (video)
https://www.animations.physics.unsw.edu.au/jw/beats.htm (text with derivations and other info)
Thanks for flagging that the answer to that was missing in the answer sheet!
cheers,
Catherine