Vocal tract resonance & formants

A speaker can vary their vocal tract shape to change its resonant frequencies, and therefore the spectral envelope of the speech they are producing.


We've understood the idea of resonant tubes.
We know the vocal tract is a tube, so if you put energy into the vocal tract at (or close to) its resonant frequency, you get a large response: a large output.
So that's how it's possible for your vocal folds to create really loud sounds.
Think about shouting or singing.
An opera singer doesn't use a microphone yet can be heard over a full orchestra of 60 people.
That's amazing.
We need to understand how the vocal tract shape controls these resonant frequencies so that we can use them to send a message to a listener.
We're going to understand that the vocal tract has multiple resonances and then give them their linguistic name of 'formants'.
Our understanding of resonance started in the time domain.
That's where the physical processes occur.
But, it's so often the case that it's easier to describe it in the frequency domain.
Let's plot what the output of a resonator might look like in response to input at a particular frequency.
Let's make a plot on some axes where this is going to be the frequency of the energy input into the resonator.
This is going to be the magnitude of the response (of the output).
Always label the axes!
Let's have this going like this, in kHz.
Let's imagine a resonator that happens to resonate, I don't know, at 3.5 kHz.
If we put energy in at that resonant frequency, we get a large output.
If we put energy in far away from that resonant frequency, we get little or no output.
At frequencies very close to the resonant frequency, we'll get some output.
So the response curve of our resonator might look like this.
It's a peak.
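A curve like the one just described can be sketched numerically. The snippet below is only an illustration: it assumes a textbook second-order resonator formula, and the function name `resonator_response` and the sharpness parameter `q` are hypothetical choices, not something taken from the video. Only the 3.5 kHz resonant frequency comes from the example above.

```python
import numpy as np

def resonator_response(freq_hz, resonant_hz=3500.0, q=10.0):
    """Magnitude response of a simple second-order resonator (illustrative).

    resonant_hz: the resonant frequency (3.5 kHz, as in the example above).
    q: assumed sharpness of the peak; a larger q gives a narrower peak.
    """
    ratio = np.asarray(freq_hz, dtype=float) / resonant_hz
    return 1.0 / np.sqrt((1.0 - ratio**2) ** 2 + (ratio / q) ** 2)

# Frequency axis from 0 to 8 kHz in 10 Hz steps.
freqs_hz = np.linspace(0.0, 8000.0, 801)
response = resonator_response(freqs_hz)

# The largest output occurs very close to the resonant frequency.
peak_hz = freqs_hz[np.argmax(response)]
print(f"peak of the response curve is near {peak_hz / 1000:.2f} kHz")
```

Plotting `response` against `freqs_hz` gives a single peak: large output near 3.5 kHz, some output close by, and little output far away.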
Now, a peak in the spectrum reminds me of something we've seen in speech.
We get energy peaks when we look at, for example, the spectral envelope.
Let's develop our understanding of how these peaks are a consequence of resonance in the vocal tract.
But let's just go back to physics for a moment because we need to understand that the vocal tract actually has multiple resonances.
There's our curved vocal tract.
We simplify it as usual, as a straight tube.
We'll only model it in one dimension, so we'll forget that it's a round tube.
Just draw a 1-dimensional picture of it.
We've understood that this has resonance.
We know that it has at least one resonant frequency, and that's related to its length.
I've drawn a slightly more realistic tube than I did last time.
It's now open at one end.
So this end is the glottis.
Here are the lips.
More often than not, when we're speaking, our lips need to be open.
Sound waves produced at the glottis propagate down the tube and when they reach the open end, they are still reflected back.
Now, the process by which a sound wave is reflected by an open-ended tube is absolutely fascinating!
But it's something I'm afraid we're going to skip quickly past because this isn't a course on acoustics, but on speech processing.
Likewise, if this was a course on acoustics, we would also have to explain that even this plain tube with uniform cross-sectional area has multiple resonant frequencies, and that those are odd multiples of the lowest resonant frequency.
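For anyone who does want a little of that detail, here is a minimal sketch. It assumes the standard quarter-wavelength result for a uniform tube closed at one end (the glottis) and open at the other (the lips), where the resonances fall at odd multiples of the lowest one. The tube length of 17.5 cm and the speed of sound of 350 m/s are assumed values for illustration, not figures from the video.

```python
SPEED_OF_SOUND_M_PER_S = 350.0  # assumed value for warm, moist air

def uniform_tube_resonances(length_m, count=3, c=SPEED_OF_SOUND_M_PER_S):
    """Resonant frequencies (Hz) of a uniform tube closed at one end, open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]

# Assumed tube length of 17.5 cm.
resonances_hz = uniform_tube_resonances(0.175)
print([round(f) for f in resonances_hz])  # approximately 500, 1500 and 2500 Hz
```

A longer tube gives lower resonances and a shorter tube gives higher ones, which is the relationship with length mentioned earlier.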
Those details we can gloss over because our goal here is to understand that by changing the shape of the tube, we can have multiple resonant frequencies.
So let's do that.
Here's a tube with varying cross-sectional area.
There's a back tube and it has a particular length.
There's a front tube and that has some length.
Each of these tubes will have its own resonant frequency (or multiple resonances, of course).
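As a very rough sketch of that idea, we could pretend the two tubes are acoustically independent and that each behaves like a tube closed at one end and open at the other. Both of those are simplifying assumptions made here for illustration (real tube sections are coupled to each other), and the lengths and the speed of sound below are hypothetical values, not measurements from the video.

```python
SPEED_OF_SOUND_M_PER_S = 350.0  # assumed value

def quarter_wave_resonance_hz(length_m, c=SPEED_OF_SOUND_M_PER_S):
    """Lowest resonance of a tube treated as closed at one end and open at the other."""
    return c / (4.0 * length_m)

back_tube_length_m = 0.09   # hypothetical back tube length (9 cm)
front_tube_length_m = 0.08  # hypothetical front tube length (8 cm)

back_resonance_hz = quarter_wave_resonance_hz(back_tube_length_m)
front_resonance_hz = quarter_wave_resonance_hz(front_tube_length_m)

print(f"back tube:  about {back_resonance_hz:.0f} Hz")
print(f"front tube: about {front_resonance_hz:.0f} Hz")
```

Changing either length moves that tube's resonance, which is the point: a different tube shape gives a different set of resonant frequencies.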
I want you to vary the shape of your own vocal tract in as many different ways as possible.
Try and make this shape, for example.
Think about what parts of your anatomy you are moving when you do that.
Pause the video.
Hopefully, you found quite a few different ways to do that - to, for example, change the tube shape to this one.
At the very least, you can open and close your jaw.
You can move your tongue up and down or front to back, and you could protrude your lips.
That's called 'rounding'.
In other words, you have conscious control over your vocal tract shape, and therefore you have control over its resonant frequencies.
So what do these resonances look like in the spectrum of a speech signal?
Back to the frequency domain.
We'll draw a plot now of a tube with multiple resonant frequencies that's being excited by a periodic signal.
I'm keeping it simple, and I'm assuming there are only two resonant frequencies.
In reality, there can be more - in fact a variable number - but it's the first two that are most important.
The peaks are called formants and their frequencies are the formant frequencies.
Here's the first one, and here's the second one.
They have names.
The lowest one is always called the 'first formant' or F1, and the second one is always called the 'second formant' or F2.
That notation is usually taken to mean the frequency of those formants.
Now this notation is potentially confusing because of F0, the fundamental frequency of the vocal folds.
They're all frequencies, but they're coming from very different sources.
F0 is the rate of vibration of the vocal folds.
F1 and F2 - and any higher formants, if there are any - are properties of the vocal tract.
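The difference between F0 and the formants can be seen in a small numerical sketch. Everything specific here is an assumption for illustration: the values of F0, F1 and F2, the bandwidths, and the bell-shaped `formant_peak` function (a hypothetical helper, not a model from the video). The periodic source contributes energy only at multiples of F0, and the tube's resonances decide which of those harmonics come out strongest.

```python
import numpy as np

def formant_peak(freq_hz, centre_hz, bandwidth_hz):
    """One bell-shaped peak in the spectral envelope (illustrative shape only)."""
    return 1.0 / (1.0 + ((freq_hz - centre_hz) / (bandwidth_hz / 2.0)) ** 2)

f0_hz = 100.0                  # assumed rate of vibration of the vocal folds
f1_hz, f2_hz = 500.0, 1500.0   # assumed formant frequencies

# A periodic source has energy only at the harmonics: F0, 2*F0, 3*F0, ...
harmonics_hz = f0_hz * np.arange(1, 41)

# The vocal tract shapes the harmonic amplitudes: one peak per formant.
envelope = (formant_peak(harmonics_hz, f1_hz, 100.0)
            + formant_peak(harmonics_hz, f2_hz, 100.0))

# The two strongest harmonics are the ones sitting on the two formant peaks.
strongest_hz = sorted(harmonics_hz[np.argsort(envelope)[-2:]].tolist())
print("harmonic spacing (set by F0):", np.diff(harmonics_hz)[0], "Hz")
print("strongest harmonics (set by F1 and F2):", strongest_hz)
```

Changing `f0_hz` moves the harmonics closer together or further apart without moving the envelope peaks; changing `f1_hz` or `f2_hz` moves the peaks without changing the harmonic spacing.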
By starting from an explanation of the physics in the time domain, we've reached an understanding that the vocal tract can be modelled as a tube with varying cross-sectional area, which means it's got variable tubes within it, and those tubes have their own resonances.
Those resonances are called formants.
What we need to do now is to get from there to a complete computational model that can generate speech signals.
We're going to generalise this idea of a resonant tube into the idea of a filter.
That's something that takes an input and produces an output.
The vocal tract is a filter.
The input, for example, is the energy from the vocal folds, and the output is speech.
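To make the idea of input, filter and output concrete, here is a minimal sketch. It assumes a particular kind of filter, a two-pole digital resonator, which is one common way to model a single resonance; the video has not yet said what kind of filter will be used. The sample rate, F0, formant frequencies and bandwidths are all assumed values, and the impulse train is only a crude stand-in for the energy from the vocal folds.

```python
import numpy as np

def resonator(signal, centre_hz, bandwidth_hz, sample_rate_hz):
    """Two-pole digital resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate_hz)   # pole radius, set by bandwidth
    a1 = 2.0 * r * np.cos(2.0 * np.pi * centre_hz / sample_rate_hz)
    a2 = -r * r
    out = np.zeros(len(signal))
    for n in range(len(signal)):
        out[n] = signal[n]
        if n >= 1:
            out[n] += a1 * out[n - 1]
        if n >= 2:
            out[n] += a2 * out[n - 2]
    return out

sample_rate_hz = 16000
f0_hz = 100.0                                   # assumed fundamental frequency

# Input: one second of impulses, one every 160 samples (100 per second).
source = np.zeros(sample_rate_hz)
source[:: int(sample_rate_hz / f0_hz)] = 1.0

# Filter: two resonances in series, at assumed formant frequencies.
after_f1 = resonator(source, 500.0, 80.0, sample_rate_hz)
speech_like = resonator(after_f1, 1500.0, 100.0, sample_rate_hz)

# Output: its magnitude spectrum has harmonics of F0 shaped by two peaks.
spectrum = np.abs(np.fft.rfft(speech_like))
freqs_hz = np.fft.rfftfreq(len(speech_like), d=1.0 / sample_rate_hz)
```

The same `source` passed through resonators at different centre frequencies gives a different output spectrum, while the harmonic spacing stays at F0.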