Phoneme

The source-filter model brings together our understanding of speech signals, speech production, and phonetics. It can generate any speech sound: any phoneme.

We already developed the source-filter model, and we used it to understand speech production, and then we used it to synthesise speech.
So make sure you've fully understood this model before proceeding, because now we're going to use it to understand what a phoneme is.
Speakers have conscious control over many aspects of their speech.
They can choose how and where to generate the basic sound, such as vocal fold vibration, or frication at a constriction.
They can then modify that basic sound by passing it through the vocal tract filter, which they can vary the shape of.
As always, a simplified model will help us understand that.
So here's a reminder of the source-filter model.
This is the filter, and it's most convenient to think about it (as always) in the frequency domain.
So here is its frequency response.
This particular frequency response has two peaks, which are modelling the resonances of the vocal tract: they're the formants.
If we choose to excite this filter with a periodic signal - an impulse train that simulates the vocal folds - then we'll generate a voiced sound with this particular spectral envelope, and that will be a vowel.
When the vocal tract shape changes, the formant frequencies vary, and the speech varies.
We make a different vowel.
By choosing an appropriate frequency response - in other words, appropriate formant values - a speaker can make any vowel they like.
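As a rough illustration of that idea (not part of the video), here's a minimal Python sketch of an impulse train passed through a two-formant filter; the formant frequencies, bandwidths and other numbers are illustrative assumptions only.

```python
# Minimal source-filter sketch of a vowel. All numbers are illustrative.
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz)
f0 = 120                         # fundamental frequency of the source (Hz)
source = np.zeros(fs // 2)       # half a second of signal
source[::fs // f0] = 1.0         # impulse train simulating the vocal folds

def resonator(freq, bandwidth):
    """Second-order all-pole filter modelling one formant."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    return [1.0], [1.0, -2 * r * np.cos(theta), r ** 2]

# Two formants in cascade; changing F1 and F2 changes the vowel.
vowel = source
for f, bw in [(700, 80), (1100, 90)]:     # roughly open-vowel-like values
    b, a = resonator(f, bw)
    vowel = lfilter(b, a, vowel)
```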
If we characterise this filter by these two formant frequencies, it seems quite natural to plot them on a chart with two dimensions: one for each of those frequencies.
F1 is the first formant; F2 is the second formant.
We've plotted vowel space, and this is clearly a continuum.
By manipulating the shape of the vocal tract, we could make any value of these two formants that we like, limited only by the physical dimensions of the vocal tract: the minimum and maximum achievable frequencies.
Try for yourself.
Try making this vowel in this corner - that's [ɪ] - and then gradually change it to an [e].
Pause the video.
To make an [ɪ] vowel, your tongue has to be high in the mouth and forwards.
To change that to an [e], mostly what you need to do is lower the tongue.
So this dimension is to do with the height of your tongue in your mouth.
Now, make the [ɪ] vowel again and gradually change it this time to an [ɔ:].
Pause the video.
What did you do that time?
Well, mainly you moved your tongue backwards.
So this dimension is something to do with how advanced in the oral cavity your tongue is.
I've drawn this plot in a very particular way with the origin in this corner here.
That's a bit unusual, but the reason is so that these axes roughly correspond to the height of the tongue and its horizontal position, sometimes called 'front-back' or 'advancement'.
That corresponds with the speaker facing to the left.
That's why I've always been drawing this guy facing left, because it matches up with this vowel chart and that's how the IPA draws it.
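To make that orientation concrete, here's a small, hypothetical plotting sketch; the formant values are rough textbook-style approximations, not measurements from the video.

```python
# Plot a few vowels in F1-F2 space, oriented like the IPA chart:
# front vowels on the left, close (high) vowels at the top.
import matplotlib.pyplot as plt

vowels = {"i": (300, 2300), "e": (450, 2000), "a": (750, 1200), "u": (300, 900)}

fig, ax = plt.subplots()
for symbol, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(symbol, (f2, f1))

ax.invert_xaxis()                # high F2 (front vowels) on the left
ax.invert_yaxis()                # low F1 (close vowels) at the top
ax.set_xlabel("F2 (Hz) ~ advancement (front-back)")
ax.set_ylabel("F1 (Hz) ~ height")
plt.show()
```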
Vowel space is clearly acoustically a continuum.
You can make any combination of the two formants limited only by physics.
But it's linguistically reasonable to make these two axes discrete.
That's because speakers can only control the tongue position with some amount of precision, and listeners cannot hear very small differences in the formant values.
On the IPA vowel chart, there are 3 possible advancement positions and 4 possible height positions, which gives you 12 vowels.
Oh, except for some in-between ones that don't quite fit that pattern.
Oh, and they also seem to come in pairs, where one of each pair involves rounding the lips, which just makes the vocal tract a little bit longer.
With the two dimensions of advancement and height, plus a third dimension of rounding, we could make an enormous number of vowels.
No language in the world makes use of all of those vowels.
I mean, imagine having to learn that language!
But we'll come back to that in the video on pronunciation.
That covers the vowels.
The source-filter model is a very natural way to explain how we could make so many different vowels.
Let's talk about some consonants.
In previous videos, we already made the unvoiced fricatives [s] and [ʃ].
We did that by changing the source to a random number generator: that makes a sound called white noise.
It has a flat spectrum but no harmonic structure. Putting that through an appropriate frequency response produces these fricatives.
If we change the frequency response, we change the fricative.
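In the same sketchy Python terms as before, an unvoiced fricative just swaps the source for white noise; the single broad resonance below is only an assumed stand-in for an [s]-like spectral envelope.

```python
# White-noise source through a resonant filter: a crude fricative.
import numpy as np
from scipy.signal import lfilter

fs = 16000
noise = np.random.randn(fs // 2)            # aperiodic source: white noise

centre, bandwidth = 4500, 1000              # [s]-like energy concentration (assumed)
r = np.exp(-np.pi * bandwidth / fs)
theta = 2 * np.pi * centre / fs
fricative = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], noise)
```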
Our source-filter model has two possible sources: a periodic one and an aperiodic one.
But why not have both at the same time?
Make the sound [s] and then start vibrating your vocal folds as if you're also making a vowel at the same time.
Pause the video.
Yes, it's possible!
What phoneme did you make?
From [s] you made [z].
Our source-filter model could do that.
We're just going to add the two sources together before they go through the filter.
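Continuing the same sketch, mixing the two sources before one shared filter gives something [z]-like; the mixing weight and filter values are again assumptions for illustration only.

```python
# Add a periodic source and an aperiodic source, then filter the sum.
import numpy as np
from scipy.signal import lfilter

fs, f0, n = 16000, 120, 8000
voicing = np.zeros(n)
voicing[::fs // f0] = 1.0                   # impulse train (vocal fold vibration)
noise = 0.1 * np.random.randn(n)            # frication noise, scaled down

mixed_source = voicing + noise              # both sources at the same time

centre, bandwidth = 4500, 1000              # same illustrative fricative-like filter
r = np.exp(-np.pi * bandwidth / fs)
theta = 2 * np.pi * centre / fs
voiced_fricative = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], mixed_source)
```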
This is great!
We can have this source, or this source, or both.
We can choose the frequency response of the filter.
We can make lots and lots of different sounds.
It seems that we can make almost any combination of the different features of our model and therefore create many, many different sounds.
Here's the row of the IPA consonant chart for fricatives: look how many there are.
We could make fricatives at every possible place in the vocal tract, from using both lips all the way back to the glottis.
At every one of those places, we can either have voicing or not.
How many of these fricatives can you make, from the languages you speak?
Pause the video.
I can only make a restricted range of these fricatives.
I can make the ones from here to here.
I can make [f] / [v] , [θ] / [ð] , [s] / [z] , [ʃ] / [ʒ].
That's a lot of fricatives and we've got a lot of vowels, but let's keep going.
There's more to speech than vowels and fricatives.
There are more things we can control to make more combinations.
If we can find just a few more things to control, then we'll be able to make even more combinations and make a lot more sounds.
Here's how it works for fricatives.
We make a constriction somewhere in the vocal tract, without completely closing it.
We force air through.
That makes turbulence - that's the source of sound - and that's filtered by the remaining vocal tract.
By moving the constriction to another place, we can make a different fricative, because there's a different amount of vocal tract in front of the point of constriction.
What about a whole different way of making the sound in the first place? Instead of making a constriction that produces turbulence, how about completely closing the vocal tract?
Make a complete closure.
Air is pushed up from the lungs and, just like at the vocal folds, eventually this closure will give way under pressure and burst open, and produce an explosive pulse of air that travels through the remaining vocal tract.
Unlike the vocal folds, this process doesn't repeat: this is one-shot.
We get a single plosive.
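In source-filter terms, the plosive's excitation is a single impulse rather than a repeating train. Here is a minimal sketch of that, with purely illustrative filter values.

```python
# One-shot excitation: a single impulse (the burst) through a resonant filter.
import numpy as np
from scipy.signal import lfilter

fs = 16000
burst = np.zeros(fs // 4)
burst[0] = 1.0                              # single impulse: the release of the closure

freq, bandwidth = 1500, 200                 # assumed resonance of the remaining tract
r = np.exp(-np.pi * bandwidth / fs)
theta = 2 * np.pi * freq / fs
plosive = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r ** 2], burst)
```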
Again, we can move that place of the closure to somewhere else in the vocal tract: maybe here.
Air pressure builds up; we can't contain it; the closure bursts open, and we make a plosive sound.
We can vary the place at which the articulation happens.
We can also vary the manner in which that sound is created.
The manner of articulation of a fricative is to make a constriction and force air through a narrow gap.
The manner of articulation of a plosive is to make a complete closure and then for that to burst open under pressure: to explode.
That gives us another row in the consonant chart: the plosives.
There are lots of plosives - not quite as many as fricatives, but still at many different places along the vocal tract.
Like fricatives, many of them occur in voiced and unvoiced alternatives.
In the consonant chart, the horizontal axis is the place in the vocal tract where the articulation takes place.
The vertical axis is manner.
'Manner' means the configuration of the vocal tract created by some interaction between the articulators: for example, contact between them.
I'm not going to go through this whole chart because this is not a course on phonetics.
This video is just to help you understand the connection between phonetics and the source-filter model, and how that model helps us understand how all of these different sounds could be created by combining features of the model.
The story in this video, then, was one of contrastive features: a fairly small number of features, each with a relatively small number of possible values, that, in combination, results in a very large number of possible speech sounds.
So many sounds that no single language uses all of them.
Wait a minute!
This video is called 'phoneme'.
I have not actually defined the phoneme yet.
All I've explained is how it's possible to make many, many different sounds.
The real definition of phoneme has to come in the next video, in 'Pronunciation'.
That's because it's language-specific.
Each individual language has a phoneme inventory comprising just some of all the many sounds in the consonant and vowel charts from the IPA.
The IPA is mainly intended for descriptive purposes: for example, documenting a language, or writing down the pronunciations of words.
It's not quite good enough for generating speech, because of co-articulation.
That's the way in which sounds are affected by their context.
The IPA does not capture that in its symbols.
Later on, we're going to need to account for that.
In fact, not just in synthesis but also in Automatic Speech Recognition, contextual variation caused by co-articulation is a key thing that we need to account for in our models of speech.
We'll encounter that for the first time in speech synthesis, where we'll concatenate fragments of recorded speech to generate new sentences.
We won't concatenate recorded phonemes.
We'll record units called 'diphones', which are in fact the units of co-articulation.
From those, through simple concatenation of recorded waveforms, we'll be able to generate synthetic speech.
