Module 2 – Acoustic Phonetics

We can analyze differences in the articulation of vowels and consonants in terms of acoustic phonetic features.

In module 2, we’ll look at specific patterns relating to consonants and vowels, and apply these patterns to the task of segmenting, annotating and extracting measures from various types of speech sounds. As you start to recognize the patterns of speech acoustics, keep thinking about the link between what you see in a visualisation of speech acoustics (e.g., a spectrogram) and what is going on with the physical articulators when people speak. If you’re starting to be able to recognize the acoustics of specific consonants and vowels, start thinking about how you might automate that process. What sort of phonetic transcription would be helpful for this?

This week you should try to watch the videos (in the ‘Videos’ tab for this module) before the lecture on Thursday. You can bring your questions to the lecture, or post them on the speech.zone forum.

Lecture slides

Lecture 2 slides (google slides) [updated 24/9/2024]

In these videos, we start to build up our knowledge of acoustics and link that to phonetics.  The first four videos are by Simon King and focus more on the acoustics/engineering side.  The last five videos are from the Virtual Linguistics Campus, introducing acoustic phonetics.  We’ll get our first glimpse at how we can use the frequency properties of speech sound waves (visualised as spectrograms) to figure out what someone said.  We’ll go into more depth about how we get spectrograms in module 3, but for now your task is to think about why using this sort of spectral representation of speech might be helpful for automatic speech recognition and synthesis.

Sound is a wave of pressure travelling through a medium, such as air. We can plot the variation in pressure against time to visualise the waveform.

This video just has a plain transcript, not time-aligned to the video.

Sound is a wave and it has to travel in a medium.
Here the medium's air.
So there's air in this space.
Sound is a pressure wave, so a wave of pressure is going to travel through this medium.
Let's make a simple sound: a hand clap.
When we do that, our hands trap some air between them.
That compresses the air: the pressure increases.
Then it escapes as a pulse of higher pressure air.
We can draw a picture of that high pressure air propagating as a wave through the medium.
That red line is indicating a higher pressure region of air.
So, this is our first representation of sound, its propagation through physical space where sound travels at a constant speed.
In air, that speed is about 340 metres per second, which means it takes about 3 seconds to travel a kilometre.
But rather than diagrams like this - of sound waves propagating through space and then disappearing - it's much more informative to make a record of that sound.
We can do that by picking a single point in space and measuring the variation in pressure at that point over time.
We make that measurement with a device and that device is a microphone.
So let's use a microphone to measure the pressure variation at a single point in space and then plot that variation against time.
So a plot needs some axes.
Here, the horizontal axis will be time and the vertical axis will be the amplitude of the pressure variation.
It's very important to label the axes of any plot with both the quantity being measured and its units.
This axis is 'time', so we label it with that quantity: time.
Time has only one unit.
The scientific unit of time is the second and that's written with just 's'.
On the vertical axis, we're going to measure the amplitude of the variation in pressure.
So I've put the quantity 'amplitude' and 0 is the ambient pressure.
But we don't actually have any units on this axis.
That's simply because our microphone normally is not a calibrated scientific instrument.
It just measures the relative variation in pressure and converts that into an electrical signal that is proportional to the pressure variation.
So we just mark the 0 amplitude point but don't normally specify any units.
Now we can make the measurement of our sound.
As a sound wave passes the microphone, the pressure at that point rises to be higher than normal and then drops to be lower than normal, and eventually settles back to the ambient pressure of the surrounding air.
Let's plot the output of the microphone and listen to the signal the microphone is now recording.
We're going to take the output of this microphone and we're going to record this signal - this electrical signal - on this plot.
Here's the plot we just made.
The plot is called a waveform and this is our first actually useful representation of sound.
This representation is in the time domain because the horizontal axis of the plot is time.
Later, we'll discover other domains in which we can represent sound, and we'll plot those using different axes.
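As a minimal sketch of this time-domain representation, here is how you might plot a waveform in Python, assuming a hypothetical recording called voice.wav (scipy and matplotlib are assumed to be installed):

```python
# A sketch of plotting a waveform: pressure variation (amplitude) against time.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

fs, samples = wavfile.read("voice.wav")   # fs = sampling rate, in samples per second
t = np.arange(len(samples)) / fs          # time of each sample, in seconds

plt.plot(t, samples)
plt.xlabel("time (s)")     # always label the quantity and its units...
plt.ylabel("amplitude")    # ...except amplitude: the microphone is not calibrated
plt.show()
```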
The waveform is useful for examining some properties of sound.
For example, here's a waveform of a bell sound.
We can see that, for example, the amplitude is clearly decaying over time.
This is a waveform of speech: 'voice'.
It's the word 'voice' and some of the things we can measure from this waveform would be, again that the amplitude is varying over time in some interesting way, and that this word has some duration.
We could enlarge the scale to see a little bit more detail.
This particular part of the waveform has something quite interesting going on.
It clearly has a repeating pattern; that looks like it's going to be important to understand.
But in contrast, let's look at some other part of this waveform.
Maybe this part here.
It doesn't matter how much you zoom in here, you won't find any repeating pattern.
This is less structured: it's a bit more random.
That's also going to be important to understand.
So far, we've talked about directly plotting the output from a microphone.
Microphones essentially produce an electrical signal: a voltage.
That's an analogue signal: it's proportional to the pressure that's being measured.
But actually we're going to do all of our speech processing with a computer.
Computers can't store analogue signals: they are digital devices.
We're going to need to understand how to represent an analogue signal from a microphone as a digital signal that we can store and process in a computer.
We also already saw that sounds vary over time.
In fact, speech has to vary, because it's carrying a message.
So we'll need to analyse not whole utterances of speech, but just parts of the signal over short periods of time.
Speech varies over time for many reasons, and that's controlled by how it's produced.
So we need to look into speech production, and the first aspect of that that we need to understand is 'What is the original source of sound when we make speech?'
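Before moving on, here is a minimal sketch of the digital idea just mentioned: a computer stores a signal as a sequence of samples measured at discrete points in time. The numbers here (16,000 samples per second, a 100 Hz sine wave) are illustrative choices, not anything from the video:

```python
# A sketch of representing a signal digitally: sample its value at regular intervals.
import numpy as np

fs = 16000                          # sampling rate: 16,000 samples per second
t = np.arange(0, 0.01, 1 / fs)      # discrete sample times covering 10 ms
x = np.sin(2 * np.pi * 100 * t)     # a 100 Hz sine wave, now a list of numbers

print(len(x), "samples represent 0.01 s of signal")   # 160 samples
```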

Airflow from the lungs is the power source for generating a basic source of sound, either at the vocal folds or at a constriction made anywhere in the vocal tract.

This video just has a plain transcript, not time-aligned to the video.

We've seen speech already in the time domain by looking at the waveform.
But how is that speech made?
Well, we need some basic sound source and some way to modify that basic sound source.
The modification, for example, might make one vowel sound different from another vowel sound.
Here we're just going to look at the source of sound, and we'll see two possible sources that can make speech.
Here's someone talking.
He has a vocal tract; that also happens to be useful for breathing and eating, but here we're talking about speaking.
That's just a tube.
For our purposes, it doesn't matter that that's curved.
That's just to fit in our body.
We can think of it as a simple tube, like this.
So here it is, a simplified vocal tract.
At the top here, the lips; at the bottom, the lungs.
The lungs are going to power our sound source.
Airflow from the lungs comes into the vocal tract.
We can block the flow of air with a special piece of our anatomy called the vocal folds.
There they are.
As air keeps flowing from the lungs, the pressure will increase below the vocal folds.
We will get more and more air molecules packed into this tight space.
More tightly packed molecules in the same volume means an increase in pressure.
That's what pressure is: it's the force molecules exert on each other and on their container.
Eventually, the pressure is enough to force its way through the blockage, and the vocal folds burst open.
The higher pressure air from below moves up.
So we get a pulse of higher pressure air bursting through the vocal folds.
That releases the pressure below the vocal folds and they will close again.
Now we have a situation where there is a small region of higher pressure air just here, surrounded by lower pressure air everywhere else.
That's obviously not a stable situation.
This higher pressure air exerts a force on the neighbouring air and a wave of pressure moves up through the vocal tract.
It's important to understand that this wave of pressure is moving at the speed of sound, and that's quite different from the gentle air flow from your lungs: your breathing out.
You don't breathe out at the speed of sound!
Breathing is just the power source for the vocal folds.
The air flow in the vocal tract is much, much slower than the propagation of the pressure wave.
So we can neglect the airflow and just think about this pressure wave moving through air.
A pulse of high pressure has just been released by the vocal folds.
Let's make a measurement of that.
Imagine we could put a microphone just above the vocal folds and measure the pressure there.
The plot might look something like this: an increase in pressure as the pulse escapes, a dip as the pulse moves away, and then a gradual settling back to the ambient pressure.
We've created sound!
Sound is a variation in the pressure of air.
Let's listen to that one pulse - that glottal pulse.
Listen carefully because it's going to be very short.
Just sounds like a click.
Let's do that again.
That's the sound of a glottal pulse created in the glottis.
The glottis is a funny thing.
It's the anatomical name for the gap between the vocal folds.
Of course, if the lungs keep pushing air, the pressure will build up again.
After some short period of time, the vocal folds will burst open again and we'll get another pulse.
That will repeat for as long as the air is being pushed by the lungs.
Remember, the lungs are the power source of the system.
The actual signal will be a repeating sequence of pulses.
I'm going to play this pulse now, not in isolation, but I'm going to play it 100 times per second.
It sounds like this.
Well, it's not speech, but it's a start.
For our purposes, which eventually are going to be to build a model of speech production that we can use for various things, the actual shape of the pulse turns out to be not very important.
Let's try simplifying that down to the simplest possible pulse.
That's this signal here, that is zero everywhere and goes up to a maximum value instantaneously and then back down again.
Let's listen to that.
Again, listen carefully.
It sounds pretty similar to the other pulse, just like a click.
We can play a rapid succession of such clicks.
Let's start with a very slow rate of just 10 per second.
Perceptually that's still just a sequence of individual clicks, so I'll increase the rate now to 40 per second.
I can't quite make out individual clicks now.
It's starting to sound like a continuous sound.
If we go up to 100 per second, it's definitely a continuous buzzing sound.
So, although we're talking about speech production here, we've learned something interesting about speech perception already: that once the rate of these pulses is high enough, we no longer hear individual clicks but we integrate that into a continuous sound.
This pulse train signal is going to be a key building block for us.
It's going to be initially just for understanding speech.
That's what we're doing at the moment.
We're going to use it later actually, as the starting point for generating synthetic speech.
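If you want to recreate the pulse-train demo yourself, here is a minimal sketch, assuming a 16 kHz sampling rate and an illustrative output file name:

```python
# A sketch of the simplest possible pulse train: a single non-zero sample,
# repeated 'rate' times per second (try rate = 10, 40, 100).
import numpy as np
from scipy.io import wavfile

fs = 16000                    # sampling rate in Hz
rate = 100                    # pulses per second
x = np.zeros(fs)              # one second of silence
x[:: fs // rate] = 0.5        # place a pulse every fs/rate samples

wavfile.write("pulse_train.wav", fs, (x * 32767).astype(np.int16))
```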
There are other sources of sound.
We will just cover the second most important one, after voicing.
Again, here airflow from the lungs is the power source.
But this time, instead of completely blocking the flow at the vocal folds (which are at the bottom of the vocal tract) we'll force the airflow through a narrow gap somewhere in the vocal tract.
So let's make that constriction.
Air flows up from the lungs, and it's forced through this narrow gap.
When we force air through a narrow gap, it becomes turbulent.
The airflow becomes chaotic and random, and that means that the air pressure is varying chaotically and randomly.
And since sound is nothing more than pressure variation, that means we've generated sound!
So again, if we put a microphone just after that constriction and recorded that signal created by that chaotic, turbulent airflow, it looks something like this: random and without any discernible structure.
Certainly no repeating pattern.
That signal would sound like this.
It's noise.
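A minimal sketch of this noise source, under the same assumptions as the pulse-train sketch above:

```python
# A sketch of a frication-like source: random samples with no repeating
# pattern, however far you zoom in.
import numpy as np
from scipy.io import wavfile

fs = 16000
noise = np.random.uniform(-0.5, 0.5, fs)   # one second of random pressure variation
wavfile.write("noise.wav", fs, (noise * 32767).astype(np.int16))
```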
Why don't you try making a narrow constriction somewhere in your vocal tract and pushing air through it from your lungs? I wonder how many different sounds you could make that way.
You can change the sound by putting the constriction in a different place.
I'll give you a few to start with.
I'm sure you can come up with many more.
These then are the two principal sources of sound in speech.
On the left, voicing.
That's the regular vibration of the vocal folds.
On the right, frication, which is the sound caused by turbulent airflow at a narrow constriction somewhere in the vocal tract.
There are a few other ways of making sound, but we don't really need them at this point.
These are going to be enough for a model of speech that will be able to generate any speech sound.
We saw everything in the time domain here.
We plotted lots of waveforms.
We've been talking about the sound source, and we now know why a speech waveform sometimes has a repeating pattern.
It's because the sound source itself was repeating.
We call such signals 'periodic', and you'll find that whenever there's a repeating pattern in the waveform, that can only be caused by voicing: the periodic vibration of the vocal folds.
Whenever there is voicing, you will also perceive a pitch.
Perhaps you could call that a musical note or a tone.
Pitch is controlled by the speaker's rate of vibration of the vocal folds.
So we could use pitch to help convey a message in speech.

The vocal folds block air flow from the lungs, burst open under pressure to create a glottal pulse, then rapidly close. This repeats, creating a periodic signal.

This video just has a plain transcript, not time-aligned to the video.

The most important source of sound in speech is the periodic vibration of the vocal folds.
Here's a reminder of the two principal sources of sound when producing speech.
Now let's introduce some engineering terms to describe their general properties.
On the left we have voicing.
That's the phonetic term: voicing means the vibration of the vocal folds and that results in a periodic signal.
Periodic signals are predictable.
You could continue the plot on the left and tell me what happens next.
We call this type of signal 'deterministic'.
On the other hand, frication results in a signal that has no periodicity: it's very unpredictable.
So we could say that that is 'aperiodic' or 'non-periodic'.
(They mean the same thing.)
Aperiodic signals are not predictable.
You cannot guess what happens next in the plot on the right.
So we could use some engineering terms.
Periodic signals are 'deterministic': we know what happens next.
Aperiodic or non-periodic signals are 'stochastic': we don't know what happens next, they're random.
Periodic signals are so important, we're going to take a closer look now.
All periodic signals have a repeating pattern.
In this particular signal, it's really obvious what that repeating pattern is.
We can also see exactly how often it repeats: every 0.01 s or 1/100th of a second.
We can use some notation to denote that.
We use the term T0 to denote the fundamental period.
T0 has a dimension of time and a unit of seconds.
Here we can see what T0 is.
It's the time it takes for this signal to repeat.
Here's another signal.
This is a very special signal called a sine wave.
This one has a fundamental period of 0.1 s or 1/10th of a second.
So if we repeated this cycle 10 times, we'd fill up a duration of exactly 1 s.
Another way of talking about the signal is, instead of saying that the fundamental period is 0.1 s, we can say that it has a fundamental frequency of 10.
We'll use the notation F0 to denote fundamental frequency, and that's just going to be equal to 1/T0.
But what are the units of frequency?
10 is the number of periods per second and the units of time are seconds.
The units of frequency are 1/s (1 over seconds) or, in scientific notation, 'seconds to the minus 1'.
That's a little bit of an awkward unit.
Since frequency is so important, we don't normally write down this unit, which we could say out loud as 'per second'.
We give it its own units of Hertz.
These are all equivalent.
The scientific unit of frequency is Hertz.
But just remember, it always means precisely the same as '1 over seconds' or 'seconds to the minus 1'.
The old fashioned unit of frequency was actually very helpfully called 'cycles per second', but we don't use that anymore.
So what are the fundamental periods and the fundamental frequencies of these signals?
Sit down and work them out while you pause the video.
I hope you remembered to always give the units.
In the top row, we've got 10 cycles in one second.
So that's a fundamental period of 0.1 s and we've got then an F0 of 10 Hz.
I hope you got that right, with the units.
Top right, we've got a fundamental period of 0.01 and that gives us a frequency of 100.
But always write the units!
T0 = 0.01 s and F0 = 100 Hz.
Down on the bottom left, we've got a much higher frequency signal.
That's got a fundamental period of 0.0005 s and then it's got a fundamental frequency of 2000 Hz.
We could now use some more scientific notation because once we're into the thousands we could start saying a multiplier for the Hz.
Instead of writing 2000 Hz, we could write 2 kHz.
Those are the same thing.
Bottom right, it's a little bit trickier.
This is a speech signal, but it's pretty obvious what the fundamental period is here.
We can see a clear repeating pattern, and it's going to be from here to here, and so T0 is pretty close to 0.005 s to give us an F0 of 200 Hz.
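Here are those worked examples as a tiny calculation, using the relation F0 = 1/T0 defined above:

```python
# The four fundamental periods from the examples above, in seconds.
for T0 in (0.1, 0.01, 0.0005, 0.005):
    F0 = 1 / T0
    print(f"T0 = {T0} s  ->  F0 = {F0:g} Hz")   # 10, 100, 2000, 200 Hz
```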
Periodic signals are very important as a sound source in speech.
Thinking about speech perception and about getting a deeper understanding of speech as a means of communicating a message, we'll find that periodic signals are perceived as having a pitch, or a musical tone, and that can be employed by speakers to convey part of the message.
Pitch is part of a collection of other acoustic features that speakers use, which collectively we call prosody.
Thinking on the other hand about signal processing and about getting a deeper understanding of speech signals, for example, so we can make a model of them, we're going to need to move out of the time domain and into the frequency domain, where we'll see that this very special periodic nature in the time domain has an equally special, distinctive property in the frequency domain, and that is harmonics.

Periodic signals are perceived as having a pitch. The physical property of fundamental frequency relates to the perceptual quantity of pitch.

This video just has a plain transcript, not time-aligned to the video.

Periodic signals have a very important perceptual property: pitch.
That means that periodic signals are perceived as having a musical note: a tone.
Here are some signals that are periodic.
They all have a repeating pattern.
And so we predict, just by looking at them, that when we listen to them, there will be a pitch to perceive.
There's the simplified glottal waveform we've seen before.
Well, not very pleasant, but it certainly has a pitch.
Here's a sine wave: that's a very pure, simple sound, again, with a very clearly perceived pitch.
Finally a short clip of a spoken vowel.
I'll play that again.
Again, a clear pitch can be perceived.
Pitch is a perceptual phenomenon.
We need to establish the relationship between the physical signal property of F0 (fundamental frequency) and this perceptual property of pitch.
Let's do that by listening to some sine waves, some pure tones.
I'll play one at 220 Hz and then I'll play one at 440 Hz.
Hopefully, you have a musical enough ear to hear that that's an octave.
There's a clear musical relationship between the two.
The second one is perceived as having twice the pitch of the first.
So let's go up another 220 and see what happens.
No, that's definitely not an octave!
You don't need to be a musician to know that.
So let's go up again.
That sounds like it might be an octave above 440.
So let's listen to octaves.
We've discovered something really important: that the relationship between the physical signal property F0 and the perceptual property pitch is not linear.
To perceive the same interval change in pitch - an octave - we don't need to add a fixed amount to the frequency: we need to double the frequency.
So this relationship between F0 and pitch is actually logarithmic.
It's non-linear.
That non-linearity is one aspect of a much more general property of our auditory system as a whole.
It is, in general, non-linear.
We can probably make use of that knowledge later on.
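A minimal sketch of that logarithmic relationship: an octave is a doubling of frequency, and (a fact from music, not from this video) each octave spans 12 semitones:

```python
import math

def semitones(f1, f2):
    """Perceived pitch interval between two frequencies, in semitones."""
    return 12 * math.log2(f2 / f1)

print(semitones(220, 440))   # 12.0 -> an octave
print(semitones(440, 660))   # ~7.0 -> adding 220 Hz again is not an octave
print(semitones(440, 880))   # 12.0 -> doubling is
```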
So for speech, where the pitch is varying in interesting ways because it might be carrying part of the message, we would need to measure the local value of F0 and then plot how that changes against time.
Now, because there's a very simple relationship between F0 and pitch, you'll find the two terms actually used interchangeably in our field.
But that's not technically correct!
They are not the same thing.
F0 is a physical property: it's the rate of vibration of the vocal folds.
We could measure that if we had access to the speaker's vocal folds.
Or we could estimate it automatically from a signal.
Here's some software that will do that.
It's called Praat.
Other software can also do the same thing.
It will make that measurement of F0 for you.
In fact, Praat calls it pitch, even though it's estimating F0!
But it's very important to remember the software does not have access to the speaker's vocal folds.
It can only estimate F0 from the speech signal, using some algorithm.
That's a non-trivial estimation, so you must always be aware that there will be errors in the output of any F0 estimation algorithm.
This is not truth: this is an estimate.
'Nothing's impossible'
The term pitch really then is about the perceptual phenomenon.
It only exists in the mind of a listener, and so experiments about pitch would have to involve humans listening to speech.
Experiments about F0 could be done on speech signals analytically.
So speakers can control the fundamental frequency as well as the duration and the amplitude of the speech sounds they produce.
They can use all of those acoustic properties - and others - to convey parts of the message to a listener.
We use the term 'prosody' to refer collectively to the fundamental frequency, the duration, and the amplitude of speech sounds (sometimes also voice quality).
Later, then, when we attempt to generate synthetic speech, we'll have to give it an appropriate prosody if we want it to sound natural.
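If you want to try F0 estimation from Python rather than in the Praat GUI, here is a minimal sketch assuming the Parselmouth library (a Python interface to Praat) is installed and a hypothetical file voice.wav exists:

```python
# A sketch of estimating F0 with Praat's algorithm, via Parselmouth.
# Remember: this is an estimate from the signal, not ground truth.
import parselmouth

snd = parselmouth.Sound("voice.wav")
pitch = snd.to_pitch()                      # Praat calls the F0 track "pitch"
f0 = pitch.selected_array["frequency"]      # F0 in Hz; 0 where no voicing found
times = pitch.xs()                          # analysis frame times, in seconds
```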

The shape of the vocal tract results in different frequencies getting boosted or dampened.

Video on youtube

Spectrograms display the spectrum of frequencies in a recording over time. This sort of frequency representation is the basis of speech technologies such as automatic transcription and speech synthesis.

Video on youtube
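As a preview of module 3, here is a minimal sketch of computing a spectrogram in Python, assuming a hypothetical mono recording voice.wav:

```python
# A sketch of a spectrogram: the spectrum of frequencies over time.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("voice.wav")
f, t, Sxx = spectrogram(x, fs)              # frequency bins, frame times, power

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))   # power in decibels
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.show()
```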

More detail on how spectrograms map to vowel articulations.

Video on youtube

More detail on how spectrograms map to consonant articulations. Learning this type of mapping is how we do speech recognition and synthesis!

Video on youtube

A worked example of spectrogram reading, i.e. speech recognition. This sort of spectrogram reading exercise would be a "stretch" question in this speech processing course, but it's a foundational part of most phonetics courses. We'll focus on getting computers to do it instead!

Video on youtube

Readings for module 2 focus on the acoustic properties of consonants and vowels.

Reading

Ladefoged & Johnson – A course in phonetics – Chapter 2 – Phonology and Phonetic Transcription

Basics of phonology and phonetic transcription. Read this over Speech Processing modules 1 and 2.

Ladefoged & Johnson – A course in phonetics – Chapter 8 – Acoustic phonetics

Links the source-filter model to spectrograms and acoustic analysis of speech.

Peterson & Barney – Control Methods Used in a Study of the Vowels

Examines the production and perception of vowels. This is a classic paper that many other studies have built on.

Exploring Speech Acoustics

In the lab for module 2, you will continue to explore speech acoustics through visualisations in Praat.

You can find the lab instructions here: phon_lab_2

If you have taken LEL2b or a similar intro to phonetics course, you may find this material very familiar. If it is and you haven’t done much maths recently, it may be a good opportunity to spend some time on maths revision (see notes in the Module 1 lab tab). Or you could just get ahead with the readings/videos for upcoming modules!

Lab Answers and Commentary

That completes our main modules on articulatory and acoustic phonetics. You should now have a basic understanding of how vowels and consonants are produced in terms of the vocal tract and its articulators. You should also have seen that we can “see” evidence of these articulations in speech acoustics, as represented in a spectrogram. These acoustic cues can be used to “read” spectrograms, i.e. to tell what someone has said just by looking at a spectrogram. This is in essence what automated transcription systems attempt to do! So, it’s important to know which acoustic properties of the speech waveform are important for identifying what has been said for speech recognition. For speech generation, we want to make sure we generate the right acoustic features so that the waveform is understood as speech.

The next two modules will look at aspects of acoustic phonetics from more of an engineering point of view. We’ll come back to more phonetics issues in later weeks, as we learn more about TTS and ASR. In particular, we’ll look at the source-filter model from both theoretical and engineering points of view.

Making connections between the phonetics material and the speech technologies we’ll look at in the coming weeks will help you be an active learner. Just now, you probably have an understanding of issues in phonetics that will feed into how we design speech technologies, but only a vague idea of the ‘big picture’: the ideas may not yet be well-organised in your mind. Keep connecting and organising, and you’ll find that it does all join together.

What you should know from Module 2

Note: we’ll continue to discuss a lot of the ideas around the frequency domain, resonance and the source-filter model in modules 3 and 4.

What does a speech waveform (i.e. in the time-domain) represent?

  • Time versus amplitude graphs
  • Oscillation cycle
  • Period T and wavelength λ (we’ll revisit this in the next few modules)
  • Frequency (F=1/T)
  • What are “Hertz”?
  • How to calculate the frequency of a waveform by measuring pitch periods (example in the “waveform” video)

Types of waveform:

  • Simple versus complex waves
  • Periodic versus aperiodic waves
  • Continuous versus transient waves
  • Fundamental Period (T0)
  • Fundamental frequency (F0)

Spectrum:

  •  The spectrum as a representation of waveform frequency components
  •  What is the spectral envelope?
  •  Why do we consider F0 and harmonics to be “source” characteristics?
  •  What is the relationship between formants and resonance?
  •  F0 is not a formant!

Spectrogram:

  •  What do the x and y axes of a spectrogram represent (e.g. in Praat)?

Acoustics of Vowels:

  •  What is the general relationship between formants (acoustics) and tongue position (articulation):
    •  F1 and vowel height
    •  F2 and vowel frontness
  •  Acoustic vowel space:
    • You don’t need to know the specific formants associated with different vowels, but if you understand the relationship between height/frontness and formants you should be able to deduce this!

 Acoustics of Consonants:

  •  What does voicing look like on a spectrogram or spectrum?
  •  Identify basic acoustic characteristics (i.e. on a spectrogram) of consonant manners: plosives (i.e., stops), aspiration (for plosives), fricatives, nasals, approximants.
  • Clues for place of articulation: stops, fricatives

Vowel space

  • How can you interpret an F1 vs F2 plot of vowel measurements? (See the sketch after this list.)
  • Relate this to vowel characteristics/the IPA vowel chart (e.g. tongue height and frontness)
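Here is a minimal sketch of such an F1 vs F2 plot, using rough average formant values for three corner vowels in the spirit of Peterson & Barney; treat the numbers as illustrative, not definitive:

```python
import matplotlib.pyplot as plt

# Approximate (F1, F2) averages in Hz for adult male speakers.
vowels = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}

for v, (f1, f2) in vowels.items():
    plt.plot(f2, f1, "o")
    plt.annotate(v, (f2, f1))

plt.gca().invert_xaxis()    # front vowels (high F2) on the left...
plt.gca().invert_yaxis()    # ...close/high vowels (low F1) at the top, like the IPA chart
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.show()
```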

Key Terms

  • waveform
  • amplitude
  • sine wave
  • period
  • frequency
  • wavelength
  • Hertz
  • fundamental frequency
  • harmonics
  • spectrum
  • spectrogram
  • spectral envelope
  • Fourier transform
  • formant
  • vowel height
  • vowel frontness
  • open vowel
  • close vowel
  • plosive
  • voicing
  • voice onset time
  • aspiration
