Module 2 – Acoustics of Consonants and Vowels

We can analyze differences in the articulation of vowels and consonants in terms of acoustic phonetic features.

The module 2 Thursday lecture (to be held in week 2, 2023-24) is cancelled due to the UCU strike. You can still watch the videos and do the readings!

In module 2, we’ll look at specific patterns relating to consonants and vowels, and apply these patterns to the task of segmenting, annotating and extracting measures from various types of speech sounds. As you start to recognize the patterns of speech acoustics, keep thinking about the link between what you see in a visualisation of speech acoustics (e.g., a spectrogram) and what is going on with the physical articulators when people speak. If you’re starting to be able to recognize the acoustics of specific consonants and vowels, start thinking about how you might automate that process. What sort of phonetic transcription would be helpful for this?

This week you should try to watch the videos (in the ‘Videos’ tab for this module) before the lecture on Thursday. You can bring your questions to the lecture, or post them on the speech.zone forum.

Total video to watch in this section: 54 minutes. Videos for the following are currently missing transcripts, but you can find the scripts in the PDFs below.

We’re in the process of getting the slides added for all of the module 1 and 2 videos!

The waveform and a definition of the fundamental period. Note: there's a small typo at 5:50: the text should read T=0.08s (matching the voice-over) rather than T=0.8s.
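The relationship between the fundamental period and F0 is simply the reciprocal; here is a tiny Python sketch using the corrected value from the caption:

```python
def f0_from_period(t_seconds):
    """Fundamental frequency (Hz) is the reciprocal of the fundamental period (s)."""
    return 1.0 / t_seconds

# With the corrected value T = 0.08 s from the video:
print(f0_from_period(0.08))  # 12.5 (Hz)
```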
Simple, complex, periodic, aperiodic, transient, and continuous waveforms.

This video just has a plain transcript, not time-aligned to the video.

As we have already seen, waves can be either simple or complex. A simple wave can be described by the mathematical sine function, showing a simple oscillation. A complex wave is made up of at least two sine waves added together.
Waves may differ in their complexity, while sharing a fundamental frequency. For example, both the complex red wave and the simple blue wave have an F0 of 100 Hz.
The specific shape of a complex wave may not affect our perception of the pitch of that sound, but it does affect the overall quality.
Each of the waves here differ in their complexity, but all have the same F0.
While simple waves are always periodic, meaning that they repeat at regular intervals, complex waves may be either periodic or aperiodic.
Complex periodic waves have a repeating pattern, and we can use that pattern to calculate the fundamental frequency.
Complex aperiodic waves do not have any pattern to their oscillations, and therefore will not have an F0.
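The idea that a complex periodic wave is a sum of sine waves sharing a fundamental period can be sketched in a few lines of Python; the harmonic amplitudes below are arbitrary, chosen only for illustration:

```python
import numpy as np

fs = 16000                       # sample rate (Hz)
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of time
f0 = 100                         # shared fundamental frequency (Hz)

# A simple wave: a single sine at F0.
simple = np.sin(2 * np.pi * f0 * t)

# A complex wave: the same F0 plus harmonics at 2*F0 and 3*F0.
complex_wave = (np.sin(2 * np.pi * f0 * t)
                + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
                + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# Both waves repeat every 1/F0 seconds, i.e. every fs//f0 samples.
period_samples = fs // f0
print(period_samples)  # 160
```

Shifting either wave by one fundamental period leaves it unchanged, which is exactly what makes an F0 well defined for complex periodic waves.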
Complex waves may also be either continuous or transient.
Transient waves are always aperiodic, while continuous waves may be either periodic or aperiodic.
Here we have examples of continuous and transient waves, both in speech and non-speech. In the upper left, we have a continuous periodic wave representing the continuous oscillation of a vowel.
In the lower left, we have a continuous aperiodic wave representing white noise.
In the upper right we have a transient aperiodic wave representing the strike of a hammer.
And in the bottom right, we have a transient aperiodic wave representing the burst of an unaspirated [t].
We can use these concepts of periodic, aperiodic, transient and continuous to describe major classes of speech sounds in the waveform.
For example, this waveform of a man saying “scan it” is made up of an aperiodic continuous sound, followed by a transient sound, followed by three periodic continuous sounds with different complexities, and finished by another (low amplitude) transient sound.
These various sound waves correspond to major classes of speech sounds, namely fricative, plosive, vowel, and nasal. As a result, the waveform of “Scan it!” is visually similar to the waveform of “Spoon up!”, which is made of the same major classes in the same order.
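One simple way to automate the periodic/aperiodic distinction used above is the normalized autocorrelation: a periodic (vowel-like) wave correlates strongly with itself one fundamental period later, while noise does not. A rough sketch, with synthetic signals standing in for real speech:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
periodic = np.sin(2 * np.pi * 100 * t)    # vowel-like periodic wave
aperiodic = rng.standard_normal(t.size)   # fricative-like white noise

def periodicity(x, fs, f0_min=50, f0_max=400):
    """Peak of the normalized autocorrelation over plausible F0 lags."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[x.size - 1:]
    ac = ac / ac[0]
    lo, hi = fs // f0_max, fs // f0_min
    return ac[lo:hi].max()

print(periodicity(periodic, fs))   # high: a repeating pattern is present
print(periodicity(aperiodic, fs))  # low: no repeating pattern
```

A real segmenter would apply this measure in short windows across the utterance, which is roughly how pitch trackers decide which stretches are voiced.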

The spectrum, its spectral envelope, and harmonics.
A 3-dimensional figure plotting amount of energy against both frequency and time.
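A spectrogram of this kind is computed by taking short-time spectra of successive windows of the waveform. Here is a minimal short-time Fourier transform sketch in plain NumPy; the window and hop sizes are arbitrary illustrative choices:

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.5, 1 / fs)
# Test signal: a 100 Hz tone for the first half, a 1000 Hz tone for the second.
x = np.where(t < 0.25, np.sin(2 * np.pi * 100 * t), np.sin(2 * np.pi * 1000 * t))

win, hop = 256, 128
frames = [x[i:i + win] * np.hanning(win) for i in range(0, x.size - win, hop)]
S = np.abs(np.fft.rfft(frames, axis=1))  # rows = time frames, cols = frequency bins
freqs = np.fft.rfftfreq(win, 1 / fs)

# The frequency bin with the most energy shifts over time —
# exactly what the dark bands of a spectrogram display.
early = freqs[S[0].argmax()]
late = freqs[S[-1].argmax()]
print(early, late)
```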
The waveforms, spectra and spectrograms of vowels.

This video just has a plain transcript, not time-aligned to the video.

In this video we will consider the waveforms, spectra and spectrograms of vowels.
We have seen that the component frequencies and amplitudes of complex waves are revealed by the spectrum, with harmonics shown as individual peaks, and formants as ranges of amplified and dampened frequencies. These spectral differences result in different qualities that we can hear.
In the case of vowels, there is still more that we can learn from the spectrum and by extension from the spectrogram.
The first and second formants of the vocal tract correlate with perceptions of vowel height and advancement, with particular vowels having characteristic spectral shapes.
Here we see spectra of the vowels [a] [i] and [u] with the first and second formants highlighted.
In [a] F1 and F2 are quite close together, while in [i] they are rather far apart. In [u] F1 and F2 are again close together, but lower in the frequency range than we saw for [a].
The following slides will illustrate the vowel formant patterns in American English vowels, pointing out some generalizations along the way.
These patterns are not just happenstance. The formant positions in the spectrum are the result of the positions of the lips and tongue during vowel articulation.
The relative position of the first and second formants result in the different vowel qualities that we hear.
First, we will look at the relationship between formants and the vowel dimension of height. Here we see spectrograms of 8 American English vowels: [i, ɪ, ɛ, æ, ɑ, ɔ, ʊ, u].
Remember that we can describe these vowels using their IPA symbols according to height, advancement and rounding. For example, the [u] vowel is high back and rounded. We will soon see that the first two formants are related to the Height and Advancement of the vowels and we will describe this relationship.
The formants of these vowels are visible as dark horizontal bands at various frequency ranges in the spectrogram.
The first three formants are indicated by arrows on the left hand side of each spectrogram.
If we consider first the high vowels and look at the first formant, we can see that the first formant is at quite a low frequency for each of these four vowels. (There does not seem to be much similarity across the F2 frequencies of high vowels.) In fact, it turns out that the lower the vowel, the higher the first formant will be in the spectrogram.
This relationship between vowel height and the location of the first formant becomes more clear when we compare the lower vowels.
[ɛ, æ, ɑ, ɔ] are all either open or open-mid. Across these four open vowels, the F1 is again rather similar, but if we compare them with the high vowels, we see that F1 is higher in the low vowels and lower in the high vowels.
The first formant is therefore inversely related to vowel height. Close/high vowels have lower F1, while open/low vowels have higher F1.
Now let’s look at advancement, or how far front or back a vowel is.
The top row of vowels all share the phonetic property FRONT, while the bottom row all share the property BACK.
We have already seen that the first formant is related to vowel height. Now let’s consider the second formant.
If we compare the second formant across the front vowels we see that they are all fairly similar to each other in frequency. By comparison, the back vowels have lower F2 values than the front vowels as a whole.
Therefore, the second formant is directly related to vowel frontness. The more front a vowel is, the higher F2 will be. The more back a vowel is, the lower F2.
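These two generalizations — F1 inversely related to height, F2 directly related to frontness — could be turned into a crude automatic vowel description. The frequency thresholds in this sketch are illustrative guesses, not values from the lecture:

```python
def describe_vowel(f1, f2):
    """Rough height/advancement labels from F1 and F2 in Hz.
    The 450 Hz and 1500 Hz thresholds are illustrative guesses."""
    height = "close" if f1 < 450 else "open"        # F1 inversely related to height
    advancement = "front" if f2 > 1500 else "back"  # F2 directly related to frontness
    return height, advancement

print(describe_vowel(300, 2300))  # ('close', 'front'), like [i]
print(describe_vowel(750, 1100))  # ('open', 'back'), like [ɑ]
```

Real classifiers need speaker normalization, since absolute formant values vary considerably between speakers, as discussed later in this module.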
We can also consider the relationship between the first and second formants in relation to vowel height. If we look at F1 and F2 of the front vowels, we can see that there is typically a bit of the frequency range that is low in amplitude between the first and second formants. By comparison, the back vowels have first and second formants that are very close together.
This figure summarizes the formant values from the preceding spectrograms in a schematic way to allow for ease of comparison. In this form, we can see that the close vowels have lower F1 than the open vowels. We can also see that the front vowels on the left have higher F2 than the back vowels on the right.
It is also interesting to note that the back vowels all have similar F2 values, suggesting they are roughly equally far back, while the front vowels show decreasing F2 for more open vowels.
This is due to the position of the jaw during the production of these more open vowels, causing them to be produced slightly further back than the close vowels.
Phoneticians use the vowel formant patterns described above to visualize the acoustic relationships between vowels by plotting the first and second formants against each other.
This sort of plot is known as the vowel space. Here we see a plot of the 8 vowels that we have been looking at so far.
This plot puts F1 on the y-axis and F2 on the x-axis, in what is mathematically a rather unusual arrangement, but which results in a figure that looks rather strikingly like the IPA vowel chart.
This view highlights the relationship between the first formant and vowel height, and the second formant and vowel advancement.
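A sketch of how such a vowel-space plot is built: F2 on the x-axis, F1 on the y-axis, with both axes inverted so the layout resembles the IPA chart. The formant values here are rough illustrative figures, not measurements from the lecture:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Rough illustrative (F1, F2) values in Hz — not measured data.
vowels = {"i": (300, 2300), "u": (320, 800), "æ": (700, 1700), "ɑ": (750, 1100)}

fig, ax = plt.subplots()
for symbol, (f1, f2) in vowels.items():
    ax.annotate(symbol, (f2, f1))
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
# Inverting both axes puts high front vowels at the top left,
# as in the IPA vowel quadrilateral.
ax.set_xlim(2500, 500)
ax.set_ylim(900, 200)
```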

Consonants are a diverse set of speech sounds ranging from vowel-like approximants to complete closure of the vocal tract with silence.

This video just has a plain transcript, not time-aligned to the video.

Consonants are a diverse set of speech sounds ranging from vowel-like approximants to complete closure of the vocal tract with silence. By utilizing the detail available to us in spectrographic displays of speech sounds, we are able to categorize and describe speech with far more accuracy than we could with waveforms alone. This video will present the acoustic characteristics of the parameters that define consonant sounds (voice, place, and manner).
We’ll begin with voicing. All speech sounds can be categorized as voiced or voiceless. The presence or absence of voicing is visible in both the waveform and the spectrogram, though it has a few different appearances.
For some sounds, the typical production is voiced. Examples of these sounds are vowels, nasals, and approximants. Together we can refer to them as sonorants, although sometimes the vowels are left out of this category. In sonorants, voicing is apparent in the periodic structure of the waveform, vertical striations in the spectrogram, or as clearly defined harmonics in the spectrum. In the waveform of “scan it”, for example, we can see that voicing is present from the start of the [a] vowel through the end of the [ɪ] vowel. The particulars of the waveform change with each phone, but the presence of voicing is continuously evident throughout these three sonorant sounds.
We can also see voicing in the spectrogram and spectrum. In the spectrogram, voicing is apparent as vertical striations at more or less regularly spaced intervals. These striations may be closer together or farther apart, depending on the fundamental frequency. Closer striations indicate a higher F0, and wider striations indicate a lower F0.
If we look at the spectrum, we can see the harmonic structure of the voiced wave. Here again we have an indication of F0. When F0 is low, the harmonics are closely spaced. If the F0 is high, the harmonics will be spread further apart.
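In other words, F0 can be estimated directly from the spacing of the striations (glottal pulses). A small sketch, using made-up pulse times of the sort Praat can report for a voiced stretch:

```python
import numpy as np

# Hypothetical glottal-pulse (striation) times in seconds — illustrative only.
pulse_times = np.array([0.010, 0.018, 0.026, 0.034, 0.042])

periods = np.diff(pulse_times)   # spacing between successive striations
f0 = 1.0 / periods.mean()        # closer spacing -> shorter period -> higher F0
print(round(f0, 1))  # 125.0 (Hz)
```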
Other sounds, such as plosives and fricatives, may be produced either with voicing or without it. These sounds are called obstruents and have a different appearance from the sonorants. Here we have a broadband spectrogram of a voiced sound between two vowels. In this case, voicing appears in the spectrogram as shading in the low frequency range at the bottom of the spectrogram. This is called the voice bar and is present when the vocal folds are vibrating.
We’ll now move on to manners of articulation, considering their appearance in acoustic representations. We’ll start with plosives, which are perhaps the most straightforward to identify in both the waveform and the spectrogram.
Here we have a waveform and spectrogram of a voiceless plosive. Like all stops, it is made up of two parts: a closure, reflecting the constriction of the vocal tract, and a noise burst reflecting release of that constriction.
Here we can see a region of low energy in the spectrogram. This is an indication of closure and is also visible in the waveform as a region of low or zero amplitude. Here we can also see that this is a voiceless stop due to the lack of voice bar near the end of closure. At the release of the closure we can see a spike in the waveform and a vertical band of energy across the entire frequency range in the spectrogram.
Voiced plosives are similar to voiceless ones in that they involve a closure of the vocal tract; however, we can see that they are voiced due to the presence of the voice bar during that closure. We also notice that the burst release tends to be less clearly visible in the spectrogram than it was in the voiceless stop.
Voiceless stops are sometimes aspirated, that is, accompanied by a strong puff of air after the release of the closure. This aspirated release is visible in both the waveform and spectrogram. In aspirated stops, the release portion tends to be longer and stronger than in either voiced or voiceless plosives. The release burst is also accompanied by a bit of turbulent noise -- this is the “aspiration” that we speak of. Here again we see virtually no activity in the waveform or the spectrogram during the stop closure, and a strong burst release accompanied by aspiration.
Fricatives are sounds that are produced with frication, or turbulent airflow. In general, this turbulent airflow generates high frequency noise. Fricatives are often very loud, or high amplitude, and their energy will be dispersed over a broad frequency range in the spectrogram.
Here we can see the voiceless labiodental fricative [f], voiceless interdental fricative [θ], voiceless alveolar fricative [s], and voiceless postalveolar fricative [ʃ] preceding vowels. In each case, we can see some diffuse noise spread across the frequency range, indicating frication. However, we can also see that not all fricatives are the same. The amplitude of the noise in [s] and [ʃ] is much higher than that of [f] and [θ]. This makes them more salient to the ear and easier to hear. In voiced fricatives the high frequency noise is sometimes less apparent, and we can also see evidence of voicing as striations in the spectrogram or sometimes periodicity in the waveform.
Nasals are sounds that are produced similarly to stops, in that the mouth is closed at some place of articulation, but air is allowed to pass through and resonate in the nasal cavity at the same time. As voiced sounds, nasals will feature the striations indicative of vocal fold vibration. Here we can see the vertical lines in the spectrogram indicating voicing. Nasals will also have low energy in the spectrogram compared to the surrounding vowels. Here we see spectrograms of three nasals: bilabial [m], alveolar [n], and velar [ŋ]. In each case we can see that the amplitude drops off sharply after the vowel as soon as the nasal closure begins.
Approximants are sounds that are acoustically very similar to vowels. They are produced with visible voicing and formants in the spectrogram; however, they typically have lower amplitude than vowels, which is visible in the waveform and in the shading of the spectrogram. They will also be apparent from their continuous transitions from approximant to vowel, and may look similar to diphthongs.
Here we have an example of a labiovelar approximant [w], and we can see that the formants start off in a very low position and then transition steeply into the vowel [ɛ]. This steep rise from the approximant into the vowel is quite typical, though sometimes the transition can be more gradual. An example of this is visible on the right where we see the word yell. Here we have a palatal approximant transitioning smoothly from an articulation quite like [i] to that of [ɛ].
The alveolar approximants [l] and [ɹ] are often difficult to identify in a spectrogram. They will be characterized by lower amplitude than the adjacent vowels, and most of their energy will be low in the spectrum, that is, in the low frequency range. In some cases there may be an abrupt boundary between the alveolar lateral approximant [l] and the adjacent vowels, much like a nasal, though this is not always reliable. In the alveolar approximant [ɹ] we often see a steep rise of the third formant out of the approximant into the vowel. When the alveolar approximant [ɹ] occurs at the end of a word, we often get something called “r-coloring”, which affects the formant structure of the vowel but does not appear as a distinct segment in the spectrogram.
The acoustic signal also gives us clues to the place of articulation of stops and fricatives. In plosives, both the release burst and the vowel formant transitions will offer indications of the place of the stop closure. Here we have spectrograms of bilabial, alveolar, and velar stops. In each case the plosives are followed by the same vowel [a], and we can see changes in the formant structure as a result of the stop closure. In the bilabial plosive the first three formants all start at lower frequencies than would be expected for the vowel quality itself. They will rise out of the stop closure until they reach their steady state.
In the alveolar stop, the second and third formants remain steady. This is a change in structure from the bilabial stop, though it is subtle and sometimes may be difficult to spot.
Velar stops are often characterized by formant movement that brings the second and third formants together. The second formant will be quite high and the third formant may move down to meet it. This formant structure is known as the “velar pinch” and is a dead giveaway for a velar closure if you see it.
We can also distinguish place of articulation in fricatives based on the acoustic information. A common way to do this is to identify where the frication noise is concentrated in the frequency range. The alveolar fricative [s] tends to have energy concentrated between five and ten thousand hertz and has a very high amplitude. The postalveolar fricative [ʃ] will tend to have its energy concentrated between three and five thousand hertz and also has rather high amplitude. In contrast, the labiodental fricative [f] has very weak energy, or low amplitude, and this energy is centered between three and four thousand hertz. The interdental fricative [θ] similarly has weak energy, but its energy concentration is around 8000 hertz. We can also sometimes see vowel formant transitions in relation to fricatives, though these may be less reliable than in stops.
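One common way to quantify where frication noise is concentrated is the spectral centre of gravity (centroid). The sketch below uses synthetic band-limited noise in place of real fricatives; the band edges follow the ranges mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 22050
n = 4096

def bandpass_noise(lo, hi):
    """Synthesize noise whose energy lies between lo and hi Hz (crude FFT method)."""
    freqs = np.fft.rfftfreq(n, 1 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    spec = np.zeros(freqs.size, dtype=complex)
    spec[band] = rng.standard_normal(band.sum()) + 1j * rng.standard_normal(band.sum())
    return np.fft.irfft(spec, n)

def centroid(x):
    """Spectral centre of gravity in Hz: magnitude-weighted mean frequency."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, 1 / fs)
    return (freqs * mag).sum() / mag.sum()

s_like = bandpass_noise(5000, 10000)  # [s]-like: energy concentrated 5–10 kHz
sh_like = bandpass_noise(3000, 5000)  # [ʃ]-like: energy concentrated 3–5 kHz
print(centroid(s_like) > centroid(sh_like))  # True
```

The higher centroid of the [s]-like noise mirrors the contrast between alveolar and postalveolar fricatives described above.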

The terms “voiced” and “voiceless” are used to indicate whether the vocal folds are vibrating, but these terms do not tell us much about when those vibrations occur, relative to other events.

This video just has a plain transcript, not time-aligned to the video.

Voicing is the result of vocal fold vibration. In previous videos we have seen how voicing appears in waveform, spectrum and spectrogram, and we know that it is independent of place and manner in consonant production. We use the terms “voiced” and “voiceless” to indicate whether the vocal folds are vibrating, but these terms do not tell us much about when those vibrations occur, relative to other events.
Consider for a moment a phrase like “come and get it”. When we examine the waveform and spectrogram, we can see that voicing starts and ends at various places throughout the phrase. Sometimes, voicing persists throughout a number of phones, across both consonants and vowels, as we see here. Without even listening to the audio, we can see that voicing is present by observing the voice bar here at the bottom of the spectrogram, and also by observing the periodic structure of the waveform.
At other times, voicing alternates from one phone to the next, as we see at the start of this phrase where a voiceless stop is followed by voicing in a vowel. (The opposite pattern appears at the end of the phrase, where the vowel is followed by a voiceless stop.)
If you look closely, you might also notice that there is a very brief interval of time where voicing stops during the closure of the [g] sound. If we remember that in the IPA the [g] symbol stands for a voiced velar stop, this becomes very curious indeed. Why is a voiced sound produced without voicing?
As it turns out, the alignment of voicing with oral closure during stops varies across languages, and we describe this alignment by referring to Voice Onset Time.
Voice Onset Time is typically used only for stops/plosives, so let’s briefly take a moment to consider why this is.
First, recall that some sounds are typically voiced:
- Sonorants
While others come in pairs of either voiced or voiceless sounds
- Obstruents
Voicing depends on air being able to flow through the larynx in order to set the vocal folds in motion. Sonorants are typically voiced continuously throughout the entire duration because the vocal tract is open, allowing air to flow freely.
Try this: hum a bilabial nasal [m]; how long can you keep it going? — Forever!
In obstruents, on the other hand, there is constriction in the vocal tract, which impedes or obstructs airflow. This will have implications for how and when voicing can occur.
In fricatives, this doesn’t really get in the way of voicing too much (although there are aerodynamic tradeoffs in order for both voicing and frication to happen at the same time. We won’t get into those here.)
Try this: how long can you sustain a [v] sound? – quite a long time!
When we consider stops, things start to get a bit more interesting. Since stops involve complete closure of the vocal tract, there is a limit to how much air can flow in order to create voicing.
Try this: how long can you sustain voicing in a voiced bilabial stop [b]? (what happens when you try to keep voicing going longer?) (Sidenote: since voiceless stops don’t involve voicing, you should be able to hold that closure as long as you can hold your breath)
So, we can now see that there are physical limitations to how long voicing can overlap with oral stop closures. While voicing *could* begin and end at any time during sonorants or fricatives, it tends to persist throughout those sounds. In stops, however, the timing of voicing relative to stop closure and release is variable.
Phoneticians describe voice onset time (VOT) in plosives relative to the release burst. This is analogous to a number line, where the burst is located at zero. Voicing before the burst is measured in negative numbers, while voicing that begins after the burst is measured in positive numbers. Note that VOT (like most durations in speech) is typically reported in milliseconds.
As a result of this style of measurement, there are three types of VOT:
pre-voicing, or voicing lead
zero voicing, or short voicing lag
and post-voicing, or long voicing lag
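Measuring and categorizing VOT from annotated burst and voicing-onset times can be sketched as follows. The 25 ms short-lag cutoff and the annotation times are illustrative assumptions, not values given in the video:

```python
def vot_ms(burst_time_s, voicing_onset_s):
    """Voice onset time: voicing onset relative to the release burst, in ms."""
    return (voicing_onset_s - burst_time_s) * 1000.0

def vot_category(vot):
    """Illustrative thresholds: the ~25 ms short-lag cutoff varies across studies."""
    if vot < 0:
        return "voicing lead (pre-voiced)"
    if vot < 25:
        return "short lag (zero VOT)"
    return "long lag (aspirated)"

# Hypothetical annotation times echoing the three example magnitudes below:
print(vot_category(vot_ms(0.500, 0.342)))  # pre-voiced, VOT = -158 ms
print(vot_category(vot_ms(0.500, 0.513)))  # short lag, VOT = +13 ms
print(vot_category(vot_ms(0.500, 0.562)))  # long lag, VOT = +62 ms
```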
First let’s consider the case of prevoiced stops. Stops that are produced with prevoicing, or negative VOT, will show evidence of voicing during the oral closure, followed by a release burst.
This spectrogram shows an example of a voiced bilabial stop [b], produced between two vowels. Here we can see that voicing continues throughout the stop closure, which is shaded in gray. Voicing is evident in both the waveform, where periodic oscillations are present, as well as in the spectrogram where we can see a voice bar and vertical striations indicative of glottal pulses. The duration of voicing prior to the release of oral closure is 158 ms, which we report as a negative voice onset time of -158 ms.
Stops produced with zero voice onset time have voicing that begins simultaneously (or nearly simultaneously) with the release of oral closure.
This spectrogram shows an example of a voiceless bilabial stop [p], produced between two vowels. We can see the closure of the stop both in the waveform, where the signal is flat, and in the spectrogram, where there is no shading anywhere in the frequency range. The release burst is shaded in gray, and we can see that the burst duration is short and voicing begins immediately after that release. We often refer to this type of stop release as having “zero” VOT, but in fact it often involves a very short lag of a few milliseconds. In this case the lag lasts for 13 ms after the initial release burst.
The third type of VOT is post-voicing, also called long-voicing lag or positive VOT.
Stops that are produced with positive VOT will typically have no evidence of voicing during oral closure, and the release burst will be followed by an interval of aspiration, or turbulent noise resembling frication.
This spectrogram shows an example of an aspirated voiceless bilabial stop [pʰ] produced between two vowels. Here again we can see that the stop closure is voiceless by examining the waveform and spectrogram, though you may notice that voicing does not end immediately when the closure begins. This is known as “residual voicing” and is quite common, even in voiceless stops. In this case, the burst release is followed by a bit of noise, which often appears in stops with long-lag VOT. We call this noise aspiration. If we look at the waveform and spectrogram here, we can see that this closely resembles a fricative, and indeed aspiration noise is a type of frication. Because voicing begins sometime after the release of the oral closure, we report the 62 ms of lag as a positive number.
So far we have seen how voicing may align with the closure and release phases of stops. Now we will think a bit about how this aligns with the IPA. The IPA is a system of phonetic transcription based on articulatory parameters, but the precise alignment to articulatory (and acoustic) events is generally not specified. This is in part because one of the main goals of the IPA is to capture linguistically relevant contrasts in the sound system of a language – not to faithfully represent the particulars of any one production of speech.
In fact, studies have shown that not all voiceless sounds are voiceless in the same way. We might think, for example, that all voiceless unaspirated stops have the same VOT values. Perhaps we might expect them all to have zero (or small positive) VOT of roughly the same magnitude. However, place of articulation actually has an effect on VOT, with bilabial sounds having the shortest VOT, followed by alveolars, then by velars.
Languages may also differ as to how they maintain voicing contrasts in their sound systems, and linguists often use symbols in a confusing way when describing those contrasts. For example, both Spanish and English are said to have voiced and voiceless stops, which we transcribe using the appropriate IPA symbols for such sounds.
However, if we look at the acoustic productions of these sounds, we see that voiced stops in English have zero VOT, while voiceless stops have positive VOT (and aspiration). In Spanish, voiced stops are pre-voiced, while voiceless stops have zero VOT and are unaspirated. Nevertheless, we use the [b] symbol to represent both the English zero VOT ‘b’ sound as well as the negative VOT ‘b’ of Spanish.
Furthermore, some languages even have more than 2 voicing contrasts, adding complexity to the question of how to represent such productions with a phonetic transcription system.
For example, Thai maintains 3 voicing categories: voiced, voiceless unaspirated and voiceless aspirated, while
Hindi maintains 4: voiced, voiced aspirated, voiceless unaspirated, and voiceless aspirated.
So, despite using the same terminology to identify voiced and voiceless sounds, languages can and do differ with respect to how they align voicing with stop closure, and these differences may not always be apparent from phonetic transcriptions alone.

Variability in the acoustic vowel space, and its relationship to inventories of contrastive vowel sounds in languages.

This video just has a plain transcript, not time-aligned to the video.

In previous videos we have seen that the first and second formants of the vocal tract are related to our perceptions of vowel quality, and the source-filter model provides us a way to understand the origins of these formants and why they change with the changing shape of the vocal tract. In this video, we will consider variability in the acoustic vowel space, as well as its relationship to inventories of contrastive vowel sounds in languages.
We’ve already seen that plotting the first and second formants against one another results in a figure that looks quite similar to the vowel quadrilateral of the IPA, though the correspondence is not perfect.
In the plot on the right, formants from one token of each vowel quality were used, giving the impression of a clean acoustic space with neatly delimited vowel categories. Unfortunately, natural data from spontaneous speech and multiple speakers is never this straightforward.
The data shown here are taken from the classic study of American English vowels by Peterson and Barney in 1952. In this case, the axes are oriented in a more mathematically plausible way, with the x-axis representing F1, and the y-axis representing F2, and the origin in the lower left.
If we think of the IPA vowel chart as aligning with a speaker who is facing to our left,
then the present plot is analogous to a speaker who is lying on their back with their right side facing us.
This F1-F2 plot shows data from 76 speakers producing ten different vowels. As we can see, there is considerable variability in the productions of these vowels, and considerable overlap between a number of vowel categories. Even when the vowel productions do not overlap, they are very near to one another in the acoustic space, making categorization difficult.
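The effect of this overlap on categorization can be illustrated with synthetic data: two vowel “clouds” generated around nearby (F1, F2) means, classified by nearest mean. The means and spreads are invented for illustration, not Peterson and Barney’s values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented (F1, F2) means and standard deviations (Hz) for two neighbouring vowels.
mean_eh = np.array([550.0, 1800.0])   # an [ɛ]-like category
mean_ae = np.array([700.0, 1700.0])   # an [æ]-like category
spread = np.array([80.0, 150.0])

tokens_eh = rng.normal(mean_eh, spread, size=(200, 2))

def classify(token):
    """Nearest-mean classification in the raw F1 x F2 plane."""
    d_eh = np.linalg.norm(token - mean_eh)
    d_ae = np.linalg.norm(token - mean_ae)
    return "eh" if d_eh < d_ae else "ae"

# Some tokens of the first vowel land on the wrong side of the boundary.
errors = sum(classify(t) == "ae" for t in tokens_eh)
print(errors, "of 200 tokens misclassified")
```

When category means are close relative to the within-category spread, misclassifications are unavoidable — the same pressure that listeners face with overlapping natural productions.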
These vowels were then presented to listeners who were asked to report which vowel was being spoken. Although most vowels were identified correctly most of the time, there was a considerable number of vowels that were frequently misidentified in the data as well. When the data was re-plotted including only vowel tokens that were correctly identified 100% of the time, the overlap between the vowel categories in the acoustic space was reduced.
The question remains about how speakers and listeners are able to communicate with such a high degree of variation in the acoustic transmission. In fact, variation in the vowel space has been found for different speakers depending on age, sex, speech rate, speech style, dialect, and of course different languages also have diverse vowel spaces when compared to one another.
Here is an example of differences between male and female vowel spaces in another study of American English vowels. Here we see that the vowel space of women is larger than that of men. The black squares indicate the vowel space created from the mean formant values. White circles indicate the smallest value for each vowel, while the gray squares indicate the largest values. There are a variety of reasons why this difference between men and women may exist, from physical characteristics to socialization and gender roles, but whatever the source of this variation, the problem of how to equate these very different acoustic patterns to one another within the same perception or speech recognition system remains.
Despite the variability within the vowel space, the human vocal tract is still limited by its physical properties. It’s no wonder that vowel productions sometimes overlap with one another inside this space. However, we might expect languages to have smaller vowel inventories as a result of this limitation, in order to maximize the perceptual difference between vowel categories. Indeed, many languages of the world have small vowel inventories of only 3 or 5 vowel qualities. In these cases, the vowel space tends to resemble an inverted triangle, with high front and back vowels and a low central or front vowel. As more vowel contrasts are added into the acoustic space, the vowel qualities crowd and push each other around to maintain their perceptual distance – much the way people will arrange themselves to be evenly spaced within a crowded elevator.
Given the limited acoustic space which we have to work with inside our vocal tracts, vowel inventories in languages need to balance pressures leading to ease of perception, that is, maintaining vowel distinctions that are as far as possible from one another, with ease of production. This puts a limit on the number of vowel contrasts one language can make use of. With more contrasts to be identified, the level of perceptual confusion rises, which may lead to phenomena like mergers where contrasts between sounds are lost.
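The idea of categorizing vowels by their position in the F1-F2 plane can be sketched in a few lines of Python. The formant values below are made-up illustrative numbers for a three-vowel triangular inventory, not measurements from the studies discussed above; a real system would also have to cope with the speaker variation just described.

```python
import math

# Hypothetical mean (F1, F2) values in Hz for the three corner vowels of a
# small triangular inventory (illustrative numbers only).
centroids = {
    "i": (300, 2300),   # high front: low F1, high F2
    "a": (750, 1300),   # low central: high F1, mid F2
    "u": (320, 800),    # high back: low F1, low F2
}

def classify_vowel(f1, f2):
    """Label a token with the nearest vowel centroid in the F1-F2 plane."""
    return min(centroids, key=lambda v: math.dist((f1, f2), centroids[v]))

print(classify_vowel(340, 2100))  # falls in the /i/ region -> "i"
print(classify_vowel(700, 1250))  # falls in the /a/ region -> "a"
```

Nearest-centroid classification like this works well for well-separated tokens, but fails exactly where the plots above show overlap between vowel categories.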


Readings for module 2 focus on the acoustic properties of consonants and vowels.

Reading

Wayland (Phonetics) – Chapter 8 – Acoustic Properties of Vowels and Consonants

An overview of the properties of vowels and consonants

Cho & Ladefoged – Variation and universals in VOT: evidence from 18 languages

Voice onset time (VOT) is known to vary with place of articulation.

Exploring Speech Acoustics

In the lab for module 2, you will continue to explore speech acoustics through visualisations in Praat. The recordings and textgrids mentioned in the instructions are linked below.

Lab Commentary

Video commentary of the module 2 lab

That completes our main modules on articulatory and acoustic phonetics. You should now have a basic understanding of how vowels and consonants are produced in terms of the vocal tract and its articulators. You should also have seen that we can “see” evidence of these articulations in speech acoustics, as represented in a spectrogram. These acoustic cues can be used to “read” spectrograms, i.e. to tell what someone has said just by looking at a spectrogram. This is in essence what automated transcription systems attempt to do! So, it’s important to know which acoustic properties of the speech waveform are important for identifying what has been said for speech recognition. For speech generation, we want to make sure we generate the right acoustic features so that the waveform is understood as speech.

The next two modules will look at aspects of acoustic phonetics from more of an engineering point of view. We’ll come back to more phonetics issues in later weeks, as we learn more about TTS and ASR. In particular, we’ll look at the source-filter model from both theoretical and engineering points of view.

Making connections between the phonetics material and the speech technologies we’ll look at in the coming weeks will help you be an active learner. Just now, you probably have an understanding of issues in phonetics that will feed into how we design speech technologies, but only a vague idea of the ‘big picture’: the ideas may not yet be well-organised in your mind. Keep connecting and organising, and you’ll find that it does all join together.

What you should know from Module 2

Note: we’ll continue to discuss a lot of the ideas around the frequency domain, resonance and the source filter model in modules 3 and 4. 

What does a speech waveform (i.e. in the time-domain) represent?

  • Time versus amplitude graphs
  • Oscillation cycle
  • Period T and wavelength λ (we’ll revisit this in the next few modules)
  • Frequency (F=1/T)
  • What are “Hertz”?
  • How to calculate the frequency of a waveform by measuring pitch periods (Example in the ”waveform” video)
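The F = 1/T relationship from the list above can be checked with a couple of lines of Python. In practice it is more reliable to measure the total duration of several consecutive pitch periods and divide, as in the second function below (the specific numbers are just illustrative):

```python
def frequency_from_period(T):
    """Fundamental frequency in Hz from one pitch period T (seconds)."""
    return 1.0 / T

def frequency_from_periods(total_duration, n_periods):
    """Average F0 from the duration of n consecutive pitch periods."""
    return n_periods / total_duration

# A single pitch period of 0.008 s corresponds to F0 = 125 Hz...
print(frequency_from_period(0.008))        # -> 125.0
# ...and measuring 10 periods spanning 0.08 s gives the same answer.
print(frequency_from_periods(0.08, 10))    # -> 125.0
```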

Types of waveform:

  •  Simple versus complex waves
  •  Periodic versus aperiodic waves
  •  Continuous versus transient waves
  •  Fundamental frequency (F0)

Spectrum:

  •  The spectrum as a representation of waveform frequency components
  •  What is the spectral envelope?
  •  Why do we consider F0 and harmonics to be “source” characteristics?
  •  What is the relationship between formants and resonance?
  •  F0 is not a formant!
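To make the source characteristics in the list above concrete, here is a small sketch: it builds one second of a complex periodic wave (a 100 Hz fundamental plus two harmonics of decreasing amplitude) and probes its spectrum with a plain discrete Fourier transform. The sample rate, F0 and amplitudes are arbitrary choices for illustration; real tools such as Praat use an FFT rather than this slow direct sum.

```python
import math

sr = 1000   # samples per second (illustrative, far below real audio rates)
f0 = 100    # fundamental frequency in Hz

# Complex periodic wave = sum of sine waves at integer multiples of F0.
signal = [sum(a * math.sin(2 * math.pi * k * f0 * n / sr)
              for k, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
          for n in range(sr)]

def magnitude_at(freq):
    """DFT magnitude of the signal at a single frequency (Hz)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / sr)
             for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * n / sr)
             for n, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)

# Energy shows up only at the harmonics (100, 200, 300 Hz), not in between.
for f in (100, 150, 200, 300):
    print(f, round(magnitude_at(f), 3))
```

The spectrum of this “source” is a series of peaks at multiples of F0; in real speech, the vocal-tract filter then shapes the amplitudes of those harmonics into the spectral envelope, with formants at the resonant frequencies.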

Spectrogram:

  •  What do the x and y axes of a spectrogram represent (e.g. in Praat)?

Acoustics of Vowels:

  •  What is the general relationship between formants (acoustics) and tongue position (articulation)?
    •  F1 and vowel height
    •  F2 and vowel frontness
  •  Acoustic vowel space:
    • You don’t need to know the specific formants associated with different vowels, but if you understand the relationship between height/frontness and formants you should be able to deduce this!

 Acoustics of Consonants:

  •  What does voicing look like on a spectrogram or spectrum?
  •  What are sonorants versus obstruents?
  •  Identify basic acoustic characteristics (i.e. on a spectrogram) of consonant manners: plosives (i.e., stops), aspiration (for plosives), fricatives, nasals, approximants.
  • Clues for place of articulation: stops, fricatives

 Voice Onset Time:

  •  What types of voice onset time are there?
  •  What does voice onset time look like on a spectrogram?
  •  How would you measure VOT (i.e. relative to a stop burst)?
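The measurement in the last bullet can be written down directly: VOT is the time from the stop burst (release) to the onset of voicing, so it is just a difference of two annotated times. The times below are hypothetical example values, not measurements from the readings.

```python
def vot_ms(burst_time_s, voicing_onset_s):
    """Voice onset time in milliseconds, measured relative to the burst.

    Positive VOT: voicing starts after the release (long-lag / aspirated).
    Negative VOT: voicing starts before the release (voicing lead / prevoiced).
    """
    return (voicing_onset_s - burst_time_s) * 1000.0

print(round(vot_ms(0.120, 0.185), 1))  # -> 65.0  (long-lag, aspirated-like)
print(round(vot_ms(0.120, 0.110), 1))  # -> -10.0 (voicing lead)
```

In a Praat workflow, the two arguments would come from boundaries you place on a TextGrid tier at the burst and at the first voicing striations in the spectrogram.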

Vowel space

  • How can you interpret an F1 vs F2 plot of vowel measurements?
  • Relate this to vowel characteristics/the IPA vowel chart (e.g. tongue height and frontness)

Key Terms

  • waveform
  • amplitude
  • sine wave
  • period
  • frequency
  • wavelength
  • Hertz
  • fundamental frequency
  • harmonics
  • spectrum
  • spectrogram
  • spectral envelope
  • Fourier transform
  • formant
  • vowel height
  • vowel frontness
  • open vowel
  • close vowel
  • sonorant
  • obstruent
  • plosive
  • voicing
  • voice onset time
  • aspiration