Voice Onset Time (VOT)

The terms “voiced” and “voiceless” are used to indicate whether the vocal folds are vibrating, but these terms do not tell us much about when those vibrations occur, relative to other events.

Voicing is the result of vocal fold vibration. In previous videos we have seen how voicing appears in waveform, spectrum and spectrogram, and we know that it is independent of place and manner in consonant production. We use the terms “voiced” and “voiceless” to indicate whether the vocal folds are vibrating, but these terms do not tell us much about when those vibrations occur, relative to other events.
Consider for a moment a phrase like “come and get it”. When we examine the waveform and spectrogram, we can see that voicing starts and ends at various places throughout the phrase. Sometimes, voicing persists throughout a number of phones, across both consonants and vowels, as we see here. Without even listening to the audio, we can see that voicing is present by observing the voice bar here at the bottom of the spectrogram, and also by observing the periodic structure of the waveform.
At other times, voicing alternates from one phone to the next, as we see at the start of this phrase where a voiceless stop is followed by voicing in a vowel. (The opposite pattern appears at the end of the phrase, where the vowel is followed by a voiceless stop.)
If you look closely, you might also notice that there is a very brief interval of time where voicing stops during the closure of the [g] sound. If we remember that in the IPA the [g] symbol stands for a voiced velar stop, this becomes very curious indeed. Why is a voiced sound produced without voicing?
As it turns out, the alignment of voicing with oral closure during stops varies across languages, and we describe this alignment by referring to Voice Onset Time.
Voice Onset Time is typically used only for stops/plosives, so let’s briefly take a moment to consider why this is.
First, recall that some sounds are typically voiced:
- Sonorants
while others come in voiced and voiceless pairs:
- Obstruents
Voicing depends on air being able to flow through the larynx in order to set the vocal folds in motion. Sonorants are typically voiced continuously throughout the entire duration because the vocal tract is open, allowing air to flow freely.
Try this: hum a bilabial nasal [m]. How long can you keep it going? Forever!
In obstruents, on the other hand, there is constriction in the vocal tract, which impedes or obstructs airflow. This will have implications for how and when voicing can occur.
In fricatives, this doesn’t really get in the way of voicing too much (although there are aerodynamic tradeoffs in order for both voicing and frication to happen at the same time; we won’t get into those here).
Try this: how long can you sustain a [v] sound? – quite a long time!
When we consider stops, things start to get a bit more interesting. Since stops involve complete closure of the vocal tract, there is a limit to how much air can flow in order to create voicing.
Try this: how long can you sustain voicing in a voiced bilabial stop [b]? What happens when you try to keep voicing going longer? (Side note: since voiceless stops don’t involve voicing, you should be able to hold that closure as long as you can hold your breath.)
So, we can now see that there are physical limitations to how long voicing can overlap with oral stop closures. While voicing *could* begin and end at any time during sonorants or fricatives, it tends to persist throughout those sounds. In stops, however, the timing of voicing relative to stop closure and release is variable.
Phoneticians describe voice onset time (VOT) in plosives relative to the release burst. This is analogous to a number line, where the burst is located at zero. Voicing before the burst is measured in negative numbers, while voicing that begins after the burst is measured in positive numbers. Note that VOT (like most durations in speech) is typically reported in milliseconds.
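The number-line idea can be written as a single subtraction. The following is a minimal sketch, assuming we already have two hand-measured time points in seconds (for example, read off a waveform or spectrogram); the function name `vot_ms` and its parameter names are hypothetical and chosen only for illustration.

```python
def vot_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds.

    The release burst is treated as zero on the number line, so VOT is
    the voicing onset time minus the burst time. Both inputs are assumed
    to be in seconds, measured from the same point in the recording.
    """
    return (voicing_onset_s - burst_time_s) * 1000.0


# Voicing starts before the burst, so the result is negative.
print(round(vot_ms(burst_time_s=1.000, voicing_onset_s=0.842)))

# Voicing starts after the burst, so the result is positive.
print(round(vot_ms(burst_time_s=1.000, voicing_onset_s=1.062)))
```

Subtracting in this order means the sign of the result already says whether voicing leads or lags the burst.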
As a result of this style of measurement, there are three types of VOT:
- pre-voicing, or voicing lead (negative VOT)
- zero VOT, or short voicing lag
- post-voicing, or long voicing lag (positive VOT)
First let’s consider the case of prevoiced stops. Stops that are produced with prevoicing, or negative VOT, will show evidence of voicing during the oral closure, followed by a release burst.
This spectrogram shows an example of a voiced bilabial stop [b], produced between two vowels. Here we can see that voicing continues throughout the stop closure, which is shaded in gray. Voicing is evident in both the waveform, where periodic oscillations are present, as well as in the spectrogram where we can see a voice bar and vertical striations indicative of glottal pulses. The duration of voicing prior to the release of oral closure is 158 ms, which we report as a negative voice onset time of -158 ms.
Stops produced with zero voice onset time have voicing that begins simultaneously (or nearly simultaneously) with the release of oral closure.
This spectrogram shows an example of a voiceless bilabial stop [p], produced between two vowels. We can see the closure of the stop both in the waveform where the signal is flat, and in the spectrogram where there is no shading anywhere in the frequency range. The release burst is shaded in gray, and we can see that the burst duration is short and voicing begins immediately after that release. We often refer to this type of stop release as having “zero” VOT, but often it in fact involves a very short lag of a few milliseconds. In this case the lag lasts for 13 ms after the initial release burst.
The third type of VOT is post-voicing, also called long voicing lag or positive VOT.
Stops that are produced with positive VOT will typically have no evidence of voicing during oral closure, and the release burst will be followed by an interval of aspiration, or turbulent noise resembling frication.
This spectrogram shows an example of an aspirated voiceless bilabial stop [pʰ] produced between two vowels. Here again we can see that the stop closure is voiceless by examining the waveform and spectrogram, though you may notice that voicing does not end immediately when the closure begins. This is known as “residual voicing” and is quite common, even in voiceless stops. In this case, the burst release is followed by a bit of noise, which often appears in stops with long-lag VOT. We call this noise aspiration. If we look at the waveform and spectrogram here, we can see that this closely resembles a fricative, and indeed aspiration noise is a type of frication. Because voicing begins sometime after the release of the oral closure, we report the 62 ms of lag as a positive number.
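The three measurements from these examples (-158 ms, 13 ms and 62 ms) can be sorted into the three VOT types with a small function. This is only a sketch: the transcript does not give a boundary between short and long lag, so the 30 ms cutoff below is an assumption made for illustration, and the names `SHORT_LAG_MAX_MS` and `vot_category` are hypothetical.

```python
# Assumed cutoff between short and long voicing lag, for illustration only.
# The transcript does not specify a boundary value.
SHORT_LAG_MAX_MS = 30.0


def vot_category(vot_ms: float) -> str:
    """Label a VOT value (in ms) as voicing lead, short lag, or long lag."""
    if vot_ms < 0:
        return "voicing lead"        # voicing begins before the burst
    if vot_ms <= SHORT_LAG_MAX_MS:
        return "short voicing lag"   # the so-called zero VOT
    return "long voicing lag"        # typically accompanied by aspiration


# VOT values reported for the three spectrogram examples.
examples = {"[b]": -158, "[p]": 13, "[pʰ]": 62}
for symbol, vot in examples.items():
    print(f"{symbol}: {vot:+d} ms -> {vot_category(vot)}")
```

A different cutoff would move borderline tokens between the two lag categories, which is one reason VOT is usually reported as a number in milliseconds rather than only as a label.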
So far we have seen how voicing may align with the closure and release phases of stops. Now we will think a bit about how this aligns with the IPA. The IPA is a system of phonetic transcription based on articulatory parameters, but the precise alignment to articulatory (and acoustic) events is generally not specified. This is in part because one of the main goals of the IPA is to capture linguistically relevant contrasts in the sound system of a language – not to faithfully represent the particulars of any one production of speech.
In fact, studies have shown that not all voiceless sounds are voiceless in the same way. We might think, for example, that all voiceless unaspirated stops have the same VOT values. Perhaps we might expect them all to have zero (or small positive) VOT of roughly the same magnitude. However, place of articulation actually has an effect on VOT, with bilabial sounds having the shortest VOT, followed by alveolars, then by velars.
Languages may also differ as to how they maintain voicing contrasts in their sound systems, and linguists often use symbols in a confusing way when describing those contrasts. For example, both Spanish and English are said to have voiced and voiceless stops, which we transcribe using the appropriate IPA symbols for such sounds.
However, if we look at the acoustic productions of these sounds, we see that voiced stops in English have zero VOT, while voiceless stops have positive VOT (and aspiration). In Spanish, voiced stops are prevoiced, while voiceless stops have zero VOT and are unaspirated. Nevertheless, we use the [b] symbol to represent both the English zero VOT ‘b’ sound and the negative VOT ‘b’ of Spanish.
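One way to keep this mismatch straight is to write the pattern down as a lookup table. The sketch below only encodes the English and Spanish generalizations stated here; the table name `VOT_CATEGORY` and the helper `same_category` are hypothetical, and real productions vary from token to token.

```python
# VOT category of each stop series, as described for English and Spanish.
VOT_CATEGORY = {
    ("English", "b"): "zero VOT (short lag)",
    ("English", "p"): "positive VOT (long lag)",
    ("Spanish", "b"): "negative VOT (voicing lead)",
    ("Spanish", "p"): "zero VOT (short lag)",
}


def same_category(first: tuple, second: tuple) -> bool:
    """Return True if two (language, symbol) pairs share a VOT category."""
    return VOT_CATEGORY[first] == VOT_CATEGORY[second]


# Same symbol, different timing of voicing.
print(same_category(("English", "b"), ("Spanish", "b")))

# Different symbols, same timing of voicing.
print(same_category(("English", "b"), ("Spanish", "p")))
```

The second comparison makes the point concrete: in VOT terms, the English ‘b’ patterns with the Spanish ‘p’, not with the Spanish ‘b’.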
Furthermore, some languages even have more than 2 voicing contrasts, adding complexity to the question of how to represent such productions with a phonetic transcription system.
For example, Thai maintains 3 voicing categories: voiced, voiceless unaspirated, and voiceless aspirated, while Hindi maintains 4: voiced, voiced aspirated, voiceless unaspirated, and voiceless aspirated.
So, despite using the same terminology to identify voiced and voiceless sounds, languages can and do differ with respect to how they align voicing with stop closure, and these differences may not always be apparent from phonetic transcriptions alone.
