Module 6 – Speech Synthesis – waveform generation and connected speech

Manipulating recorded speech signals to create new utterances.

It’s time to bring together everything we learned earlier about speech signals and the source-filter model, and use that to develop a method for creating synthetic speech. This course only covers one method, which uses a database of recorded natural speech from which small waveform fragments are selected and then concatenated. These waveform fragments are of special sub-word units called diphones. Because they are taken from natural recordings, they will have an F0 and duration but these might not match the values we predicted in the front end. Therefore a method is needed to modify these properties, without changing the spectral envelope.

Here’s what you’re going to learn in the videos:

We’ve also included a second tab of videos on connected speech. We won’t go into this topic too much right now, but wanted to share them with you in case they are helpful for thinking about potential errors in TTS (for assignment 1).

Lecture Slides

Slides for Thursday lecture (google) [updated 25/10/2023]

Total video to watch in this section: 38 minutes

Phones are not a suitable unit for waveform concatenation, so we use diphones, which capture co-articulation.

It's finally time to think about how to make synthetic speech.
The obvious approach is to record all possible speech sounds, and then rearrange them to say whatever we need to say.
So then the question is, what does 'all possible speech sounds' actually mean?
To make that practical, we'd need a fixed number of types so that we can record all of them and store them.
I hope it's obvious that words are not a suitable unit, because new words are invented all the time.
We can't record them all, for the same reason we could never write a dictionary that has all of the words of the language in it.
We need a smaller unit than the word: a sub-word unit.
So how about recordings of phonemes?
Well, a phoneme is an abstract category, so when we say it out loud and record it, we call it a 'phone'.
As we're about to see, phones are not suitable, but we can use them to define a suitable unit called the diphone.
Consider how speech is produced.
There is an articulatory target for each phone that we need to produce, to generate the correct acoustics, and that's so the listener hears the intended phone.
Speech production has evolved to be efficient, and that means minimising the energy expended by the speaker and maximising the rate of information.
That means that this target is only very briefly achieved before the talker very quickly moves on towards the next one.
I'm going to illustrate that using the second formant as an example of an acoustic parameter that is important for the speaker to produce, so that the listener hears the correct phone.
These are just sketches of formant trajectories, so I'm not going to label the axes too carefully.
This is, of course, 'time'.
The vertical axis is formant frequency.
To produce this vowel, the speaker has to reach a particular target value of F2.
It's the same in both of these words, because it's the same vowel.
That target is reached sufficiently accurately and consistently for the listener to be able to hear that vowel.
But look at the trajectory of that formant coming in to the target and leaving the target.
Near the boundaries of this phone, there's a lot of variation in the acoustic property.
Most of that variation is the consequence of the articulators having to arrive from the previous target or start moving towards the next target.
In other words: the context in which this phone is being produced.
For example, here the tongue was configured to produce [k] and then had to move to the position for the vowel [æ] and it was still on the way towards that target when voicing had commenced.
In other words, the start of the vowel is strongly influenced by where the tongue was coming from.
The same thing happens at the end.
The tongue is already heading off to make the next target for the [t] before the vowel is finished.
So the end of the vowel is strongly influenced by the right context.
Imagine we wanted to re-use parts of these recordings to make new words that we hadn't yet recorded.
Could we just segment them into phones and use those as the building blocks?
Let's try that.
Clearly, the [æ] in [k æ t] and the [æ] in [b æ t] are very different because of this contextual variation near the boundaries of the phone.
That means that we can't use an [æ] in [k æ t] to say [b æ t]: these are not interchangeable.
We're looking for sounds to be interchangeable so we can rearrange them to make new words.
There's a very simple way to capture that context-dependency, and that's just to redefine the unit from the phone to the diphone.
Diphones have about the same duration as a phone, but their boundaries are in the centres of phones.
They are the units of co-articulation.
Now look what happens.
The æ-t diphone in [k æ t] is very similar to the one in [b æ t], and these two sound units are relatively interchangeable.
We could use the æ-t diphone from [k æ t] to say [b æ t] or any other word involving æ-t.
We've simply redefined the units from phones to diphones.
Diphones are units of co-articulation.
Now, of course, co-articulation spreads beyond just the previous or next phone.
This is just a first-order approximation, but it's much better than using context-independent phones as the building blocks for generating speech.
These two phones sound very different, because of context: they're not interchangeable.
They're not useful for making any new word with an [æ] in it.
In contrast, these two diphones sound very similar: they are relatively interchangeable.
We can use them to make new words requiring the [æ t] sequence.
They capture co-articulation.
Obviously, there's going to be rather more types of diphone than there are of phone.
If we have 45 phone classes, we're going to need 46^2 classes of diphone, roughly.
Not all are possible, but most are.
Why 46?
Because diphones need to have silence: the IPA forgot about that.
There are a lot more diphone types than phone types, but it's still a closed set, and it's very manageable.
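If you want to convince yourself of the size of that set, here's a tiny Python sketch; the phone labels are placeholders rather than a real phone set, and silence is added as a 46th symbol.

```python
# A minimal sketch: count diphone types for a hypothetical 45-phone set plus silence.
phones = [f"p{i}" for i in range(45)]      # placeholder phone labels, not a real phone set
symbols = phones + ["sil"]                 # silence counts as a 46th symbol

# every ordered pair of symbols is a potential diphone type
diphone_types = [(a, b) for a in symbols for b in symbols]
print(len(diphone_types))                  # 46 * 46 = 2116 (not all will occur in practice)
```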
If we record an inventory of natural spoken utterances and segment them into diphones, we can extract those diphones and rearrange them to make new utterances.
Here's a toy example, using just a few short phrases.
In general, we would actually have a database of 1000s or 10 000s of recorded sentences.
This picture is the database.
These are natural sentences, segmented into diphones.
Now let's make an utterance that was not recorded.
In other words, let's do speech synthesis!
This word is not in the database.
We've created it - we've synthesised it - by taking diphones (or short sequences of diphones) from the database and joining them together.
There's actually more than one way to make this word from that database.
Here's another way.
This way involves taking longer contiguous sequences of diphones and only making a join here; that might sound better.
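To make the toy example concrete, here's a small illustrative sketch that greedily covers a target diphone sequence with the longest contiguous runs it can find in a database of segmented utterances. This is not how a real synthesiser chooses between alternatives (that's unit selection, which we don't cover here); it just shows the bookkeeping involved, and all the names and the toy database are my own inventions.

```python
def target_diphones(phones):
    """Turn a phone sequence into its diphone sequence, padding with silence."""
    phones = ["sil"] + list(phones) + ["sil"]
    return list(zip(phones, phones[1:]))

def greedy_cover(target, database):
    """Cover the target diphone sequence with the longest contiguous runs found
    in the database (a dict: utterance id -> list of diphones). Purely
    illustrative and greedy; real synthesisers use unit selection instead."""
    pieces, i = [], 0
    while i < len(target):
        best = None                          # (run length, utterance id, start index)
        for utt_id, utt in database.items():
            for j in range(len(utt)):
                k = 0
                while (i + k < len(target) and j + k < len(utt)
                       and utt[j + k] == target[i + k]):
                    k += 1
                if k and (best is None or k > best[0]):
                    best = (k, utt_id, j)
        if best is None:
            raise ValueError(f"diphone {target[i]} not found in the database")
        k, utt_id, j = best
        pieces.append((utt_id, j, j + k))    # a contiguous run of k diphones
        i += k
    return pieces

# toy usage: make "cat" from utterances containing "cab" and "bat"
db = {"cab": target_diphones(["k", "ae", "b"]), "bat": target_diphones(["b", "ae", "t"])}
print(greedy_cover(target_diphones(["k", "ae", "t"]), db))
```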
The phone is not a useful building block directly, but we can use it to define the diphone, which is a useful building block for synthesising speech.
We make new utterances by rearranging units from recorded speech.
The smallest unit we'll ever take is the diphone, but we already saw in the toy example that taking sequences of diphones is also possible.
That will involve making fewer joins between the sequences and it'll probably sound better, on average.
The key point here is that the joins in the synthetic speech will always be in the middle of phones and never at phone boundaries.
I've said that we're going to join sequences of diphones together, so the next step is actually to define precisely how that's done.
We need a method for concatenating recorded waveforms.

Concatenation of waveforms is a simple way of making synthetic speech, but we need to take care about how we do it.

We've defined the diphone and decided that it's a suitable unit for building new spoken utterances.
That's because it captures co-articulation, at least locally.
If we had a recorded database of speech containing at least one example of every possible diphone, we could simply concatenate diphone units (or sequences of diphone units) taken from those recordings, to create any new utterance we like.
Now we're going to make a first attempt at waveform concatenation to discover whether it really is that simple.
Here are two diphone sequences.
On the left k-æ and on the right æ-t.
I'm going to use them to make a synthetic version of the word 'cat'.
So I'll simply concatenate them: join them together, like that.
It sounds like this: 'cat'
Let's hear that again: 'cat'
There's a distinct click in the middle.
Let's see why that is.
Zoom in, and at the concatenation point (often we simply say 'the join') we can see a big discontinuity.
We need to be a little bit more careful about where and how we make the join.
One option would obviously be to join the waveforms only at a point where they have the same amplitude, so we don't get the discontinuity.
One such point is where they both cross zero: at a zero crossing.
Here's a new version then, where the waveforms are joined at a zero crossing.
That's the zero crossing.
We've got no sudden discontinuity.
It sounds like this: 'cat'.
The click has gone, which is good, but the join is still audible.
One reason the join might be audible is the following problem.
I've zoomed back out a little bit, so we can clearly see the fundamental periods of the signal.
Those are coming from the sound source, which of course here is the vocal folds, because this is voiced speech.
The periodicity here does not properly align as we transition from one diphone to the next.
The periods aren't evenly spaced.
So we can do even better than this, even better than joining it at zero crossings.
We can join in a way that is called 'pitch-synchronous'.
That's going to involve adjusting the end point of the first diphone or the start point of the second diphone to keep the periodicity intact across the join.
To make pitch-synchronous joins, we need to annotate the waveform with a single consistent moment in time within each fundamental period that corresponds to the activity of the vocal folds.
Since we don't have access to the vocal folds of the speaker, we can only estimate these moments in time from the waveform.
That process is known as 'pitch marking'.
We're not going to describe the algorithm for doing pitch marking here.
Just assume that there is one and it's possible.
We'll have an algorithm that will find the fundamental periods and place a mark at a consistent point in each of them.
We should give a slightly more careful definition of some terms.
The term 'epoch' is used to indicate the moment of closure of the vocal folds.
That's a physical event during speech production, and we don't have direct access to that from the waveform.
A 'pitch mark' - which these are - is our estimate of the epoch, annotated onto a speech waveform.
In the literature, you'll find the terms 'epoch' and 'pitch mark' sometimes used interchangeably, even though they're not quite the same thing.
Here's our final way of concatenating the two waveforms, this time making sure the join is placed in a way that preserves the fundamental periodicity.
That is, so that the pitch marks are correctly spaced around the join.
If we zoom in, we can now see that the fundamental period is consistent across the join.
We can barely see where the join is.
It sounds like this: 'cat'.
The only evidence remaining of the join really is this change in amplitude.
We can still hear it a little.
We could still do better than that.
But this is our best way yet of concatenating two waveforms.
Let's compare the three methods.
The very naive way: 'cat'.
There was a big click because we didn't take care as to where we joined the waveforms.
Joining at a zero crossing: 'cat' - a less perceptible join.
Joining pitch-synchronously: 'cat'.
You might not hear much difference between the last two in this example, but pitch-synchronous joins are, on average, slightly less perceptible than zero-crossing joins.
That will matter much more when we're constructing a longer utterance with many waveform concatenation points.
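If you'd like to experiment with this yourself, here's a minimal Python sketch of the zero-crossing idea, assuming two waveform fragments `a` and `b` as NumPy arrays and an arbitrary search window; a pitch-synchronous join would go further and use pitch marks to choose the cut points.

```python
import numpy as np

def join_at_zero_crossing(a, b, search=200):
    """Concatenate two waveform fragments, cutting each at a nearby zero crossing.
    A sketch only: a pitch-synchronous join would additionally use pitch marks to
    keep the fundamental periods evenly spaced across the join."""
    # last sign change within `search` samples of the end of a
    tail = np.signbit(a[-search:])
    zc_a = np.where(tail[:-1] != tail[1:])[0]
    end_a = len(a) - search + zc_a[-1] + 1 if len(zc_a) else len(a)

    # first sign change within `search` samples of the start of b
    head = np.signbit(b[:search])
    zc_b = np.where(head[:-1] != head[1:])[0]
    start_b = zc_b[0] + 1 if len(zc_b) else 0

    return np.concatenate([a[:end_a], b[start_b:]])
```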
We've learned, then, that care is needed when concatenating waveforms in order to minimise the chance that the listener will hear the join.
We can still do better.
You'll have noticed that the two diphone sequences I joined here had noticeably different amplitudes, and that was audible.
The fundamental frequency might also suddenly change at a concatenation point, and that will be audible too.
We need to develop a slightly more powerful way to manipulate speech that can solve these problems, to minimise the chance that a join is audible.

Cross-fading between two waveforms is an effective way to avoid some of the artefacts of concatenation.

We're synthesising speech by joining together recorded diphones: by concatenating their waveforms.
We've seen waveform concatenation must be done with care.
That's actually something that's true in general about signal processing.
Signal processing isn't just about theory; it's about the details and the care of the implementation.
We found that joining at zero crossings was much better than joining at an arbitrary point, but that a pitch-synchronous concatenation point can further reduce the chances of a join being audible.
But in general, it might not be possible to always find a place where the pitch marks are nicely spaced and there's a zero crossing.
So we need a more general-purpose method to make smooth sounding joins in all possible cases.
That actually turns out to be rather simple: cross-fading.
That's called overlap-add.
Here's the solution.
As one song comes to an end, this DJ has the next song ready on the other deck.
The DJ makes sure the two songs have a similar tempo and that the beats are aligned, then fades out the previous song whilst fading in the next song.
If the DJ does a good job of that, no one on the dance floor will notice and everyone keeps dancing happily.
We've already got our beats aligned by making pitch-synchronous joins.
Now we'll add the fade-out of the previous waveform while fading in the next one.
Here are two waveforms I'd like to concatenate.
I'm just going to simply fade out the first one.
In other words, I'm going to reduce the volume just at the end, down to zero, smoothly.
I'm going to fade in the second one: increase the volume from zero, smoothly.
Then overlap and add them.
Apply that fade out and fade in: that just scales down the amplitude to zero, smoothly.
I overlap them, and where the samples overlap, I'll sum them, I'll add them together.
Hence the name of the method: 'overlap-add'.
Here are the waveforms concatenated using overlap-add.
Now the join is quite hard to spot.
Overlap-add is a very general method; it's very useful.
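Here's a minimal sketch of that fade-out, fade-in and sum, assuming two NumPy waveform arrays and an arbitrary linear cross-fade length; a real system would choose the overlap region pitch-synchronously.

```python
import numpy as np

def overlap_add_join(a, b, overlap=80):
    """Cross-fade the last `overlap` samples of `a` with the first `overlap`
    samples of `b`: fade out, fade in, then sum where they overlap."""
    fade_out = np.linspace(1.0, 0.0, overlap)      # smoothly scale a's tail down to zero
    fade_in = np.linspace(0.0, 1.0, overlap)       # smoothly scale b's head up from zero

    out = np.concatenate([a, b[overlap:]]).astype(float)
    out[len(a) - overlap:len(a)] = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return out
```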
But we're not quite as good as the DJ yet.
We've not taken care to match the fundamental period of the two waveforms before joining them.
So there may still be an audible discontinuity in the pitch.
Sudden changes in pitch are not natural and listeners will notice them.
You'll also remember that our front end has predicted values for F0 and for duration.
So we need a method to impose those onto the recorded speech because, in general, our recorded diphones won't have the desired F0 or duration that our front end predicted.
There's a single solution to all of those problems: a method for modifying both the fundamental frequency and the duration of speech directly on waveforms, called Time-Domain Pitch-Synchronous Overlap-and-Add.
Before understanding that, we'll need to revisit the concept of the pitch period, which will be the fundamental unit on which Time-Domain Pitch-Synchronous Overlap-and-Add will work, and remind ourselves how the pitch period relates to the source-filter model.
Once we've done that, we can combine the idea of the pitch period with the technique of overlap-add and understand this powerful algorithm Time-Domain Pitch-Synchronous Overlap-and-Add, or TD-PSOLA for short.

This fundamental building block of speech waveforms offers a route to source-filter separation in the time domain.

We're going to talk a bit more about the concept of the pitch period.
We're going to remind ourselves how that relates to the source-filter model.
We take an impulse train and we pass it through a filter with resonances.
We can generate speech with the source-filter model.
One way to understand the filter is through its impulse response.
That is, we put a single impulse into the filter and observe the waveform that we get out.
That waveform is a pitch period.
Using this idea, we'll devise a way to break a natural speech signal down into a sequence of pitch periods, which later we can use to manipulate that speech signal.
Here's a short fragment of voiced natural speech.
The source-filter model tells us that this was generated using a simple excitation signal from the vocal folds (idealised as an impulse train in the model) passed through the vocal tract filter.
So this signal is a sequence of vocal tract impulse responses.
We need to separate the source and the filter so that we can manipulate them separately.
For example, to modify the fundamental frequency (F0) without changing the identity of the phone, we just need to manipulate the source and leave the filter alone.
We need to identify the filter from this signal so that we can preserve it.
One way to do that, of course, would be to fit a source-filter model and solve for the filter coefficients.
But remember that there are several ways to represent the filter, all containing the same information but in different domains.
Those coefficients (of the difference equation) are just one representation.
We could also talk about the frequency response of the filter, or about its impulse response.
The impulse response exists in the time domain.
So there is a way to get hold of the filter right here in the time domain.
We're going to use the impulse responses as our representation of the vocal tract filter.
That will require us to find the fundamental periods of the speech, so we can find those impulse responses from this waveform.
To find the fundamental periods we need to place pitch marks on the speech.
Pitch marks are estimates of epochs, which are the moments of vocal fold closure.
Pitch marking can be done automatically using methods which are beyond the scope of this video.
Here's a short utterance and its pitch marks.
It sounds like this: 'Nothing's impossible.'
Let's take a closer look.
If we look at this region, we see that the speech is transitioning from voiced to unvoiced.
So what do we do there about the pitch marks?
We still need to break this speech down into pitch periods, even when there is no fundamental frequency.
In other words, we need to find analysis frames.
So we'll just revert to a fixed frame rate - a fixed value of the fundamental period - and that's equivalent to just placing evenly-spaced pitch marks through the unvoiced regions.
Zoom back out and look a different part of the utterance: the end, where the speech is finishing and we end up in silence.
We'll see we also need to place pitch marks here, so that we can break this signal down into short parts.
So we also need to place pitch marks in silence.
In signal processing, silence is typically treated just like the rest of the signal.
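Here's a rough sketch of that idea, assuming the voiced pitch marks have already been estimated and taking 80 samples (5 ms at 16 kHz) as an arbitrary default period for unvoiced and silent stretches.

```python
def fill_unvoiced_marks(voiced_marks, signal_length, default_t0=80):
    """Fill gaps between voiced pitch marks (and leading/trailing silence)
    with evenly spaced marks at a fixed default period.
    default_t0 = 80 samples is an assumed value (5 ms at 16 kHz)."""
    anchors = [0] + sorted(voiced_marks) + [signal_length - 1]
    marks = []
    for a, b in zip(anchors, anchors[1:]):
        marks.append(a)
        gap = b - a
        if gap > 2 * default_t0:                 # unvoiced or silent stretch
            n = gap // default_t0                # number of roughly default-T0 intervals
            marks.extend(int(a + i * gap / n) for i in range(1, n))
    marks.append(signal_length - 1)
    return sorted(set(marks))
```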
Now we've placed pitch marks on our signal.
Aligning them precisely with the true epochs - which we don't have access to - is a little bit hard, although you can see that we can at least place one pitch mark in each fundamental period, and that's good enough.
Now the vocal tract filter has a potentially infinite impulse response, due to vocal tract resonance.
That means that one impulse response will generally not have decayed away to zero before the next one starts: they overlap.
That's particularly obvious in this signal, which is speech from a female speaker.
The fundamental period is about 4 ms, corresponding to an F0 of about 250 Hz.
Quite clearly, the impulse response has not decayed to zero before the next impulse starts.
The impulse responses overlap.
Because the pitch marks might not be precisely at the true epoch locations, and because the impulse responses overlap, we can't naively cut this waveform into individual impulse responses.
There's no place where we can do that.
Instead, we'll remember something we learned earlier about short-term analysis.
We will place an analysis frame - centred on each pitch mark - and apply a tapered window and use that to extract the pitch periods.
We'll find each epoch; we'll place an analysis frame around it; we'll extract a pitch period like that.
We'll do that for every pitch mark in the utterance to get a sequence of pitch periods.
Typically, we make the duration of these twice the fundamental period: 2 x T0.
Sometimes we just say 'two pitch periods'.
These look very much like the overlapping frames of a typical short-term analysis technique.
In fact, it's exactly the same.
The only difference here is that we're placing the frames pitch-synchronously and we're varying their duration in proportion to the fundamental period: we're making it 2 x T0.
These little pitch period building blocks capture only vocal tract filter information.
These little pitch period building blocks are going to be now used to synthesise speech signals by concatenating them using overlap-add.
Let's confirm, then, that overlap-add of these pitch periods does indeed reconstruct the original signal.
These pitch periods were extracted using a very simple triangular window.
So, if we overlap-add, everything adds back together correctly and we get almost the original signal back.
Let's listen to the original whole utterance from which this fragment was taken: 'Nothing's impossible.'
The reconstructed waveform from which the bottom fragment has come: 'Nothing's impossible.'
If you listen carefully on headphones, you'll hear some small artefacts.
But the waveform was reconstructed pretty well.
I've reconstructed this waveform without making any modifications to it, just to prove that I can decompose a speech signal into pitch periods and put it back together again.
This process is often called 'copy synthesis'.
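Here's a sketch of that pitch-synchronous analysis and overlap-add reconstruction, assuming the pitch marks are already available as sample indices and using a simple triangular window; near-perfect reconstruction needs more care with window design and frame placement than this.

```python
import numpy as np

def extract_pitch_periods(x, pitch_marks):
    """Cut a waveform into overlapping 'pitch periods': one frame per pitch mark,
    two fundamental periods (2 x T0) long, with a triangular (tapered) window.
    `pitch_marks` are sample indices, assumed already estimated."""
    periods = []
    for i, m in enumerate(pitch_marks):
        # local T0 = distance to the next mark (or the previous one, at the end)
        t0 = (pitch_marks[i + 1] - m) if i + 1 < len(pitch_marks) else (m - pitch_marks[i - 1])
        start, end = m - t0, m + t0
        if start < 0 or end > len(x):
            continue                                   # skip frames that fall off the ends
        periods.append((m, x[start:end] * np.bartlett(end - start)))
    return periods

def overlap_add(periods, length):
    """Copy synthesis: place each windowed frame back at its pitch mark and sum."""
    y = np.zeros(length)
    for m, frame in periods:
        half = len(frame) // 2
        y[m - half:m + half] += frame
    return y
```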
We've just seen a new form of signal processing, which is pitch-synchronous.
It's essentially just the same as short-term analysis that we've seen before, except that we align the analysis frames to the fundamental periods of the signal and vary the analysis frame duration according to the fundamental period.
This video is called 'Pitch period', but we had to actually generalise that concept a little bit because the impulse responses of the vocal tract typically overlap in natural speech signals.
That makes it impossible to extract a single impulse response.
The generalisation we made was to extract overlapping frames and apply a tapered window in a way that makes reconstruction of the signal possible by simply using overlap-add.
This representation of the speech signal - as a sequence of pitch periods - is at the heart of the TD-PSOLA method, which can modify F0 and duration of speech signals in the time domain, using waveforms directly.
We're also going to be able to understand the interaction of the source and filter in the time domain, which is a process known as convolution.

Applying overlap-add techniques to pitch period waveforms allows the modification of F0 and duration without changing the phone identity. Note: we will talk a bit more about unit selection in the lecture, specifically target and join costs, but not the actual algorithm for selection (the Viterbi algorithm - though we will come back to this in the ASR modules).

Waveform generation is going to be achieved using a stored database of natural utterances, from which we can select diphones or sequences of diphones.
Then we'll concatenate those waveform fragments.
Each of them will have the correct pronunciation, but in general they won't have the right values of F0 and duration, that our front end has predicted.
So we need to modify F0 and duration of recorded natural speech.
We discovered that we can represent the vocal tract filter as its impulse response.
We called that the 'pitch period'.
We saw that we can extract these pitch periods from natural speech using pitch-synchronous overlapping analysis frames with a tapered window, a kind of short-term analysis.
We're now going to combine that idea with the overlap-add technique for concatenating waveforms to create a general method, called TD-PSOLA, which can modify F0 and duration of recorded speech.
The Time-Domain Pitch-Synchronous Overlap-and-Add algorithm operates in the time domain: on waveforms.
This has a potential advantage over explicitly fitting a source-filter model, because we don't need to make any assumptions about the form of filter.
For example, we don't need to decide how many coefficients to have in the difference equation.
We could also avoid having to solve for those filter coefficients: that's a potentially error-prone process.
TD-PSOLA uses pitch-synchronous short-term analysis to extract pitch periods from natural speech.
Then it uses overlap-add to construct a modified waveform from those pitch period building blocks.
That's not as powerful as explicitly using a source-filter model, though.
Because the filter response is represented in the time domain as its impulse response - the pitch periods - it's not in a form that we could easily modify, so TD-PSOLA can actually only modify F0 and duration.
Those are both source features.
It cannot modify the vocal tract filter: it will attempt to leave that unmodified.
Here's a reminder of how we can break down a natural speech waveform into its pitch periods.
Notice that they overlap in the original signal.
That's essential because we're using a tapered window on each analysis frame.
As we've seen before, we can overlap-add these pitch periods to reconstruct the original signal.
The tapered window that we apply to each pitch period makes sure that they do add back up to the original signal where they overlap.
Wherever these waveforms overlap, we just add them together, sample by sample.
We'll reconstruct the original waveform very closely; not exactly, but very closely.
This is just copy synthesis, but how about doing this?
What do you think that will sound like?
Pause the video.
In the lower waveform, the fundamental period is larger than in the original waveform, so it will have a lower F0.
So we'll perceive a lower pitch.
The duration has also been changed, though: it will be longer.
But importantly, the individual pitch periods have not been changed.
We're still playing back a sequence of impulse responses.
So the vocal tract filter is the same and we'll hear the same phone.
That's changing the fundamental period and therefore F0.
I can also change duration by either duplicating or deleting pitch periods.
So let's reduce the duration of this one.
I'll lose this pitch period and overlap-and-add like this.
Now I've got a signal with about the same duration as the original, but with only 6 fundamental periods where there used to be 7.
I've reduced F0 without changing duration or changing the vocal tract filter.
I can apply any combination of sliding the pitch periods a little closer together or a little further apart, with duplicating or deleting pitch periods, to gain independent control over F0 and duration.
Let's see how that works.
Let's increase F0.
What are we going to do? Slide them a little closer together to reduce the fundamental period.
Let's decrease F0: slide them a little further apart - increase the fundamental period.
Let's increase duration.
By duplicating a pitch period, we'll make some space for it.
Take one, copy it.
We've now made the duration longer.
I could repeat that; I could duplicate more and more of the pitch periods to make the duration longer and longer.
Let's decrease duration.
Pick a pitch period, lose it, and close the gap.
I can increase or decrease F0, increase or decrease duration, and combine those operations any way I like.
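Here's a much-simplified sketch of those operations, reusing pitch-period frames like the ones extracted earlier. The interface (`f0_scale`, `dur_scale`) is my own invention for illustration; a real TD-PSOLA implementation takes far more care over how frames are selected and placed.

```python
import numpy as np

def td_psola(frames, marks, f0_scale=1.0, dur_scale=1.0):
    """Simplified TD-PSOLA sketch. `frames[i]` is the tapered pitch-period frame
    centred on sample `marks[i]`. f0_scale > 1 raises F0 (marks move closer
    together); dur_scale > 1 lengthens (some frames get duplicated)."""
    marks = np.asarray(marks)
    # local fundamental period at each mark (extrapolate the last one)
    spacing = np.diff(marks, append=marks[-1] + (marks[-1] - marks[-2]))

    out_len = int(marks[-1] * dur_scale) + max(len(f) for f in frames)
    y = np.zeros(out_len)

    t = int(marks[0] * dur_scale)
    while t < int(marks[-1] * dur_scale):
        # choose the original frame whose time-scaled position is nearest to t;
        # this implicitly duplicates or drops frames when dur_scale != 1
        src = int(np.argmin(np.abs(marks * dur_scale - t)))
        frame = frames[src]
        start = max(0, t - len(frame) // 2)
        y[start:start + len(frame)] += frame           # overlap-add at the new mark
        t += max(1, int(spacing[src] / f0_scale))      # the new spacing sets the new F0
    return y
```

Calling it with f0_scale below 1 and dur_scale above 1, for instance, lowers F0 and lengthens the signal at the same time.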
Let's put that all together and do speech synthesis!
Finally, then here's the complete process of waveform generation.
Here's a database of diphone units.
Each row is a natural utterance that I've recorded.
Inside each of these diphones is a waveform fragment.
I've segmented these utterances into diphones.
I can select units from this database, and concatenate them to make a new word or utterance.
I'm not actually going to explain how we decide between different alternative ways of constructing the new utterance here.
That's a big problem to be solved in a longer course on speech synthesis.
The method's actually called 'unit selection', for obvious reasons.
Given this sequence of diphones from the database with the waveforms inside them, we now need to concatenate them.
Each diphone contains a waveform.
But it will be at whatever F0 and duration that the speaker said it at when we recorded the database.
That won't match our desired values, predicted from the front end.
So we need to apply TD-PSOLA to each diphone.
Inside these diphones are waveforms.
We're going to modify those waveforms using TD-PSOLA.
Here's a fragment of waveform.
We first break it down into its constituent pitch periods: replace it with these analysis frames, each of which is 2 fundamental periods long.
Then we apply TD-PSOLA to match the predictions from the front end.
Here, my front end predicted that F0 needs to actually be a little bit lower than what this speech was recorded at, and the duration needs to be a little bit longer.
So I just spread the pitch periods out a bit and duplicate one to make the duration correct.
This waveform now matches the predictions from the front end.
I'll finish with some examples of TD-PSOLA operating.
I'm going to just use a complete natural utterance as the starting point rather than concatenated diphones.
Here it is: 'Nothing's impossible.'
We can change F0, like this: 'Nothing's impossible.'
Or like this: 'Nothing's impossible.'
We can change duration; how about faster: 'Nothing's impossible.'
Or slower: 'Nothing's impossible.'
Those audio samples were actually made with my own very simple implementation of TD-PSOLA, and you'll hear a few artefacts, especially in the one that's slowed down.
A more sophisticated implementation would reduce those quite a lot, but there are still limits to how much you can modify duration and F0 away from the original values before TD-PSOLA degrades the signal.
That's the end of the material on waveform generation.
But just before we finish, there's one more topic to cover.
We've seen how the pitch period is a building block of speech, because it's the impulse response of the vocal tract filter.
It's a view of the filter in the time domain.
When we saw the source-filter model, we saw that we could combine the magnitude spectrum of the input signal with the magnitude frequency response of the filter to obtain the magnitude spectrum of the output signal.
We combine them simply by multiplying them in the magnitude spectrum domain.
In the time domain, the operation that combines the input signal with the filter's impulse response to produce the output signal is called 'convolution'.
We need to understand the relationship between multiplication in the magnitude spectral domain and convolution in the time domain.

A non-mathematical illustration of the equivalence of convolution (in the time domain), multiplication of magnitude spectra, and addition of log magnitude spectra.

We spent some time developing an understanding of the source-filter model.
We know that the filter can be described in several ways, including as its impulse response.
That led us to the idea of the pitch period as a building block for manipulating speech.
Now we're going to look at just how the source and filter combine in the time domain.
This operation is called 'convolution'.
First, a reminder of how the source-filter model operates, using the impulse response as our description of the filter.
Here's the filter, as its impulse response.
We need an input: an excitation signal.
Let's start with just one impulse.
If we input that to the filter, by definition, as output, we obtain the filter's impulse response.
If we put in a sequence of two impulses, we get two impulse responses out.
Put in three, get out three impulse responses, and so on.
I like to think of this as each impulse exciting an impulse response and writing that into the output at the appropriate time.
This impulse starts at 10 ms, and so it starts writing an impulse response into the output signal at 10 ms.
This impulse writes its impulse response at its time, and so on.
These impulses are quite widely spaced in time, and so each of these impulse responses has pretty much decayed to zero before we write the next impulse response in.
That's a very easy case.
If we decrease the fundamental period of the excitation, those impulse responses will be written into the output closer together in time, like this, or like this, and so on.
Now let's use an impulse train as the excitation.
The operation that combines these two waveforms to produce this waveform is called 'convolution'.
Convolution is written with a star symbol.
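As a concrete illustration of that picture, here's a sketch that excites a simple two-pole resonator with an impulse train using scipy; the F0, duration and resonance values below are arbitrary choices, and a single resonance is only a stand-in for a real vocal tract filter.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                                   # sampling rate (Hz)
f0, dur = 120.0, 0.5                         # arbitrary excitation F0 (Hz) and duration (s)
t0 = int(fs / f0)                            # fundamental period in samples

# impulse train excitation: one impulse at the start of each fundamental period
excitation = np.zeros(int(fs * dur))
excitation[::t0] = 1.0

# a single resonance at ~700 Hz (100 Hz bandwidth) standing in for the vocal tract filter
freq, bw = 700.0, 100.0
r = np.exp(-np.pi * bw / fs)
a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * freq / fs), r * r]

# convolution of the excitation with the filter's impulse response:
# each impulse "writes" one impulse response into the output
output = lfilter([1.0], a, excitation)
```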
Let's do some convolution, whilst we inspect the magnitude spectrum of each of these signals.
That's the magnitude spectrum of the excitation, of the filter, and of the output.
The waveforms are sampled at 16 kHz, which means the Nyquist frequency will be 8 kHz.
I've zoomed in the frequency axis so we can see a little bit more detail: I'm just plotting it up to 3 kHz.
I'm going to increase the fundamental frequency of the excitation.
I'd like you to watch, first of all, what happens to the magnitude spectrum of the excitation itself.
Just watch this corner.
That behaves as expected.
We have harmonics spaced at the fundamental frequency and all integer multiples of that.
So, as the fundamental frequency goes up, those harmonics get more widely spaced.
I'm going to vary F0 again now, but this time I want you to look at the magnitude spectrum of the speech.
I'm decreasing F0.
What do we see?
Well we saw those harmonics getting closer and closer together because they're multiples of the fundamental frequency.
But the envelope remains constant because that's determined by the filter.
If you're watching really closely, you might have seen the absolute level of this go up and down a small amount.
That's simply because the amount of energy in the excitation signal varies with more and more impulses per second.
That's not important here.
It's the way that these two magnitude spectra combine that we're trying to understand.
So I'll just vary F0 a few more times and have a look at the different magnitude spectra and try and understand how this and this combine to make this.
Increasing F0 ... and decreasing F0.
One more time: increasing it again.
Let's try something else.
Let's keep the excitation fixed and then let's vary the filter.
That's a different filter ... and that's another one.
I'll do that a few more times.
Look at the magnitude spectrum of the output and see what varies there.
Only the filter is changing.
The excitation is constant.
So, this time, the harmonic structure remained the same and the envelope followed that of the filter.
We're getting a pretty good understanding, then, of how these two things combine to make the spectrum of the output.
The two waveforms combine using convolution in the time domain.
The Fourier transform converts convolution into multiplication.
That means that the source and the filter can be combined by multiplying their magnitude spectra.
That's something we mentioned in passing back when we were talking about the source-filter model.
But we should be a bit more careful.
Look very closely at the axes on the plots of the magnitude spectrum.
You'll see that we're using a logarithmic axis.
You can see that because the units are dB.
Taking the logarithm converts multiplication into addition.
So, in fact, the operation that combines the log magnitude spectrum of the excitation with the log magnitude spectrum of the filter is addition.
That's a really elegant and simple way to combine source and filter in the frequency domain.
We simply add together their log magnitude spectra!
There's nothing in this diagram that requires this to be an impulse train and this to be the impulse response of a filter.
They could be any two waveforms, and the operation of convolution is still defined.
That means that this relationship is just generally true.
Convolution of any two waveforms in the time domain is equivalent to summation of their log magnitude spectra.
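We can check that relationship numerically with a short sketch. Any two waveforms will do, so these are just random noise; the FFT is zero-padded to the full convolution length so that linear and circular convolution coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)        # any waveform (standing in for the excitation)
h = rng.standard_normal(64)         # any other waveform (standing in for an impulse response)

n = len(x) + len(h) - 1             # length of the full linear convolution
y = np.convolve(x, h)               # convolution in the time domain

# magnitude spectra, all computed with the same FFT length n
X, H, Y = (np.abs(np.fft.rfft(s, n)) for s in (x, h, y))

# convolution in time  <->  multiplication of magnitude spectra ...
assert np.allclose(Y, X * H)
# ... which is addition of log magnitude spectra (here in dB)
assert np.allclose(20 * np.log10(Y), 20 * np.log10(X) + 20 * np.log10(H))
```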
Given a speech signal like this, and its log magnitude spectrum, we quite often want to recover the source or the filter from that signal.
For example, we'd like to recover this, which is the vocal tract frequency response (sometimes we use the more general idea of 'spectral envelope').
That means doing this equation in reverse.
Starting from this, we'd like to decompose it into a summation of two parts: one being the source, and one being the filter.
That's going to be much easier in the log magnitude spectrum domain than in the time domain, because reversing a summation is much easier than undoing a convolution.
We've seen that convolution in the time domain became multiplication in the magnitude spectral domain and then addition in the log magnitude spectral domain.
This has applications in Automatic Speech Recognition, where we'll want only the vocal tract filter's frequency response to use as a feature for identifying which phone is being said.
We'd like to get rid of the effects of the source.
We're going to develop a simple method to isolate the filter's frequency response without having to fit a source-filter model or find the fundamental periods.
The method starts with the log magnitude spectrum and makes a further transformation into a new representation called the 'cepstrum', where the source and filter are very easy to separate.


These videos introduce some issues in connected speech from a phonetics point of view. We won’t go over this material in great detail right now, but wanted you to have the content in case it helps you think about errors in TTS (for assignment 1).

Total video to watch in this section: 39 minutes

Connected speech differs from the citation form.

This video will introduce the concept of connected speech in relation to the idea of citation forms. We will look at spectrograms and waveforms of speech in both citation and connected forms and consider the differences between these forms.
Transcription is a classic tool of linguistic description and analysis, but it imposes an artificial segmentation onto the continuous stream of speech. In fact, speech sounds overlap and bleed into one another during production, like “coloured eggs crushed on a conveyer”. This metaphor captures the idea that speech sounds are not discrete but instead influence those around them while gradually fading from one to another. We can use acoustic tools such as spectrograms to visualize and describe speech sounds and to infer articulatory settings, but they often overlap to such an extent that it is difficult, if not impossible, to independently identify one speech sound from another.
The style of speech can affect how clearly the individual sounds can be identified. The clearest forms of words are known as citation forms, but these forms typically do not occur in natural connected speech.
Citation forms are words spoken in isolation, in their fullest, most emphatic phonetic form. These forms are often used to exemplify and describe allophonic variation that occurs in surface forms. These forms are somewhat artificial in that spoken language is not made up of a string of discrete words. Connected forms, on the other hand, are spoken in natural utterances. Their forms are highly variable and depend on multiple factors such as sentence structure, speech rate and speech style. The variability of speech in connected utterances is visible in the acoustic representations that we’ve been looking at in the acoustic phonetics module.
Here we have an example of two utterances of the word solicitor, once in isolation, and once in a connected utterance. Let’s use what we’ve learned in acoustic phonetics to compare and contrast these forms.
The first thing that I notice when I look at these two spectrograms is that the citation form is considerably longer than the connected form. Now let’s look more closely at the individual speech sounds we can identify in the acoustics. The first things that jump out to me are the fricatives. I can identify them because they are intervals of aperiodic vibration in the waveform, and regions of high amplitude across a diffuse range high in the frequency spectrum. We can see two regions of fricative noise in both the citation and the connected speech form.
We can continue to compare the acoustic forms of these words, relating it to what we expect to see based on the phonemic form. In the citation form, I can identify regions of the spectrogram that correspond to each of the phonemic segments in the phonemic form.
However, if we look at the connected speech form, it is not as easy to align a phonetic transcription with the acoustic output.
In fact, we might say that some of the sounds have been deleted or removed somewhere between the phonemic representation and the phonetic output. Here I have highlighted the second fricative in both productions in blue, as well as the following segments. In the citation form, the vowel following the fricative, highlighted in red, is clearly visible, while in the connected form it has been omitted entirely. The following /t/ closure, highlighted in green, is present in both productions, but it is much shorter in the connected speech form than in the citation form.
In future videos we will consider why these differences occur, and the linguistic factors that help us to understand them.

Connected speech forms are highly variable as the result of a number of processes that apply to consonants and vowels.

Connected speech forms are highly variable as the result of a number of processes that apply to consonants and vowels. This video will present some common connected speech processes with examples.
In order to fully describe a connected speech process, we need to know both the phonemic form as well as the surface phonetic form. We can then compare the two forms to each other, and describe the changes that we observe in terms of processes.
In order to describe the process that has taken place, we first need to recognize the sounds that it has applied to. For example here, the phonemic form of “green men” has an alveolar nasal followed by a bilabial nasal. However, we see that in the surface phonetic form, the alveolar nasal has become bilabial as well.
In order to fully describe a connected speech process, we need to know both the phonemic form as well as the surface phonetic production.
Some of these connected speech processes affect consonants, others affect vowels, and some affect both consonants and vowels. We’ll touch on each of these processes in turn, starting with assimilation.
Assimilation is a process by which a sound becomes more like an adjacent sound. It can affect any of the three articulatory parameters of consonants: voice, place or manner.
Here we have examples of each of these kinds of assimilation. In each case, the phonemic form of the phrase is given, followed by the phonetic realization. The assimilated sound and the conditioning environment are highlighted in red. In order to describe the assimilation, we need to identify the sound that has changed, and report the change that has occurred. For example, the voiced alveolar fricative /z/ in the phonemic form of the word “is” in “is Pete going?” is realized as a voiceless alveolar fricative, because it has assimilated in voicing to the following sound /p/.
Similar assimilation processes have taken place in each of the other examples provided here. Pause the video and describe the assimilation that has taken place in each case.
We can also describe assimilations according to the position of the conditioning environment with respect to the sound that is assimilating, that is, whether the conditioning sound comes before or after the sound that assimilates.
Here, the conditioning sounds have been highlighted in blue, while the assimilated sound is highlighted in red. In the perseverative assimilation examples, the predictive environments appear before the sound that is assimilated, while in the anticipatory assimilation examples they appear after the sound that is assimilated.
Now let’s return to our “green men” example from earlier. This is an example of anticipatory place assimilation of the alveolar nasal /n/ to the bilabial place of the /m/ that follows it.
The next process is deletion, where sounds present in the phonemic form are absent from the surface phonetic form. Again, deletion is defined in relation to the phonemic form. So here we see that two sounds have been deleted from the phonemic form to give the surface phonetic form that was produced. Notice that sometimes when vowels are deleted, adjacent consonants become syllabic, meaning that they serve as the nucleus of the syllable.
Consonant lenition, also known as weakening, is a process by which consonants become less strongly occluded. This might mean a phonemic stop that is realized as a fricative or an approximant in the phonetic form.
For example, a phonemic /g/ may be realized as either an approximant [ɰ] (as on the left) or a fricative [ɣ] (as on the right).
The final process we’re considering here is vowel reduction, where the quality of a vowel becomes more centralized with respect to the expected quality of the vowel in the citation form. This diagram shows a representation of the IPA vowel chart. The purple ovoid near the centre represents a reduced vowel space that we might observe if we were to plot first and second formant values of reduced vowels. We can also represent vowel reduction in transcription, as in the examples here. The reduced vowels are highlighted in blue. Reduction of vowels tends to occur in unstressed or otherwise non-prominent syllables.

Prosody is the combination of speech properties that break speech into units of time, indicate the boundaries of those units, and highlight certain constituents.

This video will introduce the notion of prosodic structure and relate it to the acoustic phonetic dimensions that we have been learning about throughout the course.
Prosody is the combination of speech properties that break up speech into units of time (phrases, sentences, paragraphs), indicate the boundaries of those units (into statements, questions, internal or terminal phrases), and highlight or emphasize certain constituents within that domain.
Prosody is often portrayed as the rhythm and melody of speech.
These aspects of linguistic structure are conveyed using various combinations of duration, fundamental frequency, and intensity. That is, we can measure acoustic characteristics of spoken language, and use that phonetic detail to describe the hierarchical structure that we have traditionally observed impressionistically. These acoustic dimensions contribute in various ways to the prosodic structure of an utterance.
As we might expect, phonetic duration is very important to descriptions of time units in language, but it can also contribute to phrase boundaries and emphasis. Similarly, speakers use fundamental frequency and intensity to indicate the prosodic structure of phrases, as well as to emphasize one or more constituents within those phrases.
The remainder of this video will focus on phrasing and prominence, and the ways we can describe them using acoustics.
The following slides will illustrate some of the ways that acoustics can reveal constituent structure in spoken language. In particular, we’ll see examples of how duration of various elements, movements in the F0 pitch trace and glottalization can indicate the location of phrase boundaries in English.
Let’s start with some examples of the ways phonetic output can vary with changes in constituent structure.
Here I have an example of a string of words that can be grouped into phrases to form a sentence in (at least) two ways. The first way is to make a phrase of the words “when you make hollandaise slowly”, while the second way groups the word ”slowly” with the phrase that comes after it. Listen to these two sentences, and begin to think about how you measure the differences between them using the acoustic tools we have at our disposal.
1. [When you make hollandaise slowly,] it curdles.
2. [When you make hollandaise,] slowly it curdles.
One thing you might notice here is that there is an appreciable interval of silence at the end of the phrase in each case. In the first example, the pause occurs after the word slowly, because the phrase ends after slowly. In the second example, the pause occurs before the word slowly, again because that is where the phrase ends. So, we can see that pauses, or silent intervals, can be an acoustic indication of phrase boundaries.
Now let’s look a bit more closely at the speech inside these phrases. Let’s consider just the word hollandaise in both sentences. In the first example, hollandaise appears within a phrase, while in the second example, it appears at the end of a phrase. This difference in phrasal position is reflected in the acoustic duration of the final vowel in the word.
In example 1, the final [ei] vowel is 117 ms long, while in the second example, the vowel is 234 ms -- nearly 120 ms longer than the first instance. This phenomenon is known as final lengthening and affects words that appear at the end of phrases, especially before a pause.
1. [When you make hollandaise slowly,] it curdles. – [ei] in hollandaise = 117 ms
2. [When you make hollandaise,] slowly it curdles. – [ei] in hollandaise = 234 ms
Another acoustic indicator of phrase boundaries is a sudden or drastic change in F0. The spectrograms shown here now include a line indicating the fundamental frequency aligned with the speech output. In each of these sentences, the f0 drops at the end of the phrase before starting to rise up again to start the next phrase. In the first sentence, this drop in f0 is aligned with the end of the word slowly, while in the second sentence, it drops off at the end of hollandaise. In each case, the sudden drop in f0 is aligned with the phrase boundary.
Finally, let’s look at the ends of both sentences. In this case, I am more interested in what is the same in the acoustic outputs than what is different. In both of the examples here, the sentences end with it curdles. The spectrograms shown here are limited to the word curdles only. Unsurprisingly, there are many similarities between these two utterances. Not only are the same words being spoken, meaning that we should expect the phones to be roughly the same, but they are also occupying more or less the same position in their respective phrases.
Notice that the second vocalic interval in each case is produced with glottalization, or a slowing of the vocal fold vibrations accompanied by irregularity of the wave cycle.
Now compare this phrase-final utterance of curdles (on the left) to a production that appears at the beginning of a phrase (on the right). Notice that when the word appears near the beginning of a phrase, instead of at the end, there is no glottalization in the second half of the word. This is because glottalization is another cue to a terminal phrase boundary, indicating the end of an utterance.
In the preceding slides we have seen that pauses, final lengthening, movement in F0, and glottalization can all indicate the end of a spoken phrase. It’s important to note that although these cues can indicate phrase boundaries, they may not always be present in all cases. It is also possible for these acoustic phenomena to indicate something other than a phrase boundary, such as glottalization as an allophone of /t/.
Now let’s consider the acoustic correlates of the second function of prosody: to make words more prominent. Prominence in language is sometimes also referred to as stress or emphasis.
Here we have two sentences that differ only in the word that is the most prominent. As a result, the meanings are quite different. The sentence in parentheses indicates what is being contrasted. In the first sentence the emphasis is on the word “A”: She didn’t earn an A.
First let’s compare the prominent “A” in the first sentence with the non-prominent “A” in the second sentence. We saw in the previous examples that sounds are lengthened at the ends of phrases, and we see the same here. Both instances are quite long at around 400 ms. But although they are similar in duration, they are also different in a number of respects. On the left, the phonation is regular throughout, while on the right, the vowel becomes creaky toward the end. On the left, the pitch trace shows a rise-fall-rise pattern in f0, while on the right, the pitch is level and then drops off.
In the second sentence, the emphasis is on the word ”earn”: She didn’t earn an A.
Again, we see that the emphasized production of earn has longer duration than the unemphasized version, and a rise in f0.
Prominence and constituent structure together are called prosodic structure and may reflect the relative predictability of elements in speech. Predictability is an indication of how easily a word can be guessed given its linguistic and real-world context, and is affected by word frequency, both in overall usage and in specific contexts. In both cases, the more frequent a word is, the more predictable it is.
For example, consider the following sentence with a word missing:
The children went outside to ______.
There are many words that could fill in the blank, but the word play is more predictable than the word bark, due to the relative frequency of each of these words overall and in this particular lexical context.
The more predictable a word is, the less likely it is to bear prominence in a sentence, and the less acoustic information is needed for it to be understood. For example, a phrase like I don’t know is very frequent and so requires relatively little acoustic information to be understood. In fact, it can even be understood with nothing but an intonational contour.

Glottalisation is known by many names including laryngealisation, creaky voice, creaky phonation and vocal fry.

This video presents acoustic characteristics of glottalization with spectrographic and waveform illustrations. At the end of this video, you should be able to recognize and identify glottalization in the spectrogram based on visual appearance and begin to talk about some of the linguistic structures that might influence whether or not it is present in English.
The articulatory description of a phonetic glottal stop implies that the vocal folds must make a complete closure during production, but in reality there are a number of different sounds that are perceived and transcribed as glottal stops that do not show complete closure. Here is an example from Hawai’ian. The spectrograms shown here demonstrate two very different realizations of a glottal stop phoneme. On the top, we see an articulation that was produced with complete closure. We can see that it was produced with closure because there is no shading in the spectrogram between the end of the vowel [e] and the start of the vowel [u]. On the bottom, we see an articulation that did not involve complete closure. Instead, what we see is a change in the rate of vocal fold vibration. This is visible in the spectrogram as a change in how widely spaced the vertical striations are. In the first vowel [e], we can see that the glottal pulses are evenly and regularly spaced. As we approach the region transcribed as a glottal stop, we can see that the striations become more widely spaced before returning to a more regular vibration cycle for the following vowel. In the case of Hawai’ian, both of these phonetic sounds indicate a phonemic contrast. They are allophones of a glottal stop phoneme. However, this change in the vibratory pattern of the vocal folds does not always indicate contrast in this way (or at all). In phonetic terms, we refer to this change in vocal fold vibration as glottalization.
Glottalization is known by many names including laryngealization, creaky voice, creaky phonation and vocal fry. There are several types of glottalization, but it is generally characterised by irregular and widely-spaced glottal pulses. It often, though not always, provides the auditory impression of ‘a rapid series of taps, like a stick being run along a railing’ (Catford, 1964).
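As a rough illustration of ‘irregular and widely-spaced glottal pulses’, here is a sketch of how you might flag candidate creaky regions from a list of glottal pulse times (for example, exported from Praat). The thresholds and the toy pulse sequence are assumptions chosen only to make the idea concrete; this is not a validated creak detector.

```python
# Sketch (assumption-laden): flagging possible glottalisation from glottal
# pulse times. Creak tends to show long and irregular inter-pulse intervals,
# so we look at local period length and period-to-period variation (jitter).
import numpy as np

def creak_candidates(pulse_times, min_period=0.012, jitter_threshold=0.2):
    """pulse_times: glottal closure instants in seconds (e.g. from a Praat
    PointProcess). Returns indices of periods that look creaky: unusually
    long, or very different from the preceding period."""
    periods = np.diff(pulse_times)
    flags = []
    for i in range(1, len(periods)):
        jitter = abs(periods[i] - periods[i - 1]) / periods[i - 1]
        if periods[i] > min_period or jitter > jitter_threshold:
            flags.append(i)
    return flags

# Example: regular 100 Hz pulses followed by slow, irregular ones.
t = np.concatenate([np.arange(0, 0.2, 0.01),
                    0.2 + np.cumsum([0.02, 0.035, 0.025, 0.04])])
print(creak_candidates(t))   # flags the final, creaky-looking periods
```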
Glottal stops and glottalization play many roles in many languages. They can convey meaningful contrast, either as a glottal stop phoneme or as contrastive phonation on consonants or vowels; they can be correlated with tonal contrasts; or they can indicate the boundaries of prosodic phrases.
Although /ʔ/ is not a phoneme in English, phonetic glottal stops perform a variety of roles as well. In many varieties of English [ʔ] is an allophone of /t/. In the example I have provided here, the phonetic form of the word ‘kitten’ is produced with a glottal stop allophone for the /t/ phoneme. This is particular to my variety of English, so you might not produce a glottal stop allophone there depending on the variety of English that you speak. Nevertheless, if you listen carefully, you should be able to hear the phonetic glottal stop.
Phonetic glottal stops may also appear before vowels at the beginnings of prominent words, such as at the start of an utterance, or in stressed positions in the phrase. The example I have provided here is a transcription of the word “apple”. Try saying “apple” with and without a glottal stop at the start. Can you feel the closure at your glottis?
Some varieties of English also have glottal stops or glottalization appearing before voiceless oral stop closures. This is sometimes referred to as glottal reinforcement.
And finally, English also uses glottal stops and glottalization to indicate the ends of phrases.
In English, glottalization is often known as creak or creaky voice. The next few slides will show some examples of creak in English utterances for you to examine. Each of the figures has the same format. On the left, the orthographic form of the speech has been provided. The bottom of each figure shows a spectrogram of the speech. Above the spectrogram is a waveform. The waveform has been taken from the interval indicated by the angled line segments connecting the waveform diagram to the spectrogram.
The spectrogram on this slide was made from a recording of the phrase “trodden road”. The waveform corresponds to the area of the spectrogram outlined in orange. This portion of the utterance shows glottalization, indicated by the widely and irregularly spaced glottal pulses, which correspond to a low f0. We can also see the irregularity of the vibration in the waveform, where there is a somewhat cyclical, repetitive pattern, but with clear irregularities.
Here is another example of glottalization in English. Again, the waveform corresponds to the area outlined in orange. This time, the glottal pulses are more regularly spaced in time, but with a much lower rate of vibration than other nearby voiced intervals in the preceding vowels, outlined in blue.
Here we have another example of glottalization. In this case, the vocal fold vibration is so irregular that there doesn’t even seem to be an identifiable cycle in the waveform.
Notice that in this case, the glottalization continues throughout much of the utterance. It is not limited to the highlighted area; in fact, it overlaps with a number of other phones. Notice also that the glottalization doesn’t always persist throughout the entire duration of a phone. For example, the first vowel starts off with regularly spaced glottal pulses, then transitions into irregularly spaced pulses before the oral stop closure. This illustrates the need for a separate annotation tier to capture the alignment of glottalisation relative to other speech gestures as they are occurring.
In this final example of glottalization in English, the waveform oscillations vary from cycle to cycle both in the length of the period and the amplitude of the wave. Once again, we can see that the magnified interval is not the only instance of creak in the utterance.
In fact, there appear to be (at least) two sources of glottalization in this case. At the end of the utterance, we see an extended interval of creak, likely indicating the end of a phrase. But we can also see shorter intervals of creak, such as at the start of the prominent word “entirely” (and at the end of the previous word, outlined in blue).
In this video we have introduced various functions of glottalization, and noted that it can take a number of phonetic forms. We have looked at a number of spectrograms and waveforms showing glottalization in English utterances, and pointed out that it can overlap with other articulations of speech in a way that necessitates another layer of annotation. We have also seen examples of glottalization from more than one linguistic source relating to the prosodic structure of the phrase: prominence and prosodic phrase boundaries.
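If you would like to make figures like the ones in this video for your own recordings, the sketch below shows one way to plot a magnified waveform interval above a spectrogram using librosa and matplotlib. The file name, the interval times and the frequency limit are placeholders that you would replace with your own values.

```python
# Sketch: reproducing the figure layout used in these slides (waveform above
# a spectrogram) for an interval of interest. 'utterance.wav' and the interval
# times are placeholders, not files provided with the course.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)
start, end = 1.20, 1.45                      # interval to magnify, in seconds
segment = y[int(start * sr):int(end * sr)]

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 5))
ax_wave.plot(np.arange(len(segment)) / sr + start, segment)
ax_wave.set(title="Waveform (magnified interval)", xlabel="Time (s)")

S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="hz", ax=ax_spec)
ax_spec.set(title="Spectrogram (whole utterance)", ylim=(0, 5000))
plt.tight_layout()
plt.show()
```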

Accents are language varieties that differ from one another only in pronunciation.

This video just has a plain transcript, not time-aligned to the video.

This video will introduce you to ways in which accents of English can differ, so that you can figure out how to establish a set of symbols to use to transcribe standard, non-standard and non-native varieties.
First, let’s start with the terms “accent” and “dialect”. While they are used interchangeably in casual speech, British linguists use these terms with precise technical differences. Accents are language varieties that differ from one another only in pronunciation, while dialects differ from one another in their grammars. Of course, this distinction is an idealization, as people who pronounce things differently from one another also tend to have grammatical and other structural differences in their languages as well.
As an aside, note that in linguistics, accent may also refer to aspects of intonation and syllable prominence – that meaning is a completely unrelated use of the term.
As phoneticians, we are mainly concerned here with differences in pronunciation.
Among accents of English, differences in pronunciation are systematic, meaning that we can describe how and where they will occur. These differences include both phonetic (or realizational) and phonological differences.
In realizational or phonetic differences, the same vowel phoneme is realized differently in different language varieties. That is, it has a different quality, but is otherwise the same as other varieties in terms of its distribution and function. The chart presented here shows examples of vowel phonemes that have realizational differences in US and UK varieties of English.
To hear an example of these vowels in both varieties, visit the Cambridge English Dictionary website, and click on the US and UK reference pronunciations for some or all of the words in this table.
There are three types of phonological differences between accents: lexical, distributional, and systemic.
Lexical differences are those that apply to specific lexical items. An example of this is the difference between the words “tomato” and “potato” in American and British varieties of English.
Notice that the difference here is not just phonetic (though they do result in differences in pronunciation of the words). Lexical differences occur at the level of the phoneme, meaning that they are not predictable by phonological environment.
Sometimes different varieties will have the same phoneme, but different patterns of distribution determining how and where that phoneme is realized.
Take the phoneme /ɹ/, for instance. Many varieties of British English are non-rhotic, while most varieties of American English are rhotic. In non-rhotic varieties of English, the phoneme /ɹ/ is not pronounced at the ends of syllables, but it does appear at the beginning of syllables and words. In contrast, in rhotic varieties, syllable structure does not affect whether or not the /ɹ/ is pronounced.
A further distributional difference between rhotic and non-rhotic varieties is that non-rhotic varieties often insert a phonetic [ɹ] when a syllable final vowel is followed by a syllable initial vowel in the next word. This is called linking-r or intrusive-r, and does not apply to rhotic varieties of English.
Here is another example of distributional differences. In some varieties of American English, the phonemes /ɪ/ and /ɛ/ are both realized as [i] before /n/ (essentially a loss of contrast between these two vowel categories). On the other hand, the distinction is maintained in other phonological environments.
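One way to make distributional differences concrete is to write them down as small rules. The sketch below encodes the rhoticity pattern described above as a toy function; the syllable-position labels are simplified assumptions for illustration, and a real system would also need to handle linking and intrusive r.

```python
# Toy sketch of a distributional difference: whether /ɹ/ is realised depends
# on syllable position in non-rhotic accents but not in rhotic ones. The
# 'onset'/'coda' labels are simplifying assumptions, not course-provided code.
def realise_r(position, rhotic):
    """position: 'onset' or 'coda'; rhotic: True for e.g. General American."""
    if rhotic:
        return "ɹ"                               # pronounced in all positions
    return "ɹ" if position == "onset" else ""    # dropped in syllable codas

print(realise_r("coda", rhotic=False))   # '' -> e.g. 'car' with no final r
print(realise_r("coda", rhotic=True))    # 'ɹ'
```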
The third and final type of difference between accents can be described in terms of the number of phonological contrasts (phonemes) that exist in different varieties. This table shows some examples of vowel categories from three different English varieties.
By examining this table, we can see, for example, that Standard Southern British English and General American English both maintain 3 different vowel phonemes for the lexical set poll, pull, book, buck, while Standard Scottish English only uses 2 categories.
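Here is a hedged sketch of how such systemic differences might be stored for a lexicon: each accent maps words from the lexical set to vowel category labels, and the number of distinct labels tells you how many contrasts that accent maintains. The specific groupings below are arbitrary placeholders, not the actual table from the video; the point is only that the number of contrasts can differ.

```python
# Sketch: encoding systemic accent differences as word -> vowel-category maps.
# Category labels and the SSE grouping are placeholders for illustration only.
VOWEL_CATEGORY = {
    "SSBE": {"poll": "V1", "pull": "V2", "book": "V2", "buck": "V3"},
    "GA":   {"poll": "V1", "pull": "V2", "book": "V2", "buck": "V3"},
    "SSE":  {"poll": "V1", "pull": "V2", "book": "V2", "buck": "V2"},
}

def n_contrasts(accent):
    """Number of distinct vowel categories an accent uses for this lexical set."""
    return len(set(VOWEL_CATEGORY[accent].values()))

for accent in VOWEL_CATEGORY:
    print(accent, n_contrasts(accent))   # SSBE/GA: 3 contrasts, SSE: 2
```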
Note that some accent differences don’t fit neatly into the lexical/distributional/systemic categorization scheme just described. Here is an example of just such an exception. In Scottish English, vowel length is determined in part by lexical differences, and in part by morpho-phonological conditioning.
So the stressed vowel in bible is short, while the stressed vowel in libel is long. This is a lexical specification that appears in monomorphemic bi-syllabic words.
However, not all appearances of short and long allophones are lexically conditioned. Short allophones also predictably occur before voiceless stops and fricatives as well as before voiced stops, nasals, and /l/.
Long allophones occur before voiced fricatives and /ɹ/, as well as in open syllables, and also in morphologically complex words such as “agreed” where the past tense morpheme /-d/ results in the production of a long vowel.
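As a rough summary of the conditioning just described, the sketch below encodes a simplified version of the Scottish Vowel Length Rule as a function of the following context and the presence of a morpheme boundary. It ignores the lexical exceptions such as bible/libel, and the context categories are simplified assumptions.

```python
# Rough sketch of the Scottish Vowel Length Rule conditioning just described.
# Context labels are simplified assumptions; lexical exceptions are ignored.
def svlr_length(following, morpheme_boundary=False):
    """Return 'long' or 'short' for a stressed vowel in Scottish English."""
    long_contexts = {"voiced_fricative", "r", "open_syllable"}
    if morpheme_boundary:                 # e.g. 'agree' + '-d' -> long vowel
        return "long"
    return "long" if following in long_contexts else "short"

print(svlr_length("voiceless_stop"))                        # short
print(svlr_length("voiced_fricative"))                      # long
print(svlr_length("voiced_stop", morpheme_boundary=True))   # long ('agreed')
```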


Reading

Jurafsky & Martin – Section 8.4 – Diphone Waveform Synthesis

A simple way to generate a waveform is by concatenating speech units from a pre-recorded database. The database contains one recording of each required speech unit.
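To make the idea concrete, here is a minimal, hedged sketch of concatenating pre-recorded units with a short linear cross-fade at each join, which is one simple form of overlap-add. Real diphone synthesis joins units pitch-synchronously (as in the videos on overlap-add and TD-PSOLA); this sketch just cross-fades a fixed number of samples, and the sine-wave ‘units’ are stand-ins for waveforms taken from a diphone database.

```python
# Illustrative sketch only: concatenate waveform units with a linear
# cross-fade at each join (a simple, non-pitch-synchronous overlap-add).
import numpy as np

def concatenate(units, overlap=80):
    """units: list of 1-D numpy arrays; overlap: cross-fade length in samples."""
    out = units[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for u in units[1:]:
        u = u.astype(float)
        # blend the end of the output with the start of the next unit
        out[-overlap:] = out[-overlap:] * (1 - fade) + u[:overlap] * fade
        out = np.concatenate([out, u[overlap:]])
    return out

# Toy usage: two sine-wave 'units' standing in for diphone waveforms.
sr = 16000
t = np.arange(0, 0.1, 1 / sr)
units = [np.sin(2 * np.pi * 120 * t), np.sin(2 * np.pi * 150 * t)]
wav = concatenate(units)
print(len(wav))
```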

Taylor – Section 12.7 – Pitch and epoch detection

Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.
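To get a feel for why pitch detection is harder than it looks, here is a deliberately naive autocorrelation F0 estimator for a single frame. It is not the method described by Taylor; on real speech it will make octave errors, fail on creaky or unvoiced frames, and need proper frame-by-frame tracking, which is exactly the point of the reading.

```python
# Naive autocorrelation F0 estimator for one frame (illustration only).
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Return a crude F0 estimate (Hz), or None if no clear periodicity."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    # require the peak to be reasonably strong relative to lag-0 energy
    return sr / lag if ac[lag] > 0.3 * ac[0] else None

sr = 16000
t = np.arange(0, 0.04, 1 / sr)
frame = np.sin(2 * np.pi * 120 * t)     # synthetic 120 Hz 'voiced' frame
print(estimate_f0(frame, sr))           # close to 120 Hz on clean input
```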

This lab session will give you a chance to get some extra help on your assignment. The following are suggestions on things you might like to prepare before the lab for feedback.

Get some writing advice

Bring a sample of your writing (150-200 words). Be ready to share it with your tutor for feedback. It’s also good practice for you to get feedback on your writing from your peers, so feel free to do this in the lab or amongst yourselves at another time.

Generate some TTS error samples

Create some example synthetic speech, save the audio and spectrogram for your best example in each of the following categories, and come to the lab ready to explain the errors to your tutor and classmates:

  1. Waveform generation mistakes, in which the front end did not make any mistakes, yet the synthetic waveform contains an audible problem.
  2. Synthetic speech in which there is a clearly visible join in the spectrogram, yet it is not audible.
  3. Mistakes in the TTS front-end
  4. Other things that sound weird!

You can use the speech zone forum on assignment 1 to ask for help.  You might find some of your questions are already answered in previous posts.

That’s the end of the Text-To-Speech part of the course. The last video of this module was a pointer forward into the Automatic Speech Recognition part of the course. It made it clear that all of our knowledge about speech signals, and in particular about separating the source and filter, will continue to be very useful.

What you should know

Note that Simon says in the videos that we don’t cover unit selection in this course. That is true for the videos, but we do cover it in the lectures, readings and assignment.

  • Diphone: why use diphones? how does this relate to coarticulation? what goes into a diphone database?
  • Waveform concatenation, Overlap-add, Pitch period:
    • What are potential issues for concatenating waveforms? i.e. when do we get ‘glitches’ and ‘pops’?
    • Why are discontinuities at joins a problem?
    • How do overlap-add and pitch-synchronous concatenation help?
  • TD-PSOLA
    • What can you manipulate with TD-PSOLA?
    • How does TD-PSOLA increase/decrease F0?
    • How does TD-PSOLA increase/decrease duration?
    • How does this relate to impulse responses? i.e. why doesn’t it change the actual phone/spectral envelope?
  • Unit selection: Target and join costs (lecture and J&M 8.5) – we haven’t covered the Viterbi algorithm in Module 6, but it will come up again in the ASR modules for this course.
  • Convolution: convolution in the time domain = multiplication in the frequency domain (i.e. see the application of filters in the frequency domain – module 4, e.g. low/band/high pass filters). You should aim to understand this at a conceptual level; see the numerical sketch after this list.
  • Connected speech/citation speech:
    • identify examples of connected speech processes: assimilation, lenition, deletion, vowel reduction, as discussed in the lectures/videos in reference to potential rules helping us to generate correct pronunciations.
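The following numerical check of the convolution statement above is a generic numpy demonstration, not course-provided code: convolving two signals in the time domain gives (to within rounding error) the same result as multiplying their spectra and transforming back.

```python
# Numerical check: convolution in time = multiplication in frequency.
import numpy as np

x = np.random.randn(64)          # e.g. an excitation signal
h = np.random.randn(16)          # e.g. a filter's impulse response

time_domain = np.convolve(x, h)                 # linear convolution, length 79
N = len(time_domain)
freq_domain = np.fft.irfft(np.fft.rfft(x, N) * np.fft.rfft(h, N), N)

print(np.allclose(time_domain, freq_domain))    # True (up to rounding error)
```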

Key Terms

  • diphone, diphone database
  • concatenation
  • concatenative synthesis
  • waveform, waveform generation
  • diphone synthesis
  • unit selection
  • coarticulation
  • overlap-add
  • pitch period
  • TD-PSOLA
  • discontinuity
  • join, join cost
  • target, target cost
  • convolution
  • connected speech
  • assimilation
  • lenition
  • deletion
  • vowel reduction