Waveform concatenation

Concatenation of waveforms is a simple way of making synthetic speech, but we need to take care about how we do it.

We've defined the diphone and decided that it's a suitable unit for building new spoken utterances.
That's because it captures co-articulation, at least locally.
If we had a recorded database of speech containing at least one example of every possible diphone, we could simply concatenate diphone units (or sequences of diphone units) taken from those recordings, to create any new utterance we like.
Now we're going to make a first attempt at waveform concatenation to discover whether it really is that simple.
Here are two diphone sequences.
On the left is k-æ and on the right is æ-t.
I'm going to use them to make a synthetic version of the word 'cat'.
So I'll simply concatenate them: join them together, like that.
It sounds like this: 'cat'
Let's hear that again: 'cat'
There's a distinct click in the middle.
Let's see why that is.
If we zoom in on the concatenation point (often simply called 'the join'), we can see a big discontinuity there.
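To make the source of the click concrete, here is a minimal sketch of naive concatenation in Python with NumPy. The two signals are synthetic sine waves standing in for the diphone recordings, not real speech, and the function names, sample rate and amplitudes are all assumptions chosen for illustration. It compares the sample-to-sample jump at the join with the largest jump found anywhere inside the first waveform.

```python
import numpy as np


def naive_concatenate(left, right):
    """Join two waveforms end to start, with no adjustment at all."""
    return np.concatenate([left, right])


def join_jump(left, right):
    """Absolute amplitude difference between the two samples that meet at the join."""
    return abs(float(right[0]) - float(left[-1]))


# Synthetic stand-ins for two diphone recordings (assumed values, not real speech).
sample_rate = 16000
t = np.arange(400) / sample_rate
left = 0.8 * np.sin(2 * np.pi * 120 * t)         # happens to end near zero
right = 0.4 * np.sin(2 * np.pi * 120 * t + 1.5)  # happens to start near its peak

joined = naive_concatenate(left, right)

jump_at_join = join_jump(left, right)
# Largest step between neighbouring samples inside the left waveform.
largest_normal_step = float(np.max(np.abs(np.diff(left))))

print(f"jump at join: {jump_at_join:.3f}")
print(f"largest step inside left waveform: {largest_normal_step:.3f}")
```

In this toy case the jump at the join is many times larger than any step within the waveform itself, which is the kind of discontinuity that is heard as a click.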
We need to be a little bit more careful about where and how we make the join.
One option would obviously be to join the waveforms only at a point where they have the same amplitude, so we don't get the discontinuity.
One such point is where they both cross zero: at a zero crossing.
Here's a new version then, where the waveforms are joined at a zero crossing.
That's the zero crossing.
We've got no sudden discontinuity.
It sounds like this: 'cat'.
The click has gone, which is good, but the join is still audible.
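A zero-crossing join can be sketched as follows, again in Python with NumPy and again on synthetic sine waves rather than real diphones. The function names are hypothetical. One assumption goes beyond what is said above: the sketch only uses crossings where the waveform is rising, so that both sides of the join are heading in the same direction.

```python
import numpy as np


def rising_zero_crossings(x):
    """Indices i where the waveform goes from negative (x[i-1]) to non-negative (x[i])."""
    x = np.asarray(x)
    return np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1


def join_at_zero_crossing(left, right):
    """Trim left to its last rising zero crossing and right to its first, then join.

    Returns the joined waveform and the index of the join within it.
    """
    left_crossings = rising_zero_crossings(left)
    right_crossings = rising_zero_crossings(right)
    if len(left_crossings) == 0 or len(right_crossings) == 0:
        raise ValueError("no rising zero crossing found in one of the waveforms")
    cut_left = int(left_crossings[-1])    # left[:cut_left] ends just below zero
    cut_right = int(right_crossings[0])   # right[cut_right:] starts at or just above zero
    joined = np.concatenate([left[:cut_left], right[cut_right:]])
    return joined, cut_left


# Synthetic stand-ins for two diphone recordings (assumed values, not real speech).
sample_rate = 16000
t = np.arange(400) / sample_rate
left = 0.8 * np.sin(2 * np.pi * 120 * t)
right = 0.4 * np.sin(2 * np.pi * 120 * t + 1.5)

joined, join_index = join_at_zero_crossing(left, right)
jump_at_join = abs(float(joined[join_index]) - float(joined[join_index - 1]))
print(f"jump at join: {jump_at_join:.3f}")
```

The jump at the join is now no bigger than an ordinary step between neighbouring samples, so there is no click. Nothing in this sketch looks at the spacing of the fundamental periods, though.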
One reason the join might be audible is the following problem.
I've zoomed back out a little bit, so we can clearly see the fundamental periods of the signal.
Those are coming from the sound source, which of course here is the vocal folds, because this is voiced speech.
The periodicity here does not properly align as we transition from one diphone to the next.
The periods aren't evenly spaced.
So we can do even better than this, even better than joining it at zero crossings.
We can join in a way that is called 'pitch-synchronous'.
That's going to involve adjusting the end point of the first diphone or the start point of the second diphone to keep the periodicity intact across the join.
To make pitch-synchronous joins, we need to annotate the waveform with a single consistent moment in time within each fundamental period that corresponds to the activity of the vocal folds.
Since we don't have access to the vocal folds of the speaker, we can only estimate these moments in time from the waveform.
That process is known as 'pitch marking'.
We're not going to describe the algorithm for doing pitch marking here.
Just assume that there is one and it's possible.
We'll have an algorithm that will find the fundamental periods and place a mark at a consistent point in each of them.
We should give a slightly more careful definition of some terms.
The term 'epoch' is used to indicate the moment of closure of the vocal folds.
That's a physical event during speech production, and we don't have direct access to that from the waveform.
A 'pitch mark' (which is what these are) is our estimate of the epoch, annotated onto a speech waveform.
In the literature, you'll find the terms 'epoch' and 'pitch mark' sometimes used interchangeably, even though they're not quite the same thing.
Here's our final way of concatenating the two waveforms, this time making sure the join is placed in a way that preserves the fundamental periodicity.
That is, so that the pitch marks are correctly spaced around the join.
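Assuming pitch marks are already available as sample indices, one simple way to place the join is to cut both waveforms exactly on a pitch mark. This is a sketch under that assumption, not necessarily the method used in the demonstration. The function name and the pitch mark values below are made up for illustration, and no pitch marking algorithm is implemented here.

```python
import numpy as np


def join_pitch_synchronously(left, left_marks, right, right_marks):
    """Cut left at its last pitch mark and right at its first, then join.

    The first pitch mark of the right waveform lands exactly where the last
    pitch mark of the left waveform was, so the spacing of pitch marks is
    kept intact across the join.
    Returns the joined waveform and the pitch marks of the joined waveform.
    """
    left_marks = np.asarray(left_marks)
    right_marks = np.asarray(right_marks)
    if len(left_marks) == 0 or len(right_marks) == 0:
        raise ValueError("both waveforms need at least one pitch mark")
    cut_left = int(left_marks[-1])
    cut_right = int(right_marks[0])
    joined = np.concatenate([left[:cut_left], right[cut_right:]])
    # Shift the right-hand marks so they refer to positions in the joined waveform.
    shifted_right_marks = right_marks - cut_right + cut_left
    joined_marks = np.concatenate([left_marks[:-1], shifted_right_marks])
    return joined, joined_marks


# Placeholder waveforms and hypothetical pitch marks, one every 133 samples.
left = np.zeros(400)
right = np.zeros(400)
left_marks = [100, 233, 366]
right_marks = [50, 183, 316]

joined, joined_marks = join_pitch_synchronously(left, left_marks, right, right_marks)
print(joined_marks)           # pitch marks in the joined waveform
print(np.diff(joined_marks))  # spacing between consecutive pitch marks
```

Because the cut on each side falls on a pitch mark, the distance between the last mark before the join and the first mark after it is a whole fundamental period of the left waveform.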
If we zoom in, we can now see that the fundamental period is consistent across the join.
We can barely see where the join is.
It sounds like this: 'cat'.
The only evidence remaining of the join really is this change in amplitude.
We can still hear it a little.
We could still do better than that.
But this is our best way yet of concatenating two waveforms.
Let's compare the three methods.
The very naive way: 'cat'.
There was a big click because we didn't take care as to where we joined the waveforms.
Joining at a zero crossing: 'cat' - a less perceptible join.
Joining pitch-synchronously: 'cat'.
You might not hear much difference between the last two in this example, but pitch-synchronous joins are, on average, slightly less perceptible than zero-crossing joins.
That will matter much more when we're constructing a longer utterance with many waveform concatenation points.
We've learned, then, that care is needed when concatenating waveforms in order to minimise the chance that the listener will hear the join.
We can still do better.
You'll have noticed that the two diphone sequences I joined here had noticeably different amplitudes, and that was audible.
The fundamental frequency might also suddenly change at a concatenation point, and that will be audible too.
We need to develop a slightly more powerful way to manipulate speech that can solve these problems, to minimise the chance that a join is audible.