Diphone

Phones are not a suitable unit for waveform concatenation, so we use diphones, which capture co-articulation.

It's finally time to think about how to make synthetic speech.
The obvious approach is to record all possible speech sounds, and then rearrange them to say whatever we need to say.
So then the question is, what does 'all possible speech sounds' actually mean?
To make that practical, we'd need a fixed number of types so that we can record all of them and store them.
I hope it's obvious that words are not a suitable unit, because new words are invented all the time.
We can't record them all, for the same reason we could never write a dictionary that has all of the words of the language in it.
We need a smaller unit than the word: a sub-word unit.
So how about recordings of phonemes?
Well, a phoneme is an abstract category, so when we say it out loud and record it, we call it a 'phone'.
As we're about to see, phones are not suitable, but we can use them to define a suitable unit called the diphone.
Consider how speech is produced.
There is an articulatory target for each phone that we need to produce, to generate the correct acoustics, and that's so the listener hears the intended phone.
Speech production has evolved to be efficient, and that means minimising the energy expended by the speaker and maximising the rate of information.
That means that this target is only very briefly achieved before the talker very quickly moves on towards the next one.
I'm going to illustrate that using the second formant as an example of an acoustic parameter that is important for the speaker to produce, so that the listener hears the correct phone.
These are just sketches of formant trajectories, so I'm not going to label the axes too carefully.
This is, of course, 'time'.
The vertical axis is formant frequency.
To produce this vowel, the speaker has to reach a particular target value of F2.
It's the same in both of these words, because it's the same vowel.
That target is reached sufficiently accurately and consistently for the listener to be able to hear that vowel.
But look at the trajectory of that formant coming in to the target and leaving the target.
Near the boundaries of this phone, there's a lot of variation in the acoustic property.
Most of that variation is the consequence of the articulators having to arrive from the previous target or start moving towards the next target.
In other words: the context in which this phone is being produced.
For example, here the tongue was configured to produce [k] and then had to move to the position for the vowel [æ] and it was still on the way towards that target when voicing had commenced.
In other words, the start of the vowel is strongly influenced by where the tongue was coming from.
The same thing happens at the end.
The tongue is already heading off to make the next target for the [t] before the vowel is finished.
So the end of the vowel is strongly influenced by the right context.
Imagine we wanted to re-use parts of these recordings to make new words that we hadn't yet recorded.
Could we just segment them into phones and use those as the building blocks?
Let's try that.
Clearly, the [æ] in [k æ t] and the [æ] in [b æ t] are very different because of this contextual variation near the boundaries of the phone.
That means that we can't use an [æ] in [k æ t] to say [b æ t]: these are not interchangeable.
We're looking for sounds to be interchangeable so we can rearrange them to make new words.
There's a very simple way to capture that context-dependency, and that's just to redefine the unit from the phone to the diphone.
Diphones have about the same duration as a phone, but their boundaries are in the centres of phones.
They are the units of co-articulation.
Now look what happens.
The æ-t diphone in [k æ t] is very similar to the one in [b æ t], and these two sound units are relatively interchangeable.
We could use the æ-t diphone from [k æ t] to say [b æ t] or any other word involving æ-t.
We've simply redefined the units, from phones to diphones.
Diphones are units of co-articulation.
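To make the redefinition concrete, here is a minimal sketch of turning a phone sequence into the diphone sequence needed to say it. The function name, the ASCII phone labels (e.g. "ae" for [æ]), and the use of "sil" for utterance-edge silence are all illustrative assumptions, not a standard scheme.

```python
def phones_to_diphones(phones):
    """Convert a phone sequence into the diphones needed to say it.

    Silence ('sil') is added at both utterance edges, since every
    utterance starts and ends in silence. Phone names are
    illustrative, not a standard set.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

# [b ae t] needs four diphones; each boundary falls mid-phone
print(phones_to_diphones(["b", "ae", "t"]))
# → ['sil-b', 'b-ae', 'ae-t', 't-sil']
```

Note that the æ-t diphone appears here exactly as in the [k æ t] example: the same unit serves both words.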
Now, of course, co-articulation spreads beyond just the previous or next phone.
This is just a first-order approximation, but it's much better than using context-independent phones as the building blocks for generating speech.
These two phones sound very different, because of context: they're not interchangeable.
They're not useful for making any new word with an [æ] in it.
In contrast, these two diphones sound very similar: they are relatively interchangeable.
We can use them to make new words requiring the [æ t] sequence.
They capture co-articulation.
Obviously, there are going to be rather more types of diphone than there are of phone.
If we have 45 phone classes, we're going to need 46^2 classes of diphone, roughly.
Not all are possible, but most are.
Why 46?
Because we need one extra class for silence: the IPA doesn't include it, but diphones involving silence occur at every utterance boundary.
There are a lot more diphone types than phone types, but it's still a closed set, and it's very manageable.
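The arithmetic can be checked in a couple of lines. The placeholder phone names are made up; only the counts matter.

```python
from itertools import product

# 45 phone classes (placeholder names) plus one extra class for silence
classes = [f"p{i}" for i in range(45)] + ["sil"]

# every ordered pair of classes is a potential diphone type
diphone_types = [f"{left}-{right}" for left, right in product(classes, repeat=2)]
print(len(diphone_types))  # → 2116, i.e. 46^2: a closed, manageable set
```

Not every pair occurs in real speech, but even the full set of 2116 types is small enough to record and store.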
If we record an inventory of natural spoken utterances and segment them into diphones, we can extract those diphones and rearrange them to make new utterances.
Here's a toy example, using just a few short phrases.
In general, we would actually have a database of thousands or tens of thousands of recorded sentences.
This picture is the database.
These are natural sentences, segmented into diphones.
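Segmenting a recorded utterance into diphones just means placing the unit boundaries at phone midpoints. Here is a sketch, assuming we already have time-aligned phone labels; the label format and the timings are invented for illustration.

```python
def segment_into_diphones(phone_segments):
    """Given phone labels as (name, start, end) in seconds, return
    diphone segments whose boundaries lie at the phone midpoints.

    Assumes the utterance begins and ends with a 'sil' label, so no
    extra padding is needed.
    """
    mids = [(start + end) / 2 for _, start, end in phone_segments]
    diphones = []
    for i in range(len(phone_segments) - 1):
        name = f"{phone_segments[i][0]}-{phone_segments[i + 1][0]}"
        diphones.append((name, mids[i], mids[i + 1]))
    return diphones

# invented alignment for the word "cat"
labels = [("sil", 0.00, 0.10), ("k", 0.10, 0.18),
          ("ae", 0.18, 0.33), ("t", 0.33, 0.45), ("sil", 0.45, 0.60)]
for d in segment_into_diphones(labels):
    print(d)
# e.g. ('k-ae', 0.14, 0.255): from the middle of [k] to the middle of [ae]
```

Each extracted unit spans from the stable centre of one phone to the stable centre of the next, keeping the highly variable phone transitions intact inside the unit.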
Now let's make an utterance that was not recorded.
In other words, let's do speech synthesis!
This word is not in the database.
We've created it - we've synthesised it - by taking diphones (or short sequences of diphones) from the database and joining them together.
There's actually more than one way to make this word from that database.
Here's another way.
This way involves taking longer contiguous sequences of diphones and only making a join here; that might sound better.
The phone is not a useful building block directly, but we can use it to define the diphone, which is a useful building block for synthesising speech.
We make new utterances by rearranging units from recorded speech.
The smallest unit we'll ever take is the diphone, but we already saw in the toy example that taking sequences of diphones is also possible.
That will involve making fewer joins between the sequences and it'll probably sound better, on average.
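One simple way to take longer contiguous runs and make fewer joins is a greedy longest-match search over the database. This is only an illustration of the idea (real systems instead minimise a cost function over candidate units); the toy database contents are made up.

```python
def greedy_cover(target, database):
    """Cover the target diphone sequence with the longest contiguous
    runs found in any database utterance (greedy; illustrative only).

    Returns the chosen chunks; joins occur between chunks, always at
    phone centres, because the units are diphones.
    """
    chunks, i = [], 0
    while i < len(target):
        best = 0
        for utt in database:
            for j in range(len(utt)):
                k = 0
                while (i + k < len(target) and j + k < len(utt)
                       and utt[j + k] == target[i + k]):
                    k += 1
                best = max(best, k)
        if best == 0:
            raise ValueError(f"diphone {target[i]!r} not in database")
        chunks.append(target[i:i + best])
        i += best
    return chunks

# toy database: two utterances, already segmented into diphones
db = [["sil-k", "k-ae", "ae-t", "t-sil"],   # "cat"
      ["sil-b", "b-ae", "ae-n", "n-sil"]]   # "ban"
target = ["sil-b", "b-ae", "ae-t", "t-sil"]  # "bat", never recorded
print(greedy_cover(target, db))
# → [['sil-b', 'b-ae'], ['ae-t', 't-sil']]  (a single join, mid-[ae])
```

Two contiguous runs cover the whole target word, so only one join is needed, and it falls in the middle of the [æ].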
The key point here is that the joins in the synthetic speech will always be in the middle of phones and never at phone boundaries.
I've said that we're going to join sequences of diphones together, so the next step is actually to define precisely how that's done.
We need a method for concatenating recorded waveforms.
