TD-PSOLA

Applying overlap-add techniques to pitch period waveforms allows the modification of F0 and duration without changing the phone identity. Note: we will talk a bit more about unit selection in the lecture, specifically target and join costs, but not the actual algorithm for selection (the Viterbi algorithm - though we will come back to this in the ASR modules).

Waveform generation is going to be achieved using a stored database of natural utterances, from which we can select diphones or sequences of diphones.
Then we'll concatenate those waveform fragments.
Each of them will have the correct pronunciation, but in general they won't have the values of F0 and duration that our front end has predicted.
So we need to modify F0 and duration of recorded natural speech.
We discovered that we can represent the vocal tract filter as its impulse response.
We called that the 'pitch period'.
We saw that we can extract these pitch periods from natural speech using pitch-synchronous overlapping analysis frames with a tapered window, a kind of short-term analysis.
We're now going to combine that idea with the overlap-add technique for concatenating waveforms to create a general method, called TD-PSOLA, which can modify F0 and duration of recorded speech.
The Time-Domain Pitch-Synchronous Overlap-and-Add algorithm operates in the time domain: on waveforms.
This has a potential advantage over explicitly fitting a source-filter model, because we don't need to make any assumptions about the form of filter.
For example, we don't need to decide how many coefficients to have in the difference equation.
We could also avoid having to solve for those filter coefficients: that's a potentially error-prone process.
TD-PSOLA uses pitch-synchronous short-term analysis to extract pitch periods from natural speech.
Then it uses overlap-add to construct a modified waveform from those pitch period building blocks.
That's not as powerful as explicitly using a source-filter model, though.
Because the filter response is represented in the time domain as its impulse response - the pitch periods - it's not in a form that we could easily modify, so TD-PSOLA can actually only modify F0 and duration.
Those are both source features.
It cannot modify the vocal tract filter: it will attempt to leave that unmodified.
Here's a reminder of how we can break down a natural speech waveform into its pitch periods.
Notice that they overlap in the original signal.
That's essential because we're using a tapered window on each analysis frame.
As we've seen before, we can overlap-add these pitch periods to reconstruct the original signal.
The tapered window that we apply to each pitch period makes sure that they do add back up to the original signal where they overlap.
Wherever these waveforms overlap, we just add them together, sample by sample.
We'll reconstruct the original waveform very closely; not exactly, but very closely.
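As a minimal sketch of that copy synthesis, here is pitch-synchronous analysis followed by overlap-add in plain Python. Everything in it is an assumption made for illustration: the test signal is a synthetic periodic waveform with a constant fundamental period of 80 samples, the tapered window is a Hann window, and the names extract_frames and overlap_add are hypothetical. With real speech the period varies from one pitch mark to the next, which is one reason the reconstruction is close rather than exact.

```python
import math

def hann(n):
    """Tapered (Hann) window; copies shifted by n/2 samples sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def extract_frames(signal, period):
    """Cut windowed analysis frames, each 2 periods long, one period apart."""
    window = hann(2 * period)
    frames = []
    for start in range(0, len(signal) - 2 * period + 1, period):
        frames.append([signal[start + i] * window[i] for i in range(2 * period)])
    return frames

def overlap_add(frames, hop):
    """Place frames `hop` samples apart and add them sample by sample."""
    out = [0.0] * (hop * (len(frames) - 1) + len(frames[0]))
    for k, frame in enumerate(frames):
        for i, value in enumerate(frame):
            out[k * hop + i] += value
    return out

period = 80  # assumed constant fundamental period, in samples
signal = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(6 * math.pi * n / period)
          for n in range(8 * period)]

frames = extract_frames(signal, period)
rebuilt = overlap_add(frames, period)

# Compare away from the edges, where only one tapered frame contributes.
error = max(abs(a - b) for a, b in zip(signal[period:-period], rebuilt[period:-period]))
print(len(frames), error)
```

The first and last period of the rebuilt signal are covered by only one tapered frame, so they fade in and out; the comparison skips them for that reason.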
This is just copy synthesis, but how about doing this?
What do you think that will sound like?
Pause the video.
In the lower waveform, the fundamental period is larger than in the original waveform, so it will have a lower F0.
So we'll perceive a lower pitch.
The duration has also been changed though: it will be longer.
But importantly, the individual pitch periods have not been changed.
We're still playing back a sequence of impulse responses.
So the vocal tract filter is the same and we'll hear the same phone.
That's changing the fundamental period and therefore F0.
I can also change duration by either duplicating or deleting pitch periods.
So let's reduce the duration of this one.
I'll lose this pitch period and overlap-and-add like this.
Now I've got a signal with about the same duration as the original, but with only 6 fundamental periods where there used to be 7.
I've reduced F0 without changing duration or changing the vocal tract filter.
I can apply any combination of sliding the pitch periods a little closer together or a little further apart, with duplicating or deleting pitch periods, to gain independent control over F0 and duration.
Let's see how that works.
Let's increase F0.
What are we going to do? Slide them a little closer together to reduce the fundamental period.
Let's decrease F0: slide them a little further apart - increase the fundamental period.
Let's increase duration.
By duplicating a pitch period, we'll make some space for it.
Take one, copy it.
We've now made the duration longer.
I could repeat that; I could duplicate more and more of the pitch periods to make the duration longer and longer.
Let's decrease duration.
Pick a pitch period, lose it, and close the gap.
I can increase or decrease F0, increase or decrease duration, and combine those operations any way I like.
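Those two operations can be sketched in a few lines of Python. This is an illustrative toy, not a full TD-PSOLA implementation: it assumes a constant fundamental period, uses 7 identical made-up pitch periods as input, and the function names modify and overlap_add_at and the parameters f0_scale and duration_scale are hypothetical.

```python
import math

def overlap_add_at(frames, marks, length):
    """Add each frame into the output, centred on its pitch mark."""
    out = [0.0] * length
    for frame, mark in zip(frames, marks):
        start = mark - len(frame) // 2
        for i, value in enumerate(frame):
            if 0 <= start + i < length:
                out[start + i] += value
    return out

def modify(frames, old_period, f0_scale=1.0, duration_scale=1.0):
    """Re-space pitch periods to change F0; duplicate or delete them to set duration."""
    new_period = round(old_period / f0_scale)
    target_length = len(frames) * old_period * duration_scale
    n_out = max(1, round(target_length / new_period))
    # Map each output period to a source frame: repeated indices duplicate
    # a pitch period, skipped indices delete one.
    indices = [k * len(frames) // n_out for k in range(n_out)]
    marks = [(k + 1) * new_period for k in range(n_out)]
    wave = overlap_add_at([frames[i] for i in indices], marks, (n_out + 1) * new_period)
    return wave, indices, new_period

old_period = 60  # assumed fundamental period of the recording, in samples
window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (2 * old_period))
          for i in range(2 * old_period)]
pitch_period = [w * math.cos(2 * math.pi * 5 * i / (2 * old_period))
                for i, w in enumerate(window)]
frames = [pitch_period] * 7  # toy input: 7 identical pitch periods

# Lower F0 at the same duration: 6 periods where there used to be 7.
lower_wave, lower_idx, lower_period = modify(frames, old_period, f0_scale=6 / 7)
# Same F0 at twice the duration: every pitch period is used twice.
slower_wave, slower_idx, _ = modify(frames, old_period, duration_scale=2.0)
print(lower_period, lower_idx)
print(slower_idx)
```

The frames themselves are never altered; only their spacing and the number of times each one is used change, which is why the vocal tract filter is left alone.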
Let's put that all together and do speech synthesis!
Finally, then, here's the complete process of waveform generation.
Here's a database of diphone units.
Each row is a natural utterance that I've recorded.
Inside each of these diphones is a waveform fragment.
I've segmented these utterances into diphones.
I can select units from this database, and concatenate them to make a new word or utterance.
I'm not actually going to explain how we decide between different alternative ways of constructing the new utterance here.
That's a big problem to be solved in a longer course on speech synthesis.
The method's actually called 'unit selection', for obvious reasons.
Given this sequence of diphones from the database with the waveforms inside them, we now need to concatenate them.
Each diphone contains a waveform.
But it will be at whatever F0 and duration the speaker happened to use when we recorded the database.
That won't match our desired values, predicted from the front end.
So we need to apply TD-PSOLA to each diphone.
Inside these diphones are waveforms.
We're going to modify those waveforms using TD-PSOLA.
Here's a fragment of waveform.
We first break it down into its constituent pitch periods: replace it with these analysis frames, each of which is 2 fundamental periods long.
Then we apply TD-PSOLA to match the predictions from the front end.
Here, my front end predicted that F0 needs to actually be a little bit lower than what this speech was recorded at, and the duration needs to be a little bit longer.
So I just spread the pitch periods out a bit and duplicate one to make the duration correct.
This waveform now matches the predictions from the front end.
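The arithmetic behind that adjustment is simple enough to sketch. All of the numbers below are made up for illustration (a diphone recorded at 120 Hz lasting 0.10 s, a front-end target of 100 Hz lasting 0.13 s, a 16 kHz sampling rate), and the function name plan is hypothetical.

```python
def plan(recorded_f0, recorded_duration, target_f0, target_duration, sample_rate):
    """Work out how far apart to place pitch periods and how many are needed."""
    periods_in = round(recorded_f0 * recorded_duration)   # pitch periods we have
    periods_out = round(target_f0 * target_duration)      # pitch periods we need
    new_period = round(sample_rate / target_f0)           # spacing in samples
    return periods_in, periods_out, new_period

periods_in, periods_out, new_period = plan(120.0, 0.10, 100.0, 0.13, 16000)
to_duplicate = periods_out - periods_in  # negative would mean deleting periods
print(periods_in, periods_out, new_period, to_duplicate)
```

With these assumed values the 12 recorded pitch periods are spread out to 160 samples apart, and one is duplicated to give the 13 periods needed for the longer duration.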
I'll finish with some examples of TD-PSOLA operating.
I'm going to just use a complete natural utterance as the starting point rather than concatenated diphones.
Here it is: 'Nothing's impossible.'
We can change F0, like this: 'Nothing's impossible.'
Or like this: 'Nothing's impossible.'
We can change duration; how about faster: 'Nothing's impossible.'
Or slower: 'Nothing's impossible.'
Those audio samples were actually made with my own very simple implementation of TD-PSOLA, and you'll hear a few artefacts, especially in the one that's slowed down.
A more sophisticated implementation would reduce those quite a lot, but there are still limits to how much you can modify duration and F0 away from the original values before TD-PSOLA degrades the signal.
That's the end of the material on waveform generation.
But just before we finish, there's one more topic to cover.
We've seen how the pitch period is a building block of speech, because it's the impulse response of the vocal tract filter.
It's a view of the filter in the time domain.
When we saw the source-filter model, we saw that we could combine the magnitude spectrum of the input signal with the magnitude frequency response of the filter to obtain the magnitude spectrum of the output signal.
We combine them simply by multiplying them in the magnitude spectrum domain.
In the time domain, the operation that combines the input signal with the filter's impulse response to produce the output signal is called 'convolution'.
We need to understand the relationship between multiplication in the magnitude spectral domain and convolution in the time domain.
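As a small time-domain illustration, the sketch below convolves an impulse train with an impulse response and checks that the result is the same as overlap-adding one copy of the impulse response at each impulse. The 4-sample impulse response and the period of 3 samples are made-up toy values, and convolve is written out by hand rather than taken from a library.

```python
def convolve(x, h):
    """Direct convolution: every input sample launches a scaled copy of h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, x_n in enumerate(x):
        for k, h_k in enumerate(h):
            y[n + k] += x_n * h_k
    return y

impulse_response = [1.0, 0.6, 0.3, 0.1]  # toy 'pitch period'
period = 3                               # toy fundamental period, in samples
impulse_train = [1.0 if n % period == 0 else 0.0 for n in range(9)]

output = convolve(impulse_train, impulse_response)

# The same thing by overlap-add: one copy of the impulse response per impulse.
overlap_added = [0.0] * len(output)
for start in range(0, len(impulse_train), period):
    for k, h_k in enumerate(impulse_response):
        overlap_added[start + k] += h_k

difference = max(abs(a - b) for a, b in zip(output, overlap_added))
print(output)
```

Because the impulse response here is longer than the period, consecutive copies overlap and add, just as the pitch periods do.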
