Convolution

A non-mathematical illustration of the equivalence of convolution (in the time domain), multiplication of magnitude spectra, and addition of log magnitude spectra.

We spent some time developing an understanding of the source-filter model.
We know that the filter can be described in several ways, including as its impulse response.
That led us to the idea of the pitch period as a building block for manipulating speech.
Now we're going to look at just how the source and filter combine in the time domain.
This operation is called 'convolution'.
First, a reminder of how the source-filter model operates, using the impulse response as our description of the filter.
Here's the filter, as its impulse response.
We need an input: an excitation signal.
Let's start with just one impulse.
If we input that to the filter, by definition, as output, we obtain the filter's impulse response.
If we put in a sequence of two impulses, we get two impulse responses out.
Put in three, get out three impulse responses, and so on.
I like to think of this as each impulse exciting an impulse response and writing that into the output at the appropriate time.
This impulse starts at 10 ms, and so it starts writing an impulse response into the output signal at 10 ms.
This impulse writes its impulse response at its time, and so on.
These impulses are quite widely spaced in time, and so each of these impulse responses has pretty much decayed to zero before we write the next impulse response in.
That's a very easy case.
If we decrease the fundamental period of the excitation, those impulse responses will be written into the output closer together in time, like this, or like this, and so on.
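Here's a minimal sketch of that 'writing' idea in Python with NumPy. The filter is a made-up decaying sinusoid standing in for an impulse response, and the impulse positions are chosen only for illustration, so treat all the numbers as assumptions. The impulses are placed closer together than the length of the impulse response, so the copies overlap and add.

```python
import numpy as np

fs = 16000  # sampling rate in Hz

# Hypothetical filter: a decaying 500 Hz sinusoid used as the impulse response.
t = np.arange(200) / fs
h = np.exp(-t * 400.0) * np.sin(2 * np.pi * 500.0 * t)

# Excitation: three unit impulses, the first at 10 ms (sample 160 at 16 kHz).
impulse_positions = [160, 260, 360]
e = np.zeros(800)
e[impulse_positions] = 1.0

# Each impulse writes one copy of the impulse response into the output,
# starting at the time of that impulse. Overlapping copies simply add.
y_written = np.zeros(len(e) + len(h) - 1)
for pos in impulse_positions:
    y_written[pos:pos + len(h)] += h

# NumPy's convolution of the excitation with the impulse response.
y_conv = np.convolve(e, h)

print(np.allclose(y_written, y_conv))
```

The comparison checks that writing impulse responses one at a time gives the same output as `np.convolve`, which is the operation we're about to name.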
Now let's use an impulse train as the excitation.
The operation that combines these two waveforms to produce this waveform is called 'convolution'.
Convolution is written with a star symbol.
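For reference, this is the usual discrete-time definition. The symbols are assumptions chosen here, not names from the video: e is the excitation, h is the impulse response, and y is the output.

```latex
y[n] = (e * h)[n] = \sum_{k=-\infty}^{\infty} e[k] \, h[n-k]
```

Each non-zero sample e[k] contributes a copy of h, scaled by e[k] and shifted to start at time k.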
Let's do some convolution, whilst we inspect the magnitude spectrum of each of these signals.
That's the magnitude spectrum of the excitation, of the filter, and of the output.
The waveforms are sampled at 16 kHz, which means the Nyquist frequency will be 8 kHz.
I've zoomed in the frequency axis so we can see a little bit more detail: I'm just plotting it up to 3 kHz.
I'm going to increase the fundamental frequency of the excitation.
I'd like you to watch, first of all, what happens to the magnitude spectrum of the excitation itself.
Just watch this corner.
That behaves as expected.
We have harmonics at the fundamental frequency and at all integer multiples of it.
So, as the fundamental frequency goes up, those harmonics get more widely spaced.
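A small sketch of that behaviour, assuming an ideal impulse train whose fundamental period is a whole number of samples. The function name `harmonic_frequencies` and the 100 ms signal length are choices made for this example only.

```python
import numpy as np

fs = 16000          # sampling rate in Hz
n_samples = 1600    # 100 ms of signal, so each FFT bin is 10 Hz wide

def harmonic_frequencies(f0):
    """Frequencies (Hz) at which an impulse train with fundamental f0 has energy."""
    period = fs // f0                       # fundamental period in samples (assumed exact)
    e = np.zeros(n_samples)
    e[::period] = 1.0                       # one unit impulse per period
    magnitude = np.abs(np.fft.rfft(e))
    freqs = np.arange(len(magnitude)) * fs / n_samples
    return freqs[magnitude > 1e-6]          # keep only bins that contain energy

low = harmonic_frequencies(100)
high = harmonic_frequencies(200)
print(low[:4])    # first few components for F0 = 100 Hz
print(high[:4])   # first few components for F0 = 200 Hz
```

Apart from the component at 0 Hz, the energy sits at integer multiples of F0, so doubling F0 doubles the spacing between harmonics.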
I'm going to vary F0 again now, but this time I want you to look at the magnitude spectrum of the speech.
I'm decreasing F0.
What do we see?
Well, we saw those harmonics getting closer and closer together, because they're multiples of the fundamental frequency.
But the envelope remains constant because that's determined by the filter.
If you're watching really closely, you might have seen the absolute level of this go up and down a small amount.
That's simply because the amount of energy in the excitation signal varies with the number of impulses per second.
That's not important here.
It's the way that these two magnitude spectra combine that we're trying to understand.
So I'll just vary F0 a few more times and have a look at the different magnitude spectra and try and understand how this and this combine to make this.
Increasing F0 ... and decreasing F0.
One more time: increasing it again.
Let's try something else.
Let's keep the excitation fixed and then let's vary the filter.
That's a different filter ... and that's another one.
I'll do that a few more times.
Look at the magnitude spectrum of the output and see what varies there.
Only the filter is changing.
The excitation is constant.
So, this time, the harmonic structure remained the same and the envelope followed that of the filter.
We're getting a pretty good understanding, then, of how these two things combine to make the spectrum of the output.
The two waveforms combine using convolution in the time domain.
The Fourier transform converts convolution into multiplication.
That means that the source and the filter can be combined by multiplying their magnitude spectra.
That's something we mentioned in passing back when we were talking about the source-filter model.
But we should be a bit more careful.
Look very closely at the axes on the plots of the magnitude spectrum.
You'll see that we're using a logarithmic axis.
You can see that because the units are dB.
Taking the logarithm converts multiplication into addition.
So, in fact, the operation that combines the log magnitude spectrum of the excitation with the log magnitude spectrum of the filter is addition.
That's a really elegant and simple way to combine source and filter in the frequency domain.
We simply add together their log magnitude spectra!
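Written out as equations, the chain looks like this. The symbols are assumptions introduced here: e, h and y are the excitation, impulse response and output waveforms, and E, H and Y are their Fourier transforms. The factor of 20 gives units of dB.

```latex
y[n] = (e * h)[n]
\quad\Longrightarrow\quad
|Y(f)| = |E(f)| \, |H(f)|
\quad\Longrightarrow\quad
20 \log_{10} |Y(f)| = 20 \log_{10} |E(f)| + 20 \log_{10} |H(f)|
```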
There's nothing in this diagram that requires this to be an impulse train and this to be the impulse response of a filter.
They could be any two waveforms, and the operation of convolution is still defined.
That means that this relationship is just generally true.
Convolution of any two waveforms in the time domain is equivalent to summation of their log magnitude spectra.
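We can check that numerically with two arbitrary waveforms. This is only a sketch: the waveforms are random noise with lengths picked for the example, and the FFTs are zero-padded to the length of the convolved signal so that all three spectra share the same frequency bins.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(300)   # any waveform
b = rng.standard_normal(120)   # any other waveform

y = np.convolve(a, b)          # time-domain convolution, length 300 + 120 - 1

# Zero-pad all three FFTs to the same length so the bins line up.
n_fft = len(y)
A = np.fft.rfft(a, n_fft)
B = np.fft.rfft(b, n_fft)
Y = np.fft.rfft(y, n_fft)

def log_magnitude_db(spectrum):
    """Log magnitude spectrum in dB."""
    return 20 * np.log10(np.abs(spectrum))

# Magnitude spectra multiply ...
print(np.allclose(np.abs(Y), np.abs(A) * np.abs(B)))
# ... and log magnitude spectra add.
print(np.allclose(log_magnitude_db(Y), log_magnitude_db(A) + log_magnitude_db(B)))
```

Both comparisons are expected to hold up to floating-point rounding, whatever the two waveforms are.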
Given a speech signal like this, and its log magnitude spectrum, we quite often want to recover the source or the filter from that signal.
For example, we'd like to recover this, which is the vocal tract frequency response (sometimes we use the more general idea of 'spectral envelope').
That means doing this equation in reverse.
Starting from this, we'd like to decompose it into a summation of two parts: one being the source, and one being the filter.
That's going to be much easier in the log magnitude spectrum domain than in the time domain, because reversing a summation is much easier than undoing a convolution.
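To see why reversing a summation is easy, here's a toy sketch that assumes, unrealistically, that we already know the source's log magnitude spectrum. The source (random noise) and the filter (a decaying cosine) are both hypothetical, and this is not the method we'll develop next, just an illustration that subtraction undoes the addition.

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(1)

# Hypothetical source and filter; in practice neither is known in advance.
source = rng.standard_normal(400)
t = np.arange(100) / fs
filt = np.exp(-t * 600.0) * np.cos(2 * np.pi * 700.0 * t)

# The observed signal is their convolution.
speech = np.convolve(source, filt)
n_fft = len(speech)

def log_mag(x):
    """Log magnitude spectrum, zero-padded to a common FFT length."""
    return np.log(np.abs(np.fft.rfft(x, n_fft)))

# Undo the summation: subtract the source's log magnitude spectrum.
recovered_filter = log_mag(speech) - log_mag(source)

print(np.allclose(recovered_filter, log_mag(filt)))
```

Doing the equivalent in the time domain would mean deconvolving the two waveforms, which is a much harder problem.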
We've seen that convolution in the time domain became multiplication in the magnitude spectral domain and then addition in the log magnitude spectral domain.
This has applications in Automatic Speech Recognition, where we'll want only the vocal tract filter's frequency response to use as a feature for identifying which phone is being said.
We'd like to get rid of the effects of the source.
We're going to develop a simple method to isolate the filter's frequency response without having to fit a source-filter model or find the fundamental periods.
The method starts with the log magnitude spectrum and makes a further transformation into a new representation called the 'cepstrum', where the source and filter are very easy to separate.