Source-filter model

Finally, we arrive at a complete model of speech signals that can generate any speech sound.

We now have all the components needed to assemble a computational model of speech.
But, before we get into that, let's remind ourselves what the model will be used for.
Well, lots of things!
Primarily, understanding and explaining speech signals.
But it has engineering applications.
We could use it to manipulate speech.
For example, we might wish to change the fundamental frequency or the duration, without changing the spectral envelope: that gives us a route to creating entirely synthetic speech.
We might also use this model to extract features from speech, based only on the spectral envelope, without properties of the source, for use in Automatic Speech Recognition.
The model is going to work in the time domain: in the domain of signals, but we're not attempting to model the physics in detail.
We're working with recorded digital speech signals: waveforms.
So this model is a model of the speech signal.
Here are the key components.
We've understood that the vocal tract has resonances: they are called formants.
We generalised that idea to one of a filter.
The filter was a linear filter: a simple equation operating in the time domain.
We can characterise that filter in various ways.
There are the coefficients of the equation, there is its frequency response, and there is its impulse response.
If we take its impulse response and excite the filter with a train of impulses, we'll get out a train of impulse responses.
That's the speech signal.
So that's our source-filter model.
Let's write down the full notation of our source-filter model.
Here's a filter; it has some number of coefficients, p, called the order of the filter.
We have to choose that value.
I'm going to change from my earlier generic notation of x for input and y for output to some meaningful letters.
I'm going to use e for 'excitation' (that's the input signal) and s for 'speech' (the output signal).
We've already established the form of the filter.
Here's a general equation; that's the same as the one we've seen before.
This part used to be written '1.0 times the input' (for which we were using x).
That's where that's gone.
This part here is the weighted sum of previous outputs.
This t-k is saying 'the previous output' when the k is 1, 2, 3, and so on.
We use p previous outputs.
It looks a little odd to use k as the index term to count up to p, but that's the standard notation and I'm not going to deviate from that.
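As a concrete sketch of that equation in Python (the function name is mine, and a direct double loop is used for clarity rather than speed):

```python
import numpy as np

def apply_filter(e, a):
    """All-pole filter: s[t] = e[t] + sum over k of a[k-1] * s[t-k],
    where p = len(a) is the order of the filter."""
    s = np.zeros(len(e))
    for t in range(len(e)):
        s[t] = e[t]                        # the '1.0 times the input' part
        for k in range(1, len(a) + 1):     # the weighted sum of p previous outputs
            if t - k >= 0:
                s[t] += a[k - 1] * s[t - k]
    return s
```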
So far, we've only seen the model generating voiced speech.
We put in a periodic signal as the excitation: the simplest possible signal that has energy at every multiple of the fundamental, because that's what we see in speech signals.
That simple signal is an impulse train.
What we get out is the sequence of impulse responses overlaid on one another (because the filter's linear): that's our synthetic speech signal.
So it's a model of signals.
The input e is a signal: it's just a waveform.
It's indexed by discrete time t because it's a digital waveform.
e is entirely synthetic: it's something we will generate automatically.
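For example, here is a minimal sketch of generating such an excitation signal, assuming a 16 kHz sample rate and an F0 of 200 Hz (both values chosen arbitrarily for illustration):

```python
import numpy as np

fs = 16000                  # assumed sample rate in Hz
f0 = 200                    # assumed fundamental frequency in Hz
e = np.zeros(fs)            # one second of zeros...
e[::int(fs / f0)] = 1.0     # ...with an impulse every fs/f0 = 80 samples
```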
Let's watch the model in action, in the time domain.
We simply write out this impulse train and we'll run that through this equation and get speech as the output.
An impulse train is mostly 0 and then, just occasionally, there's a 1.
That's the first impulse in our impulse train.
That will excite the filter and the output of the filter will, of course, be its impulse response.
Instead of looking at this equation, let's look at the impulse response of the filter.
In comes an impulse; the filter outputs its impulse response; that's the output.
So we just write that on to the output.
Because this impulse came in at time t = 5 ms, the output impulse response will start from t = 5 ms.
On goes our input.
Nothing happens...and then, some time later (another 5 ms later, in fact), in comes the next impulse.
That also excites an impulse response from the filter, which starts at time t = 10 ms, and writes onto the output.
We can see here that the second impulse response just overlapped and added to the first impulse response.
Now, why did we just overlap-and-add that second impulse response?
Well, that's what the filter equation tells us to do.
It says that the output is just a weighted sum of the input impulse and the previous filter outputs: it's linear.
That linear nature of the filter tells us that the output is just a sequence of overlapped-and-added impulse responses.
This whole process of taking this time domain signal and using it to provoke impulse responses and then overlap-and-adding them in the output is called 'convolution'.
That process would work for any input signal.
It could have other non-zero values anywhere.
It could be a complete waveform of any sort we like and each sample would be treated like an impulse.
It would provoke an impulse response which would write into the output.
This idea of convolution is something we will come back to later.
Here we're using it to combine the excitation and the filter's impulse response to produce the filter's output.
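Here is a small numerical check of that equivalence, using a made-up order-2 resonator (the coefficients are illustrative, not fitted to any real vowel): running the excitation through the filter equation gives exactly the same output as convolving it with the filter's impulse response.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
# illustrative resonator near 500 Hz; scipy's convention negates the
# feedback coefficients, so den = [1, -a1, -a2]
den = [1.0, -1.86, 0.9025]

e = np.zeros(fs)
e[::fs // 200] = 1.0                      # impulse train at F0 = 200 Hz

# route 1: the filter equation, applied sample by sample
s1 = lfilter([1.0], den, e)

# route 2: measure the impulse response, then convolve
impulse = np.zeros(fs); impulse[0] = 1.0
h = lfilter([1.0], den, impulse)          # the impulse response
s2 = np.convolve(e, h)[:fs]               # overlap-and-add of impulse responses

print(np.allclose(s1, s2))                # True: convolution = filtering
```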
Understanding that process as convolution in the time domain is just fine, but convolution is a slightly complicated operation, so let's go to the frequency domain where things will look a little bit simpler.
I'm going to synthesise this vowel.
To do that, I just simply need to choose appropriate values for the filter's coefficients.
Those values determine the impulse response of the filter.
If we look at the Fourier transform of this signal, we get the frequency response of the filter.
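In code, scipy can evaluate that frequency response directly from the filter coefficients, without explicitly taking a Fourier transform of the impulse response (again using the made-up resonator from the sketch above):

```python
import numpy as np
from scipy.signal import freqz

# frequencies w (in Hz) and complex response H of the made-up resonator
w, H = freqz([1.0], [1.0, -1.86, 0.9025], worN=512, fs=16000)
magnitude = np.abs(H)    # its peaks are the resonances (formants)
```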
So let us now observe the source-filter model generating speech in the frequency domain.
I'll put in my impulse train: there it is, with its characteristic spectrum with a flat spectral envelope and equal amounts of energy at every multiple of F0.
This signal's going to go through the filter and produce some output.
These magnitude spectra really reveal to us how this filter is operating.
We take the magnitude spectrum of the excitation signal and it is multiplied by the frequency response of the filter to give us the magnitude spectrum of the speech.
In other words, the slightly complicated operation of convolution in the time domain has become a rather simpler operation of multiplication in the frequency domain.
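That identity is easy to verify numerically with two arbitrary toy signals: convolving them in the time domain matches multiplying their zero-padded spectra in the frequency domain.

```python
import numpy as np

e = np.array([1.0, 0.0, 0.0, 0.5, 0.0])   # a toy excitation
h = np.array([1.0, 0.8, 0.3])             # a toy impulse response

# time domain: convolution
s_time = np.convolve(e, h)

# frequency domain: multiply the (zero-padded) spectra, then invert
n = len(e) + len(h) - 1
s_freq = np.fft.irfft(np.fft.rfft(e, n) * np.fft.rfft(h, n), n)

print(np.allclose(s_time, s_freq))         # True
```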
This filter just has 4 coefficients - it's order 4 - and it has 2 resonances.
That's a very simple filter.
It's going to be a bit simpler than a real vocal tract.
I've excited it here with an entirely synthetic signal - an impulse train - and so we can generate a synthetic speech sound.
That's not great, but it has the properties of speech and you can probably hear that vowel.
If we wanted to make a different vowel, we can keep the input the same.
We could have different filter coefficients, leading to a different frequency response.
Here it is.
We will multiply this input magnitude spectrum by this frequency response to get the output, which looks like this.
You can see that the resonant peaks are controlled by the filter and the harmonic structure is controlled by the source.
That sounds like this.
You can perceive a different vowel.
Again, it's not very natural, because these are very simple filters with very simple input.
We can keep changing the vowel by changing the filter coefficients, and therefore its frequency response.
Here's one more.
That looks like this.
Now, how about keeping the filter the same and changing the source?
The source only has one thing that you can change and that's the fundamental frequency.
Make it lower.
Or make it higher.
Hopefully you can perceive that only the pitch is changing and the vowel quality is the same.
We've independently controlled source and filter.
OK, that's enough vowels.
This is supposed to be a general model that could generate any speech sound.
So we'd better demonstrate it doing something other than vowels.
Let's make an unvoiced fricative.
To make this phoneme, we need to make a constriction that creates turbulent airflow somewhere in the vocal tract.
Then the part of the vocal tract that's in front of that - between that constriction and the lips - acts as the filter and shapes that basic turbulent sound to make this unvoiced fricative.
Just check that you can make this fricative.
Pause the video.
You made a constriction.
You forced air through it to create a basic, turbulent sound.
Then the remaining part of the vocal tract is the filter.
We can model that basic sound of turbulence with a synthetic signal.
This is a random signal: it has no periodicity, but it still has a flat spectral envelope.
We call that 'white noise' because it has an equal amount of energy at all frequencies, like white light has equal amounts of energy at all colours.
That's our simulation of the basic sound created by turbulence at a constriction.
Because there's less vocal tract between the constriction and the lips than for voiced speech, where the sound source is right down at the bottom of the vocal tract, the filter is of a simpler shape.
Therefore, we'd expect to see simpler spectral envelopes for filters in this case.
With this spectral envelope - this frequency response - we can put white noise through this filter and make this sound [s].
By changing the frequency response of the filter to this one, we can make a [sh].
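As a sketch, assuming a 16 kHz sample rate and using filter coefficients invented purely for illustration (not fitted to real fricatives), the recipe is just: generate white noise, then filter it.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
noise = np.random.randn(fs)     # white noise: flat spectral envelope, no periodicity

# a single pole near Nyquist boosts high frequencies: crudely [s]-like
s_like = lfilter([1.0], [1.0, 0.9], noise)

# a resonator near 2.5 kHz concentrates energy lower down: crudely [sh]-like
r, f = 0.9, 2500.0
sh_like = lfilter([1.0],
                  [1.0, -2 * r * np.cos(2 * np.pi * f / fs), r ** 2],
                  noise)
```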
Now, it's all very well playing around with filter coefficients and with impulse trains and white noise as input, but it's never going to sound perfectly natural, in particular because the filter is a bit too simple.
Everything you've heard so far was generated from hand-designed filters.
I made them.
I made up their coefficients to approximate some vowel sounds.
That's very limiting.
That's not going to be a good way to generate synthetic speech.
Wouldn't it be better if we could take natural speech, like this, and fit the filter to it?
Conceptually, that should be straightforward.
We know the natural speech waveform, and of course its magnitude spectrum.
So all we have to find is the frequency response of the filter and the set of coefficients that has that frequency response.
In other words, we have to find these values here.
There are a variety of algorithms that will solve for those filter coefficients, given a natural speech signal.
In other words, that will fit the filter to that signal.
We're not going to go deep into those here.
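The video doesn't name a particular algorithm, but a standard choice is the autocorrelation method of linear prediction, which reduces to solving a Toeplitz system of normal equations. A minimal sketch (the function name and the choice of window are mine):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def fit_filter(speech_frame, p):
    """Fit p filter coefficients to a frame of natural speech
    (autocorrelation method of linear prediction)."""
    x = speech_frame * np.hanning(len(speech_frame))   # taper the frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:]   # autocorrelation
    # solve the Toeplitz normal equations R a = r for the coefficients,
    # so that s[t] is approximated by sum over k of a[k-1] * s[t-k]
    return solve_toeplitz(r[:p], r[1:p + 1])
```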
I'm going to fit a more complicated filter.
I'm now going to ramp p up to a higher value of 24.
That's why this has a more complicated spectral envelope than we've seen so far.
But it still has peaks for the resonances.
I'm still going to excite this filter with an impulse train, so that part's still completely synthetic.
This part has been fitted to a natural speech signal.
By putting a synthetic impulse train through a filter whose coefficients have been recovered from natural speech, we can make synthetic speech that'll sound a lot closer to that vowel.
The fundamental frequency is very unnatural, because it's monotone.
But the spectral envelope is the same as that natural speech we just heard.
Now I can manipulate the fundamental frequency, whilst leaving the vowel quality alone.
I can raise the pitch and I can lower the pitch.
So, what have we achieved?
We've taken natural speech, we've fitted the source-filter model to it, in particular we solved for the filter coefficients, then we've excited that filter with synthetic impulse trains at a fundamental frequency of our choice.
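Pulling those steps together in one sketch (this assumes `frame` holds a voiced frame of natural speech sampled at 16 kHz, and reuses the `fit_filter` sketch from above):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
a = fit_filter(frame, 24)                 # solve for the 24 filter coefficients
f0 = 150                                  # a fundamental frequency of our choice
e = np.zeros(len(frame))
e[::int(fs / f0)] = 1.0                   # synthetic impulse train
s = lfilter([1.0], np.r_[1.0, -a], e)     # same spectral envelope, new pitch
```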
We brought quite a few components together there, to make our source-filter model, but there's still a bit further we can go.
Our source-filter model decomposes speech signals into a source component (that's either an impulse train for voiced speech, or white noise for unvoiced speech) and a filter (which has a frequency response determined by its coefficients).
We've seen that we could solve for those coefficients, given natural speech samples.
From now on, whenever you encounter something about a speech signal that you don't understand, come back to the source-filter model and try and use that to understand what's going on.
We've understood our filter in the time domain and the frequency domain, and also in its coefficients - in its difference equation.
In the time domain, its output, when we put in a single impulse, is called the impulse response.
That's such a special signal, it gets its own special name in speech processing, and it's called a 'pitch period'.
That pitch period is a fragment of waveform coming out of the filter, and that's going to offer us another route to using our source-filter model to modify speech without having to solve explicitly for the filter coefficients, because the pitch period completely characterises the filter.
Eventually we're going to completely understand the process of convolution.
That's the process by which the filter's impulse response combines with the filter's input to produce the output.
That will give us another way to separate the source and filter called 'cepstral analysis'.
We'll actually use that for Automatic Speech Recognition.
The common theme that will keep coming back again and again is that speech is created from a source and a filter.
The observed speech waveform - which is all we have as our starting point in speech processing - contains the properties of both of those combined.
For various applications, we want to separate the source and filter.
