Filter

We now shift from an explicit physical model of the vocal tract as a resonating tube, to a more general model of the vocal tract as a filter operating on signals.


The vocal tract is a tube, and we can vary its shape by moving our articulators.
We're going to model that.
But, to be absolutely clear, we are not modelling the physics of the vocal tract.
It's not going to be a model of waves propagating through the air, bouncing back and forth and so on.
Rather, we're going to model the behaviour of the vocal tract.
That means we're going to model how it modifies an input sound source - for example, from the vocal folds - to turn that into speech.
That modification is a process of filtering through the resonances of the vocal tract.
We're going to build a filter that models the behaviour of the vocal tract.
Here's how a filter operates.
There's an input signal.
I'm going to use an impulse train - a bit like the signal from the vocal folds.
There's a filter.
The filter takes this input and produces some output.
This particular filter is not very interesting!
In fact, it simply passes the input directly to the output.
But anyway, it's a start.
Let's write an equation that describes this filter.
We'll need some notation.
We'll call the input x; that's indexed by time t, and time is discrete.
That's why there are little points on the waveform, just to remind you.
That little point at every time sample is to show you that these are digital waveforms.
We need some notation for the output: we'll call it y.
We can write down the equation: 'What does this filter do? How does it make y from x?'
There it is: y[t] = 1 × x[t].
That's not a very exciting equation!
To make things explicit, I've written '1 times x'.
Writing things in that way is not standard mathematical notation.
So let's replace that with something that's a bit more standard.
We don't write the 'times', we just write this: y[t] = 1.0 x[t].
I've written 1.0 just for a bit of precision and so that it matches some equations that are coming later.
So here's a trivial filter: it takes x, multiplies it by 1 and produces y.
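As a minimal sketch of this trivial filter, the following Python builds an impulse train and passes it through y[t] = 1.0 x[t]. The function names, the signal length, and the spacing between impulses are assumptions made for illustration; none of them come from the video.

```python
# Illustrative sketch: length and period are arbitrary, assumed values.

def impulse_train(length, period):
    """Return a list that is 1.0 every `period` samples and 0.0 elsewhere."""
    return [1.0 if t % period == 0 else 0.0 for t in range(length)]


def trivial_filter(x):
    """y[t] = 1.0 * x[t]: pass each input sample straight to the output."""
    return [1.0 * sample for sample in x]


x = impulse_train(length=20, period=5)
y = trivial_filter(x)

print("x:", x)
print("y:", y)
```

The output is identical to the input, which is all this filter does.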
How about something a bit less trivial?
I'm going to keep my impulse train as the input, but I'm going to use this equation for the filter.
It looks a little complicated, but really it's not.
It says that the output at time t is equal to 0.1 times the input at that time, plus 0.3 of the previous input, plus 0.5 of the one before that, plus 0.3 of the one before that, plus 0.1 of the one before that.
Since we know all the input samples at all the previous times, that's easy.
We can apply that equation.
It does the following: it produces this y from an impulse train.
To make that a bit clearer, let's zoom right in.
I've zoomed in time quite a lot here.
We're seeing one impulse in the impulse train input, in x.
We apply the equation; we get the following output.
If we changed the coefficients of this equation - if we change these values - we would change the shape here.
They're directly related to those values.
This is a very simple equation.
It just says that y is a weighted sum of the samples from x.
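The weighted sum can be sketched in a few lines of Python. The weights 0.1, 0.3, 0.5, 0.3, 0.1 are the ones described above; the function name, the signal length, and the position of the impulse are assumptions for illustration, and input samples before the start of the signal are assumed to be zero.

```python
def weighted_sum_filter(x, weights):
    """y[t] = weights[0]*x[t] + weights[1]*x[t-1] + weights[2]*x[t-2] + ...

    Input samples before t = 0 are assumed to be zero.
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:
                total += w * x[t - k]
        y.append(total)
    return y


weights = [0.1, 0.3, 0.5, 0.3, 0.1]  # values from the description above

# A single impulse at t = 2 (an arbitrary, assumed position).
x = [0.0] * 12
x[2] = 1.0

y = weighted_sum_filter(x, weights)
print(y)
```

For a single input impulse, the output samples are the weights themselves, one after another, starting at the time of the impulse. That is why changing the coefficients directly changes the shape of the output.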
As is almost always the case, the frequency domain is a much better way to understand what this filter is doing.
I've plotted my impulse train and its magnitude spectrum, side-by-side.
I'm going to put that impulse train through the filter and get my y.
But I'm now going to plot the magnitude spectrum of y.
Something interesting has happened!
The impulse train has a flat spectral envelope: it has equal energy in every one of its harmonics.
y is not an impulse train.
Because it's not an impulse train, it must have a different spectral envelope.
Indeed, it does.
It has this spectral envelope.
These lower frequencies have been boosted.
These higher frequencies have been attenuated.
So even this really simple equation here, that says y is a weighted sum of previous samples from x, can do something interesting to the impulse train.
This is a kind of low-pass filtering effect.
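One way to check the low-pass effect numerically is to evaluate the gain that the weighted-sum filter applies to a sinusoid at each frequency. The sketch below does that directly from the weights by evaluating their discrete-time Fourier transform. The function name and the choice of nine frequencies are assumptions for illustration, and this is not necessarily how the plots in the video were produced.

```python
import cmath
import math


def magnitude_response(weights, omega):
    """Gain of the weighted-sum filter at frequency omega (radians per sample).

    Computed as |sum_k weights[k] * exp(-j * omega * k)|.
    """
    return abs(sum(w * cmath.exp(-1j * omega * k) for k, w in enumerate(weights)))


weights = [0.1, 0.3, 0.5, 0.3, 0.1]  # values from the description above

# Evaluate the gain from 0 up to pi radians per sample (half the sampling rate).
num_points = 9
for i in range(num_points):
    omega = math.pi * i / (num_points - 1)
    gain = magnitude_response(weights, omega)
    print(f"omega = {omega / math.pi:.3f} * pi   gain = {gain:.3f}")

gain_low = magnitude_response(weights, 0.0)       # lowest frequency
gain_high = magnitude_response(weights, math.pi)  # highest frequency
```

The gain at the lowest frequency is the sum of the weights (1.3, so greater than 1), while the gain at the highest frequency is much smaller than 1. That matches the boosting of low frequencies and attenuation of high frequencies described above.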
But that form of equation is just one option.
What about making the current output sample y[t] also depend on previous output samples, because we have those available?
Here's a more interesting filter.
We'll stick with our impulse train as input.
We'll write this equation.
This part of the equation is our trivial filter from before; it says y at the current time is just equal to x at the current time.
The other part of the equation says that we're going to weight some previous samples from y and add those in.
The weights could be positive or negative, that's OK.
With weights of -0.1, +0.3, -0.8, we're going to combine previous output samples.
In this equation, if we assume that this value here always has to be 1, then there are three coefficients available for us to play with.
With this particular set of coefficients, we'll get this output.
That's doing something much more interesting than the previous filter did.
With just those three coefficients, we can produce this very interesting behaviour.
There's some oscillating behaviour.
For each input impulse, we get an oscillating signal output.
Then we put in another impulse a little bit later on and get another oscillating output.
That oscillating is at some frequency and that frequency will be governed by the values of the coefficients in this equation.
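Here is a minimal sketch of a filter of this form, where the current output depends on the current input and on three previous outputs. The transcript does not show which previous output sample each weight is applied to, so the three coefficients below are hypothetical values chosen only so that the output rings and then decays; they are not the values used in the video. The function name and signal length are also assumptions.

```python
def recursive_filter(x, feedback):
    """y[t] = 1.0*x[t] + feedback[0]*y[t-1] + feedback[1]*y[t-2] + ...

    Output samples before t = 0 are assumed to be zero.
    """
    y = []
    for t in range(len(x)):
        total = 1.0 * x[t]
        for k, a in enumerate(feedback, start=1):
            if t - k >= 0:
                total += a * y[t - k]
        y.append(total)
    return y


# Hypothetical coefficients (NOT those from the video), chosen for illustration.
feedback = [2.0, -1.65, 0.45]

# A single impulse at t = 0.
x = [0.0] * 200
x[0] = 1.0

y = recursive_filter(x, feedback)

# Count how often the output crosses zero: a simple sign of oscillation.
sign_changes = sum(1 for a, b in zip(y, y[1:]) if a * b < 0)

print([round(v, 4) for v in y[:12]])
print("sign changes:", sign_changes)
```

Although the input is a single impulse, the output keeps swinging between positive and negative values while its amplitude shrinks. Changing the values in `feedback` changes how fast it oscillates and how quickly it dies away.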
Again, always better to look in the frequency domain.
Here we've got the time domain and frequency domain side-by-side.
Our impulse train has its flat spectral envelope, but our output now has this characteristic peak.
In other words, our filter has a resonance.
This simple equation is a resonator: it produces, from impulses, this oscillating behaviour.
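The resonance can also be located numerically. For a filter of the form y[t] = x[t] + a1 y[t-1] + a2 y[t-2] + ..., the gain at frequency omega is 1 / |1 - a1 exp(-j omega) - a2 exp(-2j omega) - ...|. The sketch below evaluates that on a grid of frequencies and finds the largest value. The coefficients are hypothetical (the transcript does not give the exact equation used in the video), and the function name and grid size are assumptions.

```python
import cmath
import math


def recursive_gain(feedback, omega):
    """Gain at omega (radians per sample) of
    y[t] = x[t] + feedback[0]*y[t-1] + feedback[1]*y[t-2] + ...
    """
    denominator = 1.0 - sum(
        a * cmath.exp(-1j * omega * k) for k, a in enumerate(feedback, start=1)
    )
    return 1.0 / abs(denominator)


# Hypothetical coefficients (NOT those from the video), chosen for illustration.
feedback = [2.0, -1.65, 0.45]

# Frequencies from 0 to pi radians per sample (half the sampling rate).
num_points = 513
omegas = [math.pi * i / (num_points - 1) for i in range(num_points)]
gains = [recursive_gain(feedback, w) for w in omegas]

peak_index = max(range(num_points), key=lambda i: gains[i])

print(f"gain at lowest frequency:  {gains[0]:.3f}")
print(f"gain at highest frequency: {gains[-1]:.3f}")
print(f"peak gain {gains[peak_index]:.3f} at {omegas[peak_index] / math.pi:.3f} * pi")
```

The largest gain is found at a frequency strictly between the lowest and highest, and it is well above the gain at either end: a single resonant peak in the spectral envelope.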
This is starting to look a little bit like speech.
It's got the two most important properties of speech.
In the time domain, there is some periodicity, which is related to the vocal folds.
Within each of those periods, there's interesting oscillating behaviour.
In the frequency domain, there is a peak - a resonance, and that's related to what's happening inside those periods.
There's also this line structure and that's related to the periodicity of the source.
There's the output of our filter when we put in an impulse train.
Each impulse provokes this resonating, oscillating behaviour.
Sometimes we can call that 'ringing'.
That ringing decays away, and then the next impulse excites another ringing behaviour.
Zoom in and take a look at one of those.
This is the output of the filter in response to a single input impulse.
This reminds me of other resonant objects.
How about the swing?
If we take a swing and we give it a single push - just one impulse - it will swing backwards and forwards, but with slowly decaying amplitude: the energy is dissipated.
If we plotted the movement of the swing, it would look something like this.
We're working towards a complete computational model that can generate any speech signal.
Now we're ready to build a filter that models how the vocal tract behaves, as part of that complete computational model.
We established that the vocal tract has a number of resonant frequencies called formants.
So we need a filter that has a number of resonances.
We need to be able to choose how many, and we need to be able to control their frequencies.
The last form of the filter that we saw has what we need.
We just need to increase its complexity so that it can have more than one resonance.
We need to choose the right values for the coefficients, so that it has the right frequencies of those resonances.
How about this equation here?
It's got more coefficients than the previous one.
We're always going to assume this one here is fixed.
So this has got one, two, three, four coefficients.
With four coefficients, we can get two resonances.
This is the response of this filter to an impulse train input.
In the time domain, we can see that oscillating behaviour in response to each input impulse.
But it's much more obvious in the frequency domain that there are two resonant peaks.
The frequencies of those resonances are controlled by the values of these coefficients.
The exact relationship between the values of the coefficients and the frequencies of these two formants is a little complicated.
It doesn't really matter at this point.
All we need to know is that we can vary those four numbers and change the resonant frequencies.
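To illustrate that four coefficients can give two resonances, and that changing them moves the peaks, the sketch below finds the peaks in the gain of a filter of the form y[t] = x[t] + a1 y[t-1] + a2 y[t-2] + a3 y[t-3] + a4 y[t-4]. Both sets of coefficients are hypothetical values chosen for illustration, not the ones used in the video, and the function names and grid size are assumptions.

```python
import cmath
import math


def recursive_gain(feedback, omega):
    """Gain at omega (radians per sample) of
    y[t] = x[t] + feedback[0]*y[t-1] + feedback[1]*y[t-2] + ...
    """
    denominator = 1.0 - sum(
        a * cmath.exp(-1j * omega * k) for k, a in enumerate(feedback, start=1)
    )
    return 1.0 / abs(denominator)


def peak_frequencies(feedback, num_points=513):
    """Return the frequencies (radians per sample) of local maxima in the gain,
    searched on a grid strictly between 0 and pi."""
    omegas = [math.pi * i / (num_points - 1) for i in range(num_points)]
    gains = [recursive_gain(feedback, w) for w in omegas]
    return [
        omegas[i]
        for i in range(1, num_points - 1)
        if gains[i] > gains[i - 1] and gains[i] > gains[i + 1]
    ]


# Two hypothetical sets of four coefficients (NOT those from the video).
coefficients_a = [1.0, -1.05, 0.9, -0.81]
coefficients_b = [0.2, -0.6, 0.18, -0.81]

peaks_a = peak_frequencies(coefficients_a)
peaks_b = peak_frequencies(coefficients_b)

print("set A peaks (in units of pi):", [round(w / math.pi, 3) for w in peaks_a])
print("set B peaks (in units of pi):", [round(w / math.pi, 3) for w in peaks_b])
```

Each set of four numbers gives two peaks, and the two sets place those peaks at different frequencies. The frequencies here are in radians per sample; multiplying by the sampling rate and dividing by 2π would convert them to Hz.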
I've done that here.
That's one set of coefficients, and here's three more.
The only difference between these four plots is that I've changed the coefficients in that equation.
All of them have the same input impulse train and all of them generate a synthetic speech-like signal.
You can see that they all have different spectral envelopes: the peaks are in different places.
I'm not going to play these ones just yet, because we haven't quite developed the full model.
What we've done so far is to take care of the filter.
We've found a mathematical equation - a really, really simple equation with just a few coefficients - that has the property of resonance and can have multiple resonances.
That's going to model the vocal tract for us.
We're going to examine in a little bit more detail its response to a single impulse, because it's important to understand.
We're going to then take a train of such impulses, put them into the filter, and generate speech.
That will be the complete model.
There will be a source (such as an impulse train), a filter (with an equation in the form that we just saw), and that will be our model of speech: the source-filter model.