Spectral envelope estimation

Until now, we have conflated the vocal tract frequency response with the spectral envelope. We now take a strictly signal-based view of speech, and define the spectral envelope more carefully.

This video just has a plain transcript, not time-aligned to the video.
THIS IS AN UNCORRECTED AUTOMATIC TRANSCRIPT. IT MAY BE CORRECTED LATER IF TIME PERMITS.
The other important aspect of the speech signal to extract, now that we've got F0, is the spectral envelope. Let's just remind ourselves again.
We are now moving on from the idea of an explicit source-filter model, where we imagine the source is the vocal folds and the filter is the vocal tract, to a more abstract model.
It's still the idea of source and filter, but now strictly fitted to the signal as we observe it.
We don't care so much if we get the true source or the true filter, so long as we can achieve our aims of manipulation or modification or whatever they might be.
So we're now going to try and get at this spectral envelope.
It might be something that joins up the tops of all the harmonics.
We need to worry about how to get that without accidentally also fitting our envelope to the detail, the harmonics, which is a source feature that we don't want.
We can already see that here in the FFT spectrum.
This is just computed through a Fourier transform, and we get both source and filter features.
We get both envelope and detail, and what we now want to do is to get rid of the detail, to get rid of these harmonics, and to get only this envelope.
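To make "both envelope and detail" concrete, here is a minimal sketch that builds a toy voiced sound and takes its FFT spectrum with a long window. Everything in it is an assumption for illustration rather than anything from the lecture: the 16 kHz sampling rate, the 125 Hz F0, the exponentially decaying envelope and the helper names `harmonic_signal` and `magnitude_spectrum_db`.

```python
import numpy as np

FS = 16000    # sampling rate in Hz (assumed for this sketch)
F0 = 125.0    # fundamental frequency in Hz (assumed)

def harmonic_signal(f0, fs, n_samples, tilt_hz=1500.0):
    """Toy voiced sound: harmonics of f0 under a smoothly decaying envelope."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, int((fs / 2 - 1) // f0) + 1)      # harmonic numbers below Nyquist
    amps = np.exp(-k * f0 / tilt_hz)                   # hypothetical spectral envelope
    return (amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)

def magnitude_spectrum_db(frame, n_fft):
    """Hann-windowed FFT magnitude spectrum in dB."""
    windowed = frame * np.hanning(len(frame))
    return 20 * np.log10(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-12)

n_fft = 1024
frame = harmonic_signal(F0, FS, n_fft)           # 64 ms, i.e. eight pitch periods
spec_db = magnitude_spectrum_db(frame, n_fft)

bins_per_harmonic = int(round(F0 * n_fft / FS))  # 8 FFT bins between harmonics
peak = spec_db[bins_per_harmonic]                # on the first harmonic
valley = spec_db[bins_per_harmonic + bins_per_harmonic // 2]   # halfway to the second
print(f"harmonic peak {peak:.1f} dB, valley {valley:.1f} dB")
```

The slow decay across the harmonic peaks is the envelope (a filter-like feature), and the large peak-to-valley gap between neighbouring harmonics is the detail (a source feature).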
Some techniques for doing that make an assumption about the form of the filter, such as linear prediction, which assumes it's an all-pole filter.
We fit that to the signal and, by controlling the complexity of that filter, that is by not giving it too many coefficients, we make sure we fit only the envelope and not the detail.
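As a rough illustration of that idea, the sketch below fits an all-pole filter by the autocorrelation method of linear prediction and counts the peaks in the resulting envelope. It assumes a toy harmonic signal at 16 kHz with a 125 Hz F0 and an order of 10; the function names are made up for this example, and the envelope is left without a gain term, so only its shape is meaningful.

```python
import numpy as np

FS, F0 = 16000, 125.0     # assumed sampling rate and F0 for the toy signal

def harmonic_signal(f0, fs, n_samples, tilt_hz=1500.0):
    """Toy voiced sound: harmonics of f0 under a smoothly decaying envelope."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, int((fs / 2 - 1) // f0) + 1)
    amps = np.exp(-k * f0 / tilt_hz)
    return (amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)

def lpc_coefficients(frame, order):
    """Autocorrelation-method linear prediction; returns A(z) = 1 - sum_k a_k z^-k."""
    windowed = frame * np.hanning(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    r = full[len(frame) - 1:len(frame) + order]        # autocorrelation lags 0..order
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    predictor = np.linalg.solve(r[lags], r[1:])        # Toeplitz normal equations
    return np.concatenate(([1.0], -predictor))

def lpc_envelope_db(frame, order, n_fft=1024):
    """Shape of the all-pole envelope 1/|A| in dB (gain term omitted)."""
    a = lpc_coefficients(frame, order)
    return -20 * np.log10(np.abs(np.fft.rfft(a, n_fft)) + 1e-12)

def count_peaks(curve):
    """Number of strict local maxima."""
    return int(np.sum((curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])))

frame = harmonic_signal(F0, FS, 1024)
fft_db = 20 * np.log10(np.abs(np.fft.rfft(frame * np.hanning(1024))) + 1e-12)
env_db = lpc_envelope_db(frame, order=10)
print(count_peaks(fft_db), "peaks in the FFT spectrum,",
      count_peaks(env_db), "in the order-10 all-pole envelope")
```

With only ten coefficients the filter simply cannot place a peak on every harmonic, which is the sense in which limiting the order keeps the fit on the envelope.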
We're not going to do that here.
We're going to do something more directly signal-based.
We're going to follow along with a famous vocoder that's widely used in parametric synthesis, called STRAIGHT, from Kawahara.
In this paper, they state that it's important to pick the correct analysis window size. Well, they say the following.
Remember that computing the spectrum involves taking a frame of speech, choosing the duration of that frame, and putting it through the Fourier transform.
If the duration of that analysis frame or window is comparable to the fundamental period, T0 (which is 1 over F0), then the power spectrum varies in the time domain.
But if the time window for analysis spans many pitch periods, we get periodic variation in the frequency domain, and that's just stating formally something we actually already know.
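The time-domain half of that statement can be checked numerically. The sketch below slides a tapered window over a toy periodic signal and measures how much the energy inside the window fluctuates from frame to frame. The signal, the 16 kHz sampling rate, the 125 Hz F0, the 1 ms hop and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

FS, F0 = 16000, 125.0
PERIOD = int(FS / F0)     # 128 samples per pitch period

def harmonic_signal(f0, fs, n_samples, tilt_hz=1500.0):
    """Toy voiced sound: harmonics of f0 under a smoothly decaying envelope."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, int((fs / 2 - 1) // f0) + 1)
    amps = np.exp(-k * f0 / tilt_hz)
    return (amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)

def frame_energy_variation(x, win_len, hop):
    """Coefficient of variation (std / mean) of the windowed frame energy."""
    n = np.arange(win_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / win_len)    # periodic Hann taper
    starts = range(0, len(x) - win_len + 1, hop)
    energy = np.array([np.sum((x[s:s + win_len] * window) ** 2) for s in starts])
    return float(energy.std() / energy.mean())

x = harmonic_signal(F0, FS, FS // 2)                       # half a second of signal
short = frame_energy_variation(x, PERIOD, hop=16)          # one pitch period
long_ = frame_energy_variation(x, 8 * PERIOD, hop=16)      # eight pitch periods
print(f"one-period window: {short:.3f}, eight-period window: {long_:.3f}")
```

The one-period window gives an energy that rises and falls with every pitch pulse, whereas the eight-period window gives an essentially constant energy. The price of the long window is that the periodicity shows up in the frequency domain as resolved harmonics instead.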
So go and get a speech signal and open it up in your favourite speech editor, such as Praat, or in this case I've used Wavesurfer, and try changing the size of the analysis window whilst you're calculating a spectrum.
In Wavesurfer, we do that with these controls.
If we set that analysis window size relatively small, so it becomes comparable in size to the fundamental period (one or two fundamental periods), then in the time domain in the spectrogram, going along the time axis, we see variation: a fluctuation in the power.
We see these striations this way.
These are the individual pitch pulses, and that's because the analysis window is so small: as it slides forward across the waveform, the power of the waveform inside that window goes up and down, period by period.
Sometimes you have a lot of energy, sometimes you have less energy, and that's what we're seeing in the spectrogram with those fluctuations in darkness.
So a very short window, as we already knew, gives us very good time resolution but relatively poor frequency resolution.
Kawahara et al. also remind us that when we use a very long analysis window, maybe a long window like this with many pitch periods inside it, we no longer see that fluctuation in the time domain, because now the average amount of energy falling inside that window is very much constant, or nearly constant.
It doesn't go up and down with individual pitch periods falling inside or not inside the window.
Because it's a long analysis window, we get very good frequency resolution.
That's that axis.
And so now we don't see those vertical striations.
We see these horizontal striations, and that's what long analysis windows give us.
That is what we're getting in this view here.
So this FFT spectrum was clearly calculated with a relatively long analysis window, because it's resolving the individual harmonics. When viewed on a spectrogram, those are those horizontal striations.
So STRAIGHT uses this insight to do something pretty clever.
STRAIGHT sets its analysis window, the thing just before the Fourier transform, to a size that's adaptive to the fundamental period, that is, adaptive to F0.
So by varying the size of the analysis window with F0, for example making it exactly two pitch periods in duration, we ensure that the amount of energy falling inside that window is fairly constant as the window slides across the waveform at a fixed frame rate.
And if we make that analysis window what Kawahara calls comparable to the fundamental period, for example twice the fundamental period, we won't get that high frequency resolution which resolves the harmonics, which we don't want.
So STRAIGHT basically does this clever trick of an F0-adaptive window that minimises the interference between the harmonics and the final extracted spectral envelope.
Then that's just standard FFT analysis, and then it does some smoothing to remove any remaining interference from the source.
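The windowing step can be sketched as follows: frames are taken at a fixed rate, but each frame's length follows the local F0. This is only an illustration of the idea as described here, not STRAIGHT's actual implementation. The function names, the Hann taper, the edge handling by index clipping and the assumption that an F0 value is available for every frame (no unvoiced frames) are all choices made for this example.

```python
import numpy as np

def adaptive_window_length(f0_hz, fs, n_periods=2.0):
    """Window length in samples spanning n_periods fundamental periods."""
    return int(round(n_periods * fs / f0_hz))

def f0_adaptive_frames(x, f0_track, fs, hop_s=0.005, n_periods=2.0):
    """Yield (centre_sample, tapered_frame) at a fixed frame rate.

    f0_track holds one F0 value in Hz per frame. Unvoiced frames are not
    handled, and samples beyond the signal edges are replaced by the edge
    sample (a simplification for this sketch).
    """
    hop = int(round(hop_s * fs))
    for i, f0 in enumerate(f0_track):
        centre = i * hop
        length = adaptive_window_length(f0, fs, n_periods)
        start = centre - length // 2
        idx = np.clip(np.arange(start, start + length), 0, len(x) - 1)
        yield centre, x[idx] * np.hanning(length)

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)    # placeholder signal, one second
f0_track = np.linspace(100.0, 200.0, 50)            # hypothetical rising F0 contour
lengths = [len(frame) for _, frame in f0_adaptive_frames(x, f0_track, fs)]
print(lengths[0], lengths[-1])                      # window shrinks as F0 rises
```

Each tapered frame would then go through an ordinary FFT, with smoothing applied to the result.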
As we'll see a little bit later, there is more to speech parameters than F0 and the spectral envelope, because we've been neglecting the non-periodic energy.
What do we do with fricatives, for example?
STRAIGHT parameterises that, of course, and it does that by estimating a ratio between periodic and aperiodic energy. We'll come on to that in a moment.
When we've done all that, we'll have a complete parameterisation of the speech signal, and by complete I mean we could reconstruct the waveform from it.
So we'll just need to see how to do that in the synthesis phase.
So, like many vocoders, we can break STRAIGHT into an analysis phase and a synthesis phase.
If we run one after the other, sometimes that's called copy synthesis or just analysis-synthesis, and that will just give us a vocoded speech signal.
Or, as we're going to see, we can break it into two parts: do analysis and then statistical modelling, and then use the synthesis phase to regenerate a waveform when we're synthesising unseen sentences.
But that's coming later.
Let's just check that Kawahara is right, which, of course, he is.
If we change the analysis window size (again, I did this in Wavesurfer, and you can try this for yourself), we'll see the effect of reducing interference from the source.
So here I've got my time-domain waveform, so that's time, and it's at a 16 kHz sampling frequency.
I've taken an analysis window, which I've drawn on the waveform there, of about two pitch periods.
Remember that a tapered window is applied before analysis, which is why we don't make it one pitch period.
So this thing will be faded in and faded out.
And then that will move forward in fixed steps, say every five milliseconds.
Analysing that window that's about two pitch periods long gives me this result, and we can see here there's little or no evidence of the harmonics.
We've got essentially the envelope.
It's not particularly smooth, but we could easily fix that with some simple moving average or median smoothing or something like that.
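Both kinds of smoothing are only a few lines each. The sketch below applies them along the frequency axis of a made-up rough envelope; the envelope, the amount of roughness and the width of nine bins are assumptions for illustration, and this is not the particular smoothing that STRAIGHT itself applies.

```python
import numpy as np

def moving_average(values, width):
    """Centred moving average with edge padding; width should be odd."""
    padded = np.pad(values, width // 2, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")

def median_smooth(values, width):
    """Centred running median with edge padding; width should be odd."""
    padded = np.pad(values, width // 2, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(values))])

rng = np.random.default_rng(0)
bins = np.arange(257)
true_env_db = -0.1 * bins                                   # hypothetical smooth envelope
rough_db = true_env_db + rng.normal(0.0, 2.0, bins.size)    # residual roughness

raw_error = np.std(rough_db - true_env_db)
for name, smoother in [("moving average", moving_average), ("median", median_smooth)]:
    error = np.std(smoother(rough_db, 9) - true_env_db)
    print(f"{name}: residual {error:.2f} dB (was {raw_error:.2f} dB)")
```

The moving average blurs everything equally, while the median is less affected by isolated outlying bins; which one suits better depends on what the remaining roughness looks like.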
If we took a longer analysis window, so if I take one that's four times longer, perhaps something like this, then I would get this result.
And now I'm resolving the harmonics.
So I've got interference from the source, and there's not this smooth spectral envelope.
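You can reproduce this comparison on a synthetic signal. The sketch below measures the average gap, in dB, between each harmonic peak and the point halfway to the next harmonic, for a two-period window and for one four times longer. The toy signal, the 16 kHz sampling rate, the 125 Hz F0 and the ripple measure itself are assumptions made for this example. With the short window the result depends on where the window sits relative to the pitch pulse, so the sketch averages over eight frame positions 5 ms apart.

```python
import numpy as np

FS, F0 = 16000, 125.0
PERIOD = int(FS / F0)     # 128 samples per pitch period

def harmonic_signal(f0, fs, n_samples, tilt_hz=1500.0):
    """Toy voiced sound: harmonics of f0 under a smoothly decaying envelope."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, int((fs / 2 - 1) // f0) + 1)
    amps = np.exp(-k * f0 / tilt_hz)
    return (amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)

def harmonic_ripple_db(frame, f0, fs, n_fft=4096, n_harmonics=20):
    """Mean dB gap between each harmonic and the midpoint to the next harmonic."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    spec_db = 20 * np.log10(spec + 1e-9)
    k = np.arange(1, n_harmonics + 1)
    peaks = np.round(k * f0 * n_fft / fs).astype(int)
    valleys = np.round((k + 0.5) * f0 * n_fft / fs).astype(int)
    return float(np.mean(spec_db[peaks] - spec_db[valleys]))

def mean_ripple(x, win_len, centres):
    """Average the ripple over several window positions."""
    half = win_len // 2
    return float(np.mean([harmonic_ripple_db(x[c - half:c + half], F0, FS)
                          for c in centres]))

x = harmonic_signal(F0, FS, 4000)
centres = 1000 + 80 * np.arange(8)                   # frames every 5 ms
ripple_short = mean_ripple(x, 2 * PERIOD, centres)   # about two pitch periods
ripple_long = mean_ripple(x, 8 * PERIOD, centres)    # four times longer
print(f"two periods: {ripple_short:.1f} dB, eight periods: {ripple_long:.1f} dB")
```

The long window leaves deep valleys between the harmonics, while the two-period window leaves only a modest ripple for smoothing to clean up afterwards.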
So what STRAIGHT does is choose window sizes of about this size, and it gets this result from the fast Fourier transform, to which it does a little bit of extra smoothing to get the smooth spectral envelope out.
In the next step, we're going to see that we need to parameterise this smooth spectral envelope for various reasons, including reducing its dimensionality.
A standard way to do that is with the cepstrum on a Mel scale, so Mel-cepstral analysis.
This figure demonstrates what happens if we do that directly on the FFT spectrum versus doing it on the STRAIGHT smooth spectral envelope.
If we take the FFT spectrum, warp the frequency scale to the Mel scale, and then do Mel-cepstral analysis using a rather large number of cepstral coefficients, we'll fit in detail to all of the harmonics, as we can see down here.
The reason that it fits better at low frequencies than at high frequencies is that we're doing this on a warped frequency scale.
So our resolution is reduced up in these high frequencies.
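To see why a warped scale gives less resolution at high frequencies, compare how wide the same 100 Hz band is on a Mel scale at the bottom and at the top of the spectrum. The lecture does not say which warping function is used, so the formula below is an assumption: it is one commonly quoted Mel formula, and the function name `hz_to_mel` is just for this sketch.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One commonly used Mel-scale formula (an assumption; other warpings exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Width of a 100 Hz band on the warped scale, at low and at high frequency.
low_band = float(hz_to_mel(600.0) - hz_to_mel(500.0))
high_band = float(hz_to_mel(7100.0) - hz_to_mel(7000.0))
print(f"500-600 Hz spans {low_band:.1f} mel, 7000-7100 Hz spans {high_band:.1f} mel")
```

Harmonics are equally spaced in Hz, so on the warped axis they are packed more tightly at high frequencies, and a fixed number of cepstral coefficients follows them less closely there.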
However, if we first use STRAIGHT to get the spectral envelope that's free from interference from the harmonics and then apply this Mel-cepstral analysis, we'll get this red envelope, and we can see that that's relatively independent of the harmonics.
It might have some other problems.
Maybe it's a little bit too smooth.
It doesn't capture every little bit of detail, but it will be independent, relatively speaking, in a statistical sense, from the value of F0.
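The effect of the number of cepstral coefficients can be sketched with plain cepstral truncation. Note the simplifications: this uses the ordinary cepstrum on a linear frequency scale, not Mel-cepstral analysis, and it is applied to the FFT spectrum of a toy harmonic signal. The signal, the coefficient counts of 200 and 30, and the function names are all assumptions for illustration.

```python
import numpy as np

FS, F0 = 16000, 125.0     # assumed sampling rate and F0 for the toy signal

def harmonic_signal(f0, fs, n_samples, tilt_hz=1500.0):
    """Toy voiced sound: harmonics of f0 under a smoothly decaying envelope."""
    t = np.arange(n_samples) / fs
    k = np.arange(1, int((fs / 2 - 1) // f0) + 1)
    amps = np.exp(-k * f0 / tilt_hz)
    return (amps[:, None] * np.cos(2 * np.pi * f0 * k[:, None] * t)).sum(axis=0)

def cepstral_smooth(log_spec, n_coef):
    """Rebuild a log spectrum from only its first n_coef cepstral coefficients."""
    n_fft = 2 * (len(log_spec) - 1)
    cepstrum = np.fft.irfft(log_spec, n_fft)     # real, symmetric cepstrum
    lifter = np.zeros(n_fft)
    lifter[:n_coef] = 1.0
    lifter[n_fft - n_coef + 1:] = 1.0            # mirrored half keeps the symmetry
    return np.fft.rfft(cepstrum * lifter).real

def count_peaks(curve):
    """Number of strict local maxima."""
    return int(np.sum((curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])))

frame = harmonic_signal(F0, FS, 1024)            # eight pitch periods: harmonics resolved
fft_db = 20 * np.log10(np.abs(np.fft.rfft(frame * np.hanning(1024))) + 1e-6)

many = cepstral_smooth(fft_db, 200)              # reaches past the pitch quefrency (128 samples)
few = cepstral_smooth(fft_db, 30)                # stops well short of it
print(count_peaks(many), "peaks with 200 coefficients,",
      count_peaks(few), "with 30")
```

With many coefficients the fit reproduces a peak on every harmonic, so the representation is tied to F0. With few coefficients the fit is smooth, although on an unsmoothed FFT spectrum like this one it is still pulled around by the valleys between harmonics, which is one motivation for starting from an envelope that is already free of them.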
And that's really what we want, because we want a representation of speech, a parameterisation, in which we decompose the signal into those things that relate to prosody, such as F0, and those things that relate to phonetic identity, such as the spectral envelope, so we can model them separately and we can independently manipulate them if we want.
One reason for motivating independent parameters for prosody and for phonetic identity: think back to unit selection and the sparsity issue of an independent feature formulation target cost.
If we move to an acoustic space formulation, we can escape a lot of that sparsity.
And if, in that acoustic space, we can treat fundamental frequency and spectral envelope independently, we can further combat sparsity, because we can recombine them in different ways.
In other words, if we can separate these things and put them back together in different combinations, we don't need in our database every speech sound in every context at every possible value of F0.
But the main reason for looking at this STRAIGHT spectral envelope, and the way we'd parameterise it for modelling, is to move forward on to statistical parametric speech synthesis.
So the next step is to take what we've learned in speech parameterisation and think about how we're going to represent those parameters before we try and model them.
What's coming next, then, is to think about representations of the speech parameters that are suitable for statistical modelling and, whilst we're going through that, we'll have in our mind the hidden Markov model with a Gaussian probability density function as our key statistical model.