Short-term analysis

Because speech sounds change over time, we need to analyse only short regions of the signal. We convert the speech signal into a sequence of frames.


Short-term analysis is the first step that takes us out of the time domain and into some other domain, such as the frequency domain.
Here's a spoken word.
Clearly, its properties vary over time.
The amplitude varies.
Or, for example, some parts are voiced: this is voiced, this is voiced.
But other parts are not: this part's unvoiced.
Apart from measuring the total duration, it makes no sense to analyse any other properties of a whole utterance.
For example, F0 doesn't just vary over time: it exists only in the voiced regions and not at all in the unvoiced parts.
Because short-term analysis is the first step, in general we need to perform it without knowing anything about the waveform.
For example, in Automatic Speech Recognition, the analysis takes place before we know what words have been spoken.
So we can't do the following: we can't segment the speech into linguistically-meaningful units and then perform some specific analysis, for example, on this voiced fricative, or this vowel, or this unvoiced fricative.
Rather, we need to use a general-purpose method, which doesn't require any knowledge of the contents of the signal.
To do that, we're going to just divide the signal into uniform regions and analyse each one separately.
These regions are called frames and they have a fixed duration, and that duration is something we have to choose.
Here's the plan: we'll take our whole utterance and we'll zoom into some shorter region of that and perform some analysis.
Then we shift forward in time, analyse that region, then move forward again, analyse that region, and so on, working from the start of the utterance to the end in some fixed steps.
The first thing to decide is how much waveform to analyse at any one time.
The properties of the waveform in front of you clearly vary substantially, so we need a shorter region than that.
We'll define a frame of the waveform first by choosing a window function and then multiplying the waveform by this window function.
My window here is the simplest possible one: a rectangular window that is zero everywhere except within the frame I wish to analyse, where it's got a value of one.
We multiply the two, sample by sample, and obtain a frame of waveform that's - if you like - "cut out" of the whole utterance.
This cut-out fragment of waveform is called a frame.
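To make that multiplication concrete, here is a minimal sketch in Python with NumPy. The function name `cut_frame_rectangular`, the 16 kHz sampling rate and the sine-wave test signal are all assumptions made for illustration; they are not from the video.

```python
import numpy as np

def cut_frame_rectangular(waveform, start, length):
    """Cut one frame out of a waveform using a rectangular window."""
    # The window is zero everywhere except within the frame, where it is one.
    window = np.zeros(len(waveform))
    window[start:start + length] = 1.0
    # Multiply the two, sample by sample.
    windowed = waveform * window
    # Keep only the samples that lie inside the window: that is the frame.
    return windowed[start:start + length]

# Hypothetical test signal: 1 second of a 100 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 100 * t)

# 400 samples at 16 kHz is a 25 ms frame.
frame = cut_frame_rectangular(waveform, start=4000, length=400)
```

With a rectangular window, the result is the same as simply slicing `waveform[4000:4400]`; writing it as a multiplication just makes it easy to swap in a different window shape.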
We're then going to move forward a little bit in time and cut out another frame for analysis.
So here's the process:
Cut out a frame of waveform: that's ready for some subsequent analysis.
Move forward some fixed amount in time, cut out another frame, and so on to get a sequence of frames cut out of this waveform.
That's done simply by sliding the window function across the waveform.
Let's take a closer look at one frame of the waveform.
Because I've used the simplest possible rectangular window, we've accidentally introduced something into the signal that wasn't there in the original.
Those are the sudden changes at the edges of the frame.
These are artefacts: that means something we introduced by our processing, that's not part of the original signal.
If we analysed this signal we'd not only be analysing the speech but also those artefacts.
So we don't generally use rectangular window functions because these artefacts are bad, but rather we use tapered windows.
When we cut out a frame, it doesn't look like this, but it's cut out with a window function that tapers towards the edges.
Think of that as a fade-in and a fade-out.
That gives us a frame of waveform that looks like this: it doesn't have those sudden discontinuities at the edges.
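As a rough sketch of the difference, the Python snippet below cuts the same frame twice, once with a rectangular window and once with a tapered one. The video doesn't name a particular tapered window, so the Hann window (`np.hanning`) is an assumption here, as are the 16 kHz sampling rate and the cosine test signal.

```python
import numpy as np

# Hypothetical test signal: 1 second of a 100 Hz cosine sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = np.cos(2 * np.pi * 100 * t)

start, length = 4000, 400  # a 25 ms frame starting at 0.25 s

# Rectangular window: the frame is just the raw samples, so it starts
# and ends abruptly at whatever value the waveform happens to have.
rect_frame = waveform[start:start + length] * np.ones(length)

# Tapered window (Hann, an assumed choice): a fade-in and a fade-out,
# so the frame starts and ends at zero.
taper = np.hanning(length)
tapered_frame = waveform[start:start + length] * taper

print("rectangular frame, first sample:", rect_frame[0])
print("tapered frame, first sample:    ", tapered_frame[0])
```

The rectangular frame starts at the peak of the cosine, which is a sudden jump from nothing, whereas the tapered frame starts and ends at zero.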
Here's the complete process of extracting frames from a longer waveform using a tapered window.
Typical values for speech will be a frame duration of 25 ms and a frame shift of something less than that, because we're using these tapered windows.
To avoid losing any waveform, we need to overlap the analysis frames.
So, we'll extract one frame, then the next, and the next, each one having a duration of 25 ms and each one being 10 ms further on in time than the previous one.
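The whole process can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not a definitive implementation: the function name `extract_frames`, the Hann window, the 16 kHz sampling rate and the random test signal are all assumptions, and only the 25 ms duration and 10 ms shift come from the video.

```python
import numpy as np

def extract_frames(waveform, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Convert a waveform into a sequence of overlapping, tapered frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000))  # samples per frame
    shift = int(round(sample_rate * shift_ms / 1000))      # samples per step
    if len(waveform) < frame_len:
        return np.empty((0, frame_len))
    # Count only complete frames; a trailing partial frame is dropped
    # (an assumed design choice, padding with zeros is another option).
    num_frames = 1 + (len(waveform) - frame_len) // shift
    window = np.hanning(frame_len)  # tapered window (Hann, assumed)
    # Slide the window across the waveform in fixed steps.
    return np.stack([
        waveform[i * shift : i * shift + frame_len] * window
        for i in range(num_frames)
    ])

# Hypothetical input: 1 second of noise standing in for speech at 16 kHz.
sample_rate = 16000
waveform = np.random.default_rng(0).standard_normal(sample_rate)

frames = extract_frames(waveform, sample_rate)
print(frames.shape)  # (number of frames, samples per frame)
```

At 16 kHz, a 25 ms frame is 400 samples and a 10 ms shift is 160 samples, so consecutive frames overlap by 240 samples.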
We've converted our complete utterance into a sequence of frames.
This representation - the sequence of frames - is the basis for almost every possible subsequent analysis that we might perform on a speech signal, whether that's estimating F0 or extracting features for Automatic Speech Recognition.
With a speech utterance broken down now into a sequence of frames, we're ready to actually do some analysis.
Our first, and most important, destination is the frequency domain, which escapes many of the limitations of doing analysis directly in the time domain.
Now, to get to the frequency domain, we're going to use Fourier analysis, and that will actually be introduced in two stages.
First we'll have the rather abstract concept of Series Expansion, and then we'll use that to explain Fourier analysis itself.