Frame-based analysis

The properties of speech are constantly changing over time, so we need to analyse it in short sections, called frames.

This relates to earlier material on short-term analysis and introduces the idea of extracting salient features from each frame.


This video just has a plain transcript, not time-aligned to the video.
THIS IS AN AUTOMATIC TRANSCRIPT WITH LIGHT CORRECTIONS TO THE TECHNICAL TERMS
So why is a waveform not a good representation for pattern recognition?
Well, we already know that from what we've looked at in the lab. Speech production has a source of sound.
It might be periodic.
It might be aperiodic; it might be noise.
It might be the vocal folds.
That source goes through a filter.
And we've already hopefully realised that most of the information about the segment is not in the value of the fundamental frequency.
In some languages the pitch does carry some information, and in many it doesn't.
The information is mostly in the shape of the vocal tract, so the overall spectral envelope is what we're going to do pattern recognition from.
The speech waveform mixes that spectral envelope with the fine structure that comes from the source.
We remember what the spectrum of a speech sound looks like.
This is frequency.
That's magnitude.
It has some sort of overall shape, and then it has all this fine structure due to the harmonics.
We can generalise that idea and model unvoiced sounds the same way: they have an overall shape and, rather than harmonic structure, they just have noise structure.
It's this envelope, this shape, that we're going to do pattern recognition on. If the shape of the spectral envelope of one sound is like the shape of the spectral envelope of another sound, we'll say these sounds belong to the same category.
That's the same phoneme.
We'll do pattern recognition on that basis.
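
To make the picture of "overall shape plus harmonic fine structure" concrete, here is a minimal sketch in Python. It is not from the lecture: the sample rate, the 100 Hz fundamental, the ten harmonics and their 1/h amplitudes are all assumed toy values, and the slow, dependency-free DFT is used only for illustration.

```python
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. N/2 (slow, but needs no libraries)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

# Hypothetical toy "voiced" signal: harmonics of a 100 Hz fundamental whose
# amplitudes fall off with frequency (a crude stand-in for a spectral envelope).
sample_rate = 8000   # Hz (assumed)
f0 = 100             # Hz (assumed)
n_samples = 400      # 50 ms of signal, so each DFT bin is 20 Hz wide
harmonic_amps = [1.0 / h for h in range(1, 11)]  # 10 harmonics, decaying as 1/h

frame = [
    sum(a * math.sin(2 * math.pi * f0 * (h + 1) * t / sample_rate)
        for h, a in enumerate(harmonic_amps))
    for t in range(n_samples)
]

mags = magnitude_spectrum(frame)

# Harmonic h sits at frequency h * f0, i.e. at bin h * f0 * n_samples / sample_rate.
harmonic_bins = [round(h * f0 * n_samples / sample_rate) for h in range(1, 11)]

for h, k in enumerate(harmonic_bins, start=1):
    print(f"harmonic {h:2d} at {h * f0:4d} Hz: magnitude {mags[k]:.1f}")
```

The peaks sit only at multiples of the fundamental (the fine structure), while their heights trace out the decaying overall shape (the envelope). Changing `f0` moves the peaks without changing that shape, which is the sense in which the envelope, not the pitch, characterises the sound.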
So we're going to extract the spectral envelope.
Now, there are various ways of doing that.
If we were thinking in terms of synthesis, we'd think: right, we've got this fabulous source-filter model, so we'll fit the source-filter model to the signal.
We'll forget what the source is doing and we'll just look at the filter coefficients.
In speech synthesis, those filter coefficients, for a particular form of filter, might be called linear predictive coefficients.
They're just filter coefficients, and those would indeed be a reasonable candidate for our representation of speech.
We're not going to do that.
We're going to do something a little more direct, and that's actually going to be simpler and computationally cheaper.
It's going to lead us up, eventually, to these other features called Mel Frequency Cepstral Coefficients.
So we already know speech sounds have a spectral envelope.
This evolves over time, so it's never static.
It constantly changes at some rate because our tongue is moving; the articulators are moving.
But in order to do Fourier analysis, we need to make the assumption that over some short period of time, the statistical properties of speech such as the spectral envelope are constant.
And that leads us to something called frame based analysis.
We've already touched on this, so let's just formalise it.
Over some short period of time, perhaps this amount of time, we'll just assume that the vocal tract shape is static.
It doesn't change at all.
So the waveform is essentially perfectly periodic: it just repeats for some short period of time.
A little bit later on, the properties will have slightly changed.
So for some short period of time (we'll call that a frame), we can extract that piece of waveform.
We can perform some analysis on it to extract some features, and those features characterise this bit of speech here. We'll get this thing here.
This is a vector, so I want you to be comfortable with the notation I'm going to use on the slide.
It's all gonna be pictures, but they represent mathematical things.
This is a vector, and the vector is just some numbers.
Let's write a vector here.
It might be something like this.
Just some vector of numbers.
The numbers are the properties of this fragment of speech.
I'm going to draw these vectors like this, this picture here.
You're going to see lots of those.
The question is, then: what do we extract? What do we stack up within this vector? How many numbers do we need? How much speech do we analyse? How far do we move on between the frames?
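
Those last two questions correspond to two parameters, which the sketch below calls `frame_length` and `frame_shift`. This is an illustrative Python sketch, not the lecture's own code: the 16 kHz sample rate and the 25 ms / 10 ms values are assumptions chosen for the example, and `placeholder_features` is a hypothetical stand-in that just produces "some numbers" per frame, not the features this material leads up to.

```python
import math

def split_into_frames(samples, frame_length, frame_shift):
    """Cut a waveform into frames of frame_length samples, moving on by
    frame_shift samples each time. Trailing samples that do not fill a
    whole frame are dropped."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames

def placeholder_features(frame):
    """Hypothetical stand-in feature vector: [log energy, zero-crossing count]."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small constant avoids log(0)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, float(zero_crossings)]

sample_rate = 16000                      # Hz (assumed)
frame_length = sample_rate * 25 // 1000  # 25 ms -> 400 samples (assumed value)
frame_shift = sample_rate * 10 // 1000   # 10 ms -> 160 samples (assumed value)

# One second of a 200 Hz sine wave as a stand-in for a speech waveform.
waveform = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(sample_rate)]

frames = split_into_frames(waveform, frame_length, frame_shift)
features = [placeholder_features(f) for f in frames]  # one vector per frame

print(len(frames), "frames of", len(frames[0]), "samples")
print("first feature vector:", features[0])
```

Because the shift is shorter than the frame length here, consecutive frames overlap; the result is a sequence of feature vectors, one per frame.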