Videos

Total video time to watch in this section: 62 minutes. The first four videos (Time domain, sound source, periodic signal, pitch) overlap with the Module 2 videos but were made by Simon King from a more engineering perspective. You may find them useful for consolidation. The core of this module starts with Digital Signals.

Sound is a wave of pressure travelling through a medium, such as air. We can plot the variation in pressure against time to visualise the waveform.

Sound is a wave and it has to travel in a medium.
Here the medium's air.
So there's air in this space.
Sound is a pressure wave, so a wave of pressure is going to travel through this medium.
Let's make a simple sound: a hand clap.
When we do that, our hands trap some air between them.
That compresses the air: the pressure increases.
Then it escapes as a pulse of higher pressure air.
We can draw a picture of that high pressure air propagating as a wave through the medium.
That red line is indicating a higher pressure region of air.
So, this is our first representation of sound, its propagation through physical space where sound travels at a constant speed.
In air, that speed is about 340 metres per second, which means it takes about 3 seconds to travel a kilometre.
But rather than diagrams like this - of sound waves propagating through space and then disappearing - it's much more informative to make a record of that sound.
We can do that by picking a single point in space and measuring the variation in pressure at that point over time.
We make that measurement with a device and that device is a microphone.
So let's use a microphone to measure the pressure variation at a single point in space and then plot that variation against time.
So a plot needs some axes.
Here, the horizontal axis will be time and the vertical axis will be the amplitude of the pressure variation.
It's very important to label the axes of any plot with both the quantity being measured and its units.
This axis is 'time', so we label it with that quantity: time.
Time has only one unit.
The scientific unit of time is the second and that's written with just 's'.
On the vertical axis, we're going to measure the amplitude of the variation in pressure.
So I've put the quantity 'amplitude' and 0 is the ambient pressure.
But we don't actually have any units on this axis.
That's simply because our microphone normally is not a calibrated scientific instrument.
It just measures the relative variation in pressure and converts that into an electrical signal that is proportional to the pressure variation.
So we just mark the 0 amplitude point but don't normally specify any units.
Now we can make the measurement of our sound.
As a sound wave passes the microphone, the pressure at that point rises to be higher than normal and then drops to be lower than normal, and eventually settles back to the ambient pressure of the surrounding air.
Let's plot the output of the microphone and listen to the signal the microphone is now recording.
We're going to take the output of this microphone and we're going to record this signal - this electrical signal - on this plot.
Here's the plot we just made.
The plot is called a waveform and this is our first actually useful representation of sound.
This representation is in the time domain because the horizontal axis of the plot is time.
Later, we'll discover other domains in which we can represent sound, and we'll plot those using different axes.
The waveform is useful for examining some properties of sound.
For example, here's a waveform of a bell sound.
We can see that, for example, the amplitude is clearly decaying over time.
This is a waveform of speech: 'voice'.
It's the word 'voice' and some of the things we can measure from this waveform would be, again that the amplitude is varying over time in some interesting way, and that this word has some duration.
We could enlarge the scale to see a little bit more detail.
This particular part of the waveform has something quite interesting going on.
It clearly has a repeating pattern; that looks like it's going to be important to understand.
But in contrast, let's look at some other part of this waveform.
Maybe this part here.
It doesn't matter how much you zoom in here, you won't find any repeating pattern.
This is less structured: it's a bit more random.
That's also going to be important to understand.
So far, we've talked about directly plotting the output from a microphone.
Microphones essentially produce an electrical signal: a voltage.
That's an analogue signal: it's proportional to the pressure that's being measured.
But actually we're going to do all of our speech processing with a computer.
Computers can't store analogue signals: they are digital devices.
We're going to need to understand how to represent an analogue signal from a microphone as a digital signal that we can store and process in a computer.
We also already saw that sounds vary over time.
In fact, speech has to vary, because it's carrying a message.
So we'll need to analyse not whole utterances of speech, but just parts of the signal over short periods of time.
Speech varies over time for many reasons, and that's controlled by how it's produced.
So we need to look into speech production, and the first aspect of that that we need to understand is 'What is the original source of sound when we make speech?'

Air flow from the lungs is the power source for generating a basic source of sound either using the vocal folds or at a constriction made anywhere in the vocal tract.

We've seen speech already in the time domain by looking at the waveform.
But how is that speech made?
Well, we need some basic sound source and some way to modify that basic sound source.
The modification, for example, might make one vowel sound different from another vowel sound.
Here we're just going to look at the source of sound, and we'll see two possible sources that can make speech.
Here's someone talking.
He has a vocal tract; that also happens to be useful for breathing and eating, but here we're talking about speaking.
That's just a tube.
For our purposes, it doesn't matter that that's curved.
That's just to fit in our body.
We can think of it as a simple tube, like this.
So here it is, a simplified vocal tract.
At the top here, the lips; at the bottom, the lungs.
The lungs are going to power our sound source.
Airflow from the lungs comes into the vocal tract.
We can block the flow of air with a special piece of our anatomy called the vocal folds.
There they are.
As air keeps flowing from the lungs, the pressure will increase below the vocal folds.
We will get more and more air molecules packed into this tight space.
More tightly packed molecules in the same volume means an increase in pressure.
That's what pressure is: it's the force molecules exert on each other and on their container.
Eventually, the pressure is enough to force its way through the blockage, and the vocal folds burst open.
The higher pressure air from below moves up.
So we get a pulse of higher pressure air bursting through the vocal folds.
That releases the pressure below the vocal folds and they will close again.
Now we have a situation where there is a small region of higher pressure air just here, surrounded by lower pressure air everywhere else.
That's obviously not a stable situation.
This higher pressure air exerts a force on the neighbouring air and a wave of pressure moves up through the vocal tract.
It's important to understand that this wave of pressure is moving at the speed of sound, and that's quite different from the gentle air flow from your lungs: your breathing out.
You don't breathe out at the speed of sound!
Breathing is just the power source for the vocal folds.
The air flow in the vocal tract is much, much slower than the propagation of the pressure wave.
So we can neglect the airflow and just think about this pressure wave moving through air.
A pulse of high pressure has just been released by the vocal folds.
Let's make a measurement of that.
Imagine we could put a microphone just above the vocal folds and measure the pressure there.
The plot might look something like this: an increase in pressure as the pulse escapes, a dip as the pulse moves away, and then a gradual settling back to the ambient pressure.
We've created sound!
Sound is a variation in the pressure of air.
Let's listen to that one pulse - that glottal pulse.
Listen carefully because it's going to be very short.
Just sounds like a click.
Let's do that again.
That's the sound of a glottal pulse created in the glottis.
The glottis is a funny thing.
It's the anatomical name for the gap between the vocal folds.
Of course, if the lungs keep pushing air, the pressure will build up again.
After some short period of time, the vocal folds will burst open again and we'll get another pulse.
That will repeat for as long as the air is being pushed by the lungs.
Remember, the lungs are the power source of the system.
The actual signal will be a repeating sequence of pulses.
I'm going to play this pulse now, not in isolation, but I'm going to play it 100 times per second.
It sounds like this.
Well, it's not speech, but it's a start.
For our purposes, which eventually are going to be to build a model of speech production that we can use for various things, the actual shape of the pulse turns out not to be very important.
Let's try simplifying that down to the simplest possible pulse.
That's this signal here, that is zero everywhere and goes up to a maximum value instantaneously and then back down again.
Let's listen to that.
Again, listen carefully.
It sounds pretty similar to the other pulse, just like a click.
We can play a rapid succession of such clicks.
Let's start with a very slow rate of just 10 per second.
Perceptually that's still just a sequence of individual clicks, so I'll increase the rate now to 40 per second.
I can't quite make out individual clicks now.
It's starting to sound like a continuous sound.
If we go up to 100 per second, it's definitely a continuous buzzing sound.
So, although we're talking about speech production here, we've learned something interesting about speech perception already: that once the rate of these pulses is high enough, we no longer hear individual clicks but we integrate that into a continuous sound.
This pulse train signal is going to be a key building block for us.
It's going to be initially just for understanding speech.
That's what we're doing at the moment.
We're going to use it later actually, as the starting point for generating synthetic speech.
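If you want to hear this for yourself, here is a small Python sketch (not from the video) that builds a simple pulse train and writes it to a WAV file; the 16 kHz sampling rate, the one-second duration, and the use of NumPy and SciPy are my own choices.

```python
# A minimal sketch: a pulse train you can listen to (assumed: 16 kHz, 1 second).
import numpy as np
from scipy.io import wavfile

fs = 16000                               # sampling rate in Hz (assumed value)
rate = 100                               # pulses per second, as in the video's example
pulse_train = np.zeros(fs)               # one second of silence
pulse_train[::fs // rate] = 1.0          # a single-sample pulse every 1/100th of a second

# Scale to 16-bit integers and save, so it can be played back.
wavfile.write("pulse_train_100Hz.wav", fs, (pulse_train * 32767).astype(np.int16))
```

Try changing `rate` to 10 or 40 to hear the individual clicks merge into a continuous buzz.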
There are other sources of sound.
We will just cover the second most important one, after voicing.
Again, here airflow from the lungs is the power source.
But this time, instead of completely blocking the flow at the vocal folds (which are at the bottom of the vocal tract) we'll force the airflow through a narrow gap somewhere in the vocal tract.
So let's make that constriction.
Air flows up from the lungs, and it's forced through this narrow gap.
When we force air through a narrow gap, it becomes turbulent.
The airflow becomes chaotic and random, and that means that the air pressure is varying chaotically and randomly.
And since sound is nothing more than pressure variation, that means we've generated sound!
So again, if we put a microphone just after that constriction and recorded the signal created by that chaotic, turbulent airflow, it looks something like this: random and without any discernible structure.
Certainly no repeating pattern.
That signal would sound like this.
It's noise.
Why don't you try making a narrow constriction somewhere in your vocal tract and pushing air through it from your lungs? I wonder how many different sounds you could make that way.
You can change the sound by putting the constriction in a different place.
I'll give you a few to start with.
I'm sure you can come up with many more.
These then are the two principal sources of sound in speech.
On the left, voicing.
That's the regular vibration of the vocal folds.
On the right, frication, which is the sound caused by turbulent airflow at a narrow constriction somewhere in the vocal tract.
There are a few other ways of making sound, but we don't really need them at this point.
These are going to be enough for a model of speech that will be able to generate any speech sound.
We saw everything in the time domain here.
We plotted lots of waveforms.
We've been talking about the sound source, and we now know why a speech waveform sometimes has a repeating pattern.
It's because the sound source itself was repeating.
We call such signals 'periodic', and you'll find that whenever there's a repeating pattern in the waveform, that can only be caused by voicing: the periodic vibration of the vocal folds.
Whenever there is voicing, you will also perceive a pitch.
Perhaps you could call that a musical note or a tone.
Pitch is controlled by the speaker's rate of vibration of the vocal folds.
So we could use pitch to help convey a message in speech.

The vocal folds block air flow from the lungs, burst open under pressure to create a glottal pulse, then rapidly close. This repeats, creating a periodic signal.

The most important source of sound in speech is the periodic vibration of the vocal folds.
Here's a reminder of the two principal sources of sound when producing speech.
Now let's introduce some engineering terms to describe their general properties.
On the left we have voicing.
That's the phonetic term: voicing means the vibration of the vocal folds and that results in a periodic signal.
Periodic signals are predictable.
You could continue the plot on the left and tell me what happens next.
We call this type of signal 'deterministic'.
On the other hand, frication results in a signal that has no periodicity: it's very unpredictable.
So we could say that that is 'aperiodic' or 'non-periodic'.
(They mean the same thing.)
Aperiodic signals are not predictable.
You cannot guess what happens next in the plot on the right.
So we could use some engineering terms.
Periodic signals are 'deterministic': we know what happens next.
Aperiodic or non-periodic signals are 'stochastic': we don't know what happens next, they're random.
Periodic signals are so important, we're going to take a closer look now.
All periodic signals have a repeating pattern.
In this particular signal, it's really obvious what that repeating pattern is.
We can also see exactly how often it repeats: every 0.01 s or 1/100th of a second.
We can use some notation to denote that.
We use the term T0 to denote the fundamental period.
T0 has a dimension of time and a unit of seconds.
Here we can see what T0 is.
It's the time it takes for this signal to repeat.
Here's another signal.
This is a very special signal called a sine wave.
This one has a fundamental period of 0.1 s or 1/10th of a second.
So if we repeated this cycle 10 times, we'd fill up a duration of exactly 1 s.
Another way of talking about the signal is, instead of saying that the fundamental period is 0.1 s, we can say that it has a fundamental frequency of 10.
We'll use the notation F0 to denote fundamental frequency, and that's just going to be equal to 1/T0.
But what are the units of frequency?
10 is the number of periods per second and the units of time are seconds.
The units of frequency are 1/s (1 over seconds) or, in scientific notation, 'seconds to the minus 1'.
That's a little bit of an awkward unit.
Since frequency is so important, we don't normally write down this unit, which we could say out loud as 'per second'.
We give it its own units of Hertz.
These are all equivalent.
The scientific unit of frequency is Hertz.
But just remember, it always means precisely the same as '1 over seconds' or 'seconds to the minus 1'.
The old-fashioned unit of frequency was actually very helpfully called 'cycles per second', but we don't use that anymore.
So what are the fundamental periods and the fundamental frequencies of these signals?
Sit down and work them out while you pause the video.
I hope you remembered to always give the units.
In the top row, we've got 10 cycles in one second.
So that's a fundamental period of 0.1 s and we've got then an F0 of 10 Hz.
I hope you got that right, with the units.
Top right, we've got a fundamental period of 0.01 and that gives us a frequency of 100.
But always write the units!
T0 = 0.01 s and F0 = 100 Hz.
Down on the bottom left, we've got a much higher frequency signal.
That's got a fundamental period of 0.0005 s and then it's got a fundamental frequency of 2000 Hz.
We could now use some more scientific notation, because once we're into the thousands we can start putting a multiplier prefix on the Hz.
Instead of writing 2000 Hz, we could write 2 kHz.
Those are the same thing.
Bottom right, it's a little bit trickier.
This is a speech signal, but it's pretty obvious what the fundamental period is here.
We can see a clear repeating pattern, and it's going to be from here to here, and so T0 is pretty close to 0.005 s to give us an F0 of 200 Hz.
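As a quick check of the relationship between fundamental period and fundamental frequency, here is a tiny Python sketch (my own, not from the video) applying F0 = 1/T0 to the values from the exercise.

```python
# F0 = 1 / T0, using the fundamental periods from the exercise above.
for T0 in [0.1, 0.01, 0.0005, 0.005]:    # seconds
    F0 = 1.0 / T0                        # Hertz
    print(f"T0 = {T0} s  ->  F0 = {F0:g} Hz")
# Output: 10 Hz, 100 Hz, 2000 Hz (= 2 kHz), and 200 Hz respectively.
```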
Periodic signals are very important as a sound source in speech.
Thinking about speech perception and about getting a deeper understanding of speech as a means of communicating a message, we'll find that periodic signals are perceived as having a pitch, or a musical tone, and that can be employed by speakers to convey part of the message.
Pitch is part of a collection of other acoustic features that speakers use, which collectively we call prosody.
Thinking, on the other hand, about signal processing and about getting a deeper understanding of speech signals - for example, so we can make a model of them - we're going to need to move out of the time domain and into the frequency domain, where we'll see that this very special periodic nature in the time domain has an equally special, distinctive property in the frequency domain, and that is harmonics.

Periodic signals are perceived as having a pitch. The physical property of fundamental frequency relates to the perceptual quantity of pitch.

Periodic signals have a very important perceptual property of pitch.
That means that periodic signals are perceived as having a musical note: a tone.
Here are some signals that are periodic.
They all have a repeating pattern.
And so we predict, just by looking at them, that when we listen to them, there will be a pitch to perceive.
There's the simplified glottal waveform we've seen before.
Well, not very pleasant, but it certainly has a pitch.
Here's a sine wave: that's a very pure, simple sound, again, with a very clearly perceived pitch.
Finally a short clip of a spoken vowel.
I'll play that again.
Again, a clear pitch can be perceived.
Pitch is a perceptual phenomenon.
We need to establish the relationship between the periodicity - a physical signal property, F0 (fundamental frequency) - and this perceptual property of pitch.
Let's do that by listening to some sine wave, some pure tones.
I'll play one at 220 Hz and then I'll play one at 440 Hz.
Hopefully, you have a musical enough ear to hear that that's an octave.
There's a clear musical relationship between the two.
The second one is perceived as having twice the pitch of the first.
So let's go up another 220 Hz and see what happens.
No, that's definitely not an octave!
You don't need to be a musician to know that.
So let's go up again.
That sounds like it might be an octave above 440.
So let's listen to octaves.
We've discovered something really important: that the relationship between the physical signal property F0 and the perceptual property pitch is not linear.
To perceive the same interval change in pitch - an octave - we don't need to add a fixed amount to the frequency: we need to double the frequency.
So this relationship between F0 and pitch is actually logarithmic.
It's non-linear.
That non-linearity is one aspect of a much more general property of our auditory system as a whole.
It is, in general, non-linear.
We can probably make use of that knowledge later on.
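To see this non-linear relationship in numbers, here is a small sketch (my own illustration, not from the video) using the standard formula that the interval between two frequencies is 12 × log2(f2/f1) semitones, where 12 semitones make one octave.

```python
# Equal pitch intervals correspond to frequency ratios, not to fixed differences.
import math

def semitones(f1, f2):
    # Standard music/psychoacoustics formula (not derived in the video).
    return 12 * math.log2(f2 / f1)

print(semitones(220, 440))   # 12.0 -> exactly one octave
print(semitones(440, 660))   # ~7.0 -> adding another 220 Hz is not an octave
print(semitones(440, 880))   # 12.0 -> doubling the frequency again is another octave
```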
So for speech, where the pitch is varying in interesting ways because it might be carrying part of the message, we would need to measure the local value of F0 and then plot how that changes against time.
Now, because there's a very simple relationship between F0 and pitch, you'll find the two terms actually used interchangeably in our field.
But that's not technically correct!
They are not the same thing.
F0 is a physical property: it's the rate of vibration of the vocal folds.
We could measure that if we had access to the speaker's vocal folds.
Or we could estimate it automatically from a signal.
Here's some software that will do that.
It's called Praat.
Other software can also do the same thing.
It will make that measurement of F0 for you.
In fact, Praat calls it pitch, even though it's estimating F0!
But it's very important to remember the software does not have access to the speaker's vocal folds.
It can only estimate F0 from the speech signal, using some algorithm.
That's a non-trivial estimation, so you must always be aware that there will be errors in the output of any F0 estimation algorithm.
This is not truth: this is an estimate.
'Nothing's impossible'
The term pitch really then is about the perceptual phenomenon.
It only exists in the mind of a listener, and so to do experiments about pitch would have to involve humans listening to speech.
Experiments about F0 could be done on speech signals analytically.
So speakers can control the fundamental frequency as well as the duration and the amplitude of the speech sounds they produce.
They can use all of those acoustic properties - and others - to convey parts of the message to a listener.
We use the term 'prosody' to refer collectively to the fundamental frequency, the duration, and the amplitude of speech sounds (sometimes also voice quality).
Later, then, when we attempt to generate synthetic speech, we'll have to give it an appropriate prosody if we want it to sound natural.

To do speech processing with a computer, we need to convert sound first to an analogue electrical signal, and then to a digital representation of that signal.

The very first step in processing speech is to capture the signal in the time domain and create a digital version of it in our computer.
Here's a microphone being used to measure a sound wave.
The output of the microphone is an analogue voltage.
We need to convert that to a digital signal so we can both store it and then manipulate it with a computer.
How do we convert an analogue signal to a digital one?
In other words, what is happening here?
In the analogue domain, things move smoothly and continuously.
Look at the hands on this analogue watch, for example.
It tells the time with infinite precision, not just to the nearest second.
But in stark contrast, in the digital domain, there are a fixed number of values that something can take: there is finite precision.
So this digital clock only tells the time to the nearest second.
It has made time discrete, and that's an approximation of reality.
So why can't computers store analogue values?
It's because computers only store binary numbers, nothing else.
Everything has to be represented as a binary number.
It has to be placed in the finite amount of storage available inside the computer.
So, for our waveform, there are two implications of that.
1) we have to represent the amplitude of the waveform with some fixed precision, because it's going to have to be a binary number.
2) we can only store that amplitude a finite number of times per second, otherwise we would need infinite storage.
Start by considering these binary numbers.
With one bit, we have two possible values.
With two bits, we get four values.
With three bits, we get eight values, and so on.
The amplitude of our waveform has to be stored as a binary number.
But let's first consider making time digital: making time discrete.
Let's zoom into this speech waveform.
It appears to be smooth and continuous, but zoom in some more, and keep zooming in, and eventually we'll see that this waveform has discrete samples.
The line joining up the points on this plot is just to make it pretty; it's to help you see the waveform.
In reality, the amplitude is only stored at these fixed time intervals.
Each point in this plot is a sample of the waveform.
Let's first decide how frequently we should sample the waveform.
I'm drawing this sine wave in the usual way with a line joining up the individual samples, and you can't see those samples, so I'll put a point on each sample.
This is sampled so frequently, we can barely see the individual points.
But let's reduce the sampling rate.
There are fewer samples per second, and now you can see the individual samples.
Remember, the line is just a visual aid: the waveform's value is defined only at those sample points.
Keep reducing the sampling rate, and that's as far as we can go.
If we go any lower than this, we won't be able to store the sine wave.
It won't go up and down once per cycle.
We have discovered that, to store a particular frequency, we need to have at least two samples per cycle of the waveform.
Another way of saying that is: the highest frequency that we can capture is half of the sampling frequency.
That's a very special value, so special it has a name, and it's called the Nyquist frequency.
A digital waveform cannot contain any frequencies above the Nyquist frequency, and the Nyquist frequency is just half the sampling frequency.
But what would happen then, if we did try to sample a signal whose frequency is higher than the Nyquist frequency?
Here's a sine wave: let's sample it at a rate of less than two samples per cycle, so that its frequency is above the Nyquist frequency.
To make it easier to see what's happening, I'm going to draw a line between these points.
This waveform doesn't look anything like the original sine wave!
We've created a new signal that's definitely not a faithful representation of the sine wave.
This effect of creating a new frequency, which is related to the original signal and to the sampling frequency, is called aliasing.
It's something to be avoided!
Whenever we sample an analogue signal, we must first remove all frequencies above the Nyquist frequency, otherwise we'll get aliasing.
We must also do that if we take a digital signal like this one and reduce its sampling frequency.
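Here is a small sketch (my own, with illustrative frequencies) showing aliasing numerically: a 7 kHz sine sampled at 8 kHz produces exactly the same samples, up to a sign flip, as a 1 kHz sine, so it would be heard at 1 kHz.

```python
# Aliasing: a tone above the Nyquist frequency masquerades as a lower frequency.
import numpy as np

fs = 8000                                   # sampling rate in Hz, so Nyquist = 4 kHz
n = np.arange(fs)                           # one second of sample indices
high = np.sin(2 * np.pi * 7000 * n / fs)    # 7 kHz: above the Nyquist frequency
low  = np.sin(2 * np.pi * 1000 * n / fs)    # 1 kHz: below the Nyquist frequency

print(np.allclose(high, -low))              # True: the 7 kHz tone has aliased to 1 kHz
```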
Let's listen to the effect of sampling frequency.
These are all correctly-sampled signals.
We've removed everything above the Nyquist frequency before changing the sampling rate.
For speech, a sampling rate of 16 kHz is adequate, and that sounds fine.
Let's listen to reducing the sampling rate.
We've lost some of the high frequencies.
We've lost even more of the high frequencies.
And even more of them.
Even at this very low sampling rate of 4 kHz, speech is still intelligible.
We can still perceive pitch, but we've lost some of the sounds.
The fricatives are starting to go because they're at higher frequencies.
Hopefully, you've noticed that I've been using a few different terms interchangeably.
I've said 'sampling frequency', I've said 'sampling rate', or perhaps just 'sample rate'.
Those are all interchangeable terms that mean the same thing.
So we've dealt with making time discrete.
That's the most important decision: to choose the sampling rate.
For Automatic Speech Recognition, 16 kHz will be just fine, but for Speech Synthesis typically we'd use a higher sampling rate than that.
Let's turn over to making amplitude digital or amplitude discrete.
Here's a waveform that I've sampled: I've chosen the sampling rate and we have samples evenly spaced in time.
Now I've got to write down the value of each sample, and I've got to write that down as a binary number, and that means I have to choose how many bits to use for that binary number.
Maybe I'll choose to use two bits, and that will give me four levels.
So each one of these samples would just have to be stored as the nearest available value: that's called quantisation.
We need to choose a bit depth, but there is a very common value, and that's 16 bits per sample, and that gives us 2 to the power 16 available discrete levels.
We have to use them to span both the negative and positive parts of the amplitude axis.
So just sometimes in some pieces of software, you might see the amplitude axis labelled with a sample value.
That would go from -32,768 up to +32,767 because one of the values has to be zero.
The number of bits used to store each sample is called the bit depth.
Let's listen to the effect of changing the bit depth, and in particular reducing it from this most common value of 16 bits to some smaller value.
That sounds absolutely fine.
That sounds pretty good.
Listen on headphones, and you might hear small differences.
That sounds pretty nasty.
Brace yourself: we're going down to two bits...
Pretty horrible!
It's quite different though to the effect of changing the sampling frequency.
Reducing the bit depth is like adding noise to the original signal.
In fact, it is adding noise to the original signal because each sample has to be moved up or down to the nearest possible available value.
With fewer bits, there are fewer values and therefore more noise is introduced, noise being the error between the quantised signal and the original.
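Here is a sketch (my own, with an assumed test signal) that quantises a waveform to a chosen bit depth and measures the error introduced, i.e. the quantisation noise.

```python
# Quantisation: round each sample to the nearest available level for a given bit depth.
import numpy as np

fs = 16000
t = np.arange(fs) / fs
x = 0.8 * np.sin(2 * np.pi * 200 * t)            # test signal with amplitude in [-1, 1]

def quantise(signal, bit_depth):
    max_int = 2 ** (bit_depth - 1)               # e.g. 32768 for 16 bits
    ints = np.clip(np.round(signal * max_int), -max_int, max_int - 1)
    return ints / max_int                        # back to floats at the quantised levels

for bits in [16, 8, 4, 2]:
    noise = x - quantise(x, bits)                # the error is the quantisation noise
    print(f"{bits} bits: peak error = {np.max(np.abs(noise)):.5f}")
```

With 16 bits the peak error is tiny; with 2 bits it is a substantial fraction of the signal's amplitude, which is the noise you hear.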
Very rarely do we bother reducing the bit depth, and we stick with 16 bits for almost everything.
With two bits, we can actually see those values.
If we look carefully at this waveform, we can see there are only four different values within the waveform.
Those relate to the four possible values we get with two binary bits.
We started in the time domain, with an analogue signal provided by a microphone.
That's an analogue of the pressure variation measured by that microphone.
But now we have a digital version of that signal.
Going digital means that we can now do all sorts of sophisticated, exciting operations on the signal using a computer.
That's extremely convenient.
But you must always be aware that the digital signal has limitations.
We have made approximations.
The most important limitation to always bear in mind is the sampling frequency.
That's something we might want to vary, depending on our application.
Bit depth is also something to bear in mind, but in practical terms and for our purposes, we're just going to use a universal bit depth of 16.
That's plenty: the quantisation noise is negligible, and we won't generally be varying that value.
Now we have a digital signal, we're ready to do some speech processing.
One of the most important processes is to use Fourier analysis to take us from the time domain to the frequency domain.

Because speech sounds change over time, we need to analyse only short regions of the signal. We convert the speech signal into a sequence of frames.

Short-term analysis is the first step that takes us out of the time domain and into some other domain, such as the frequency domain.
Here's a spoken word.
Clearly, its properties vary over time.
The amplitude varies.
Or, for example, some parts are voiced: this is voiced, this is voiced.
But other parts are not: this part's unvoiced.
Apart from measuring the total duration, it makes no sense to analyse any other properties of a whole utterance.
For example, F0 doesn't just vary over time, it only exists in the voiced regions and doesn't even exist in the unvoiced parts.
Because short-term analysis is the first step, in general we need to perform it without knowing anything about the waveform.
For example, in Automatic Speech Recognition, the analysis takes place before we know what words have been spoken.
So we can't do the following: we can't segment the speech into linguistically-meaningful units and then perform some specific analysis, for example, on this voiced fricative, or this vowel, or this unvoiced fricative.
Rather, we need to use a general-purpose method, which doesn't require any knowledge of the contents of the signal.
To do that, we're going to just divide the signal into uniform regions and analyse each one separately.
These regions are called frames and they have a fixed duration, and that duration is something we have to choose.
Here's the plan: we'll take our whole utterance and we'll zoom into some shorter region of that and perform some analysis.
Then we shift forward in time, analyse that region, then move forward again, analyse that region, and so on, working from the start of the utterance to the end in some fixed steps.
The first thing to decide is how much waveform to analyse at any one time.
The waveform in front of you clearly substantially varies its properties, so we need a shorter region than that.
We'll define a frame of the waveform first by choosing a window function and then multiplying the waveform by this window function.
My window here is the simplest possible one: a rectangular window that is zero everywhere except within the frame I wish to analyse, where it's got a value of one.
We multiply the two, sample by sample, and obtain a frame of waveform that's - if you like - "cut out" of the whole utterance.
This cut-out fragment of waveform is called a frame.
We're then going to move forward a little bit in time and cut out another frame for analysis.
So here's the process:
Cut out a frame of waveform: that's ready for some subsequent analysis.
Move forward some fixed amount in time, cut out another frame, and so on to get a sequence of frames cut out of this waveform.
That's done simply by sliding the window function across the waveform.
Let's take a closer look at one frame of the waveform.
Because I've used the simplest possible rectangular window, we've accidentally introduced something into the signal that wasn't there in the original.
That's the sudden changes at the edge of the signal.
These are artefacts: that means something we introduced by our processing, that's not part of the original signal.
If we analysed this signal we'd not only be analysing the speech but also those artefacts.
So we don't generally use rectangular window functions because these artefacts are bad, but rather we use tapered windows.
When we cut out a frame, it doesn't look like this, but it's cut out with a window function that tapers towards the edges.
Think of that as a fade-in and a fade-out.
That gives us a frame of waveform that looks like this: it doesn't have those sudden discontinuities at the edges.
Here's the complete process of extracting frames from a longer waveform using a tapered window.
Typical values for speech will be a frame duration of 25 ms and a frame shift of something less than that, and that's because we're using these tapered windows.
To avoid losing any waveform, we need to overlap the analysis frames.
So, we'll extract one frame, then the next, and the next, each one having a duration of 25 ms and each one being 10 ms further on in time than the previous one.
We've converted our complete utterance into a sequence of frames.
This representation - the sequence of frames - is the basis for almost every possible subsequent analysis that we might perform on a speech signal, whether that's estimating F0 or extracting features for Automatic Speech Recognition.
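Here is a minimal sketch (my own) of exactly this framing process: 25 ms frames with a 10 ms shift, as in the video, using a Hamming window as one common choice of tapered window (the video only says 'tapered', so that particular choice is an assumption).

```python
# Cut a waveform into overlapping, tapered frames: 25 ms duration, 10 ms shift.
import numpy as np

def make_frames(waveform, fs, frame_ms=25, shift_ms=10):
    frame_len = int(fs * frame_ms / 1000)        # 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)            # 160 samples at 16 kHz
    window = np.hamming(frame_len)               # tapered: fades in and fades out
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, shift):
        frames.append(waveform[start:start + frame_len] * window)
    return np.array(frames)

fs = 16000
speech = np.random.randn(fs)                     # stand-in for one second of speech
print(make_frames(speech, fs).shape)             # (number of frames, 400)
```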
With a speech utterance broken down now into a sequence of frames, we're ready to actually do some analysis.
Our first, and most important, destination is the frequency domain, which escapes many of the limitations of doing analysis directly in the time domain.
Now, to get to the frequency domain, we're going to use Fourier analysis and that will be introduced actually in two stages.
First we'll have the rather abstract concept of Series Expansion, and then we'll use that to explain Fourier analysis itself.

Speech is hard to analyse directly in the time domain. So we need to convert it to the frequency domain using Fourier analysis, which is a special case of series expansion.

Because speech changes over time, we've already realised that we need to analyse it in the short-term.
We need to break it into frames and perform analysis frame by frame.
One of the most important analyses is to get into the frequency domain.
We're going to use Fourier analysis to do that, but we're going to introduce that in two stages.
So the first topic is actually going to seem a little abstract.
There's a reason for introducing series expansion as an abstract concept, and that's because it has several different applications.
The most important of those is Fourier analysis, but there will be others.
Here's a speech-like signal that we'd like to analyse.
So maybe we should really say what 'analyse' means.
Well, how about a concrete example?
I'd like to know what this complicated waveform is 'made from'.
One way to express that is to say that it's a sum of very simple waves.
So we're going to expand it into a summation of the simplest possible waves.
That's what series expansion means.
For the purpose of the illustrations here, I'm just going to take a single pitch period of this waveform.
That's just going to make the pictures look a little simpler.
But actually everything we're going to say applies to any waveform.
It's not restricted to a single pitch period.
I'm going to express this waveform as a sum of simple functions.
Those are called basis functions and as the basis function I'm going to use the simplest possible periodic signal there is: the sine wave.
So let's try making this complex wave by adding together sine waves.
This doesn't look much like a sine wave, so maybe you think that's impossible.
Not at all!
We could make any complex wave by adding together sine waves.
I've written some equations here.
I've written that the complex wave is approximately equal to something that I'm going to define.
Let's try using just one sine wave to approximate this.
Try the sine wave here, and the result is not that close to the original.
So let's try adding a second basis function at a higher frequency.
Here it is: now if we add those two things together, we get a little closer.
I'll add a third basis function at a higher frequency still, and we get a little closer still, and the fourth one, and we get really quite close to the original.
It's not precisely the same, but it's very close.
I've only used four sine waves there.
The first sine wave has the longest possible fundamental period: it makes one cycle in the analysis window.
The second one makes two cycles.
The third one makes three cycles, then four cycles and so on.
So they form a series.
Now, I can keep adding terms to my summation to get as close as I want to the original signal.
So let's keep going.
I'm not going to show every term because there's a lot of them.
But we keep adding terms and eventually, by adding enough terms going up to a high enough frequency, we will reconstruct exactly our original signal.
Now we're not just approximating the signal: the sum is actually equal to it.
Theoretically, if this was all happening with analogue signals, I might need to add together an infinite number of terms to get exactly the original signal.
But these are digital signals.
That means that this analysis frame has a finite number of samples in it.
This waveform is sampled at 16 kHz and it lasts 0.01 s.
That means there are 160 samples in the analysis frame.
Because there's a finite amount of information, I only need a finite number of basis functions to exactly reconstruct it.
Another way of saying that is that these basis functions are also digital signals, and the highest possible frequency one is the one at the Nyquist frequency.
So if I sum up basis functions all the way up to that highest possible frequency one, I will exactly reconstruct my original signal.
So what exactly have we achieved by doing this?
We've expressed the complex wave on the left as a sum of basis functions, each of which is a very simple function: it's a sine wave, at increasing frequency.
We've had to add together a very specific amount of each of those basis functions to make that reconstruction.
We need 0.1 of this one and 0.15 of this one and 0.25 of this one and 0.2 of this one and just a little bit of this one, and whatever the terms in between might be, to exactly reconstruct our signal.
This set of coefficients exactly characterises the original signal, for a fixed set of basis functions.
Because we can choose how many terms we have in the series - we can go as far as we like down the series but then stop anywhere we like - we actually get to choose how closely we represent the original signal.
Perhaps all of this fine detail on the waveform is not interesting: it's not useful information.
Maybe it's just noise, and we'd like to have a representation of this signal that removes that noise.
Well, series expansion gives us a principled way to do that.
We can just stop adding terms.
This signal might be a noisy signal and this signal is a denoised version of that signal.
It removes the irrelevant information (if that's what we think it is).
The main point of understanding series expansion is as the basis of Fourier analysis, which transforms a time domain signal into its frequency domain counterpart.
But we will find other uses for series expansion, such as the one we just saw, of truncation to remove unnecessary detail from a signal.
What we learned here is not restricted to only analysing waveforms.
There's nothing in what we did that relies on the horizontal axis being labelled with time: it could be labelled with anything else.
Fourier analysis will then do what we've been trying to do for some time: to get us from the time domain into the frequency domain where we can do some much more powerful analysis and modelling of speech signals.
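Here is a sketch of the truncation idea (my own illustration, using NumPy's FFT as a stand-in for the series expansion described above): we keep only the lowest-frequency terms of the series and reconstruct, which discards the fine, noisy detail.

```python
# Truncating the series: keep only the first few terms, then reconstruct.
import numpy as np

fs = 16000
t = np.arange(160) / fs                             # one 0.01 s analysis frame (160 samples)
clean = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
noisy = clean + 0.2 * np.random.randn(len(t))       # add some irrelevant fine detail

coeffs = np.fft.rfft(noisy)                         # the series coefficients
coeffs[10:] = 0                                     # stop adding terms after the first 10
smoothed = np.fft.irfft(coeffs, n=len(t))           # reconstruct from the truncated series

print(np.abs(smoothed - clean).mean() < np.abs(noisy - clean).mean())   # almost always True
```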

We can express any signal as a sum of sine waves that form a series. This takes us from the time domain to the frequency domain.

We're now going to use a series expansion approach to get from a digital signal in the time domain to its frequency domain representation.
All the signals are digital.
We understand how series expansion works, in a rather abstract way.
We're now going to make that concrete, as Fourier analysis.
Let's just recap series expansion.
We saw how it's possible to express a complex wave, for example this one, as a sum of simple basis functions.
We wrote the complex wave as equal to a weighted sum of sine waves.
Let's write that out a little more correctly.
We have some coefficient - or a weight - times a basis function.
These basis functions have a unit amplitude, so they're scaled by their coefficient and then added together.
We add some number of basis functions in a series, to exactly reconstruct the signal we're analysing.
So this is the analysis: the summation of basis functions weighted by coefficients.
Notice how those basis functions - those sine waves - are a series.
Each one has one more cycle in the analysis frame than the previous one.
These are coefficients and we now need to think about how to find those coefficients, given only the signal being analysed and some pre-specified set of basis functions.
Here is a series of basis functions: just the first four to start with.
I want you to write down their frequencies and work out the relationship between them.
Pause the video.
The duration of the analysis window is 0.01 s.
The lowest frequency basis function makes one cycle in that time, meaning it has a frequency of 100 Hz.
I'm going to start writing units correctly: we put a space between the number and the units.
The second one makes two cycles in the same amount of time, so it must have a frequency of 200 Hz.
The next one makes three cycles, that's 300 Hz.
And 400 Hz.
I hope you got those values.
It's just an equally-spaced series of sine waves, starting with the lowest frequency and then all the multiples of that, evenly spaced.
If I tell you now that the sampling rate is 16 kHz, there are lots more basis functions to go.
What's the highest frequency basis function that you can have?
Pause the video.
Well, we know from digital signals that the highest possible frequency we can represent is at the Nyquist frequency, which is simply half the sampling frequency.
The Nyquist frequency here would be 8 kHz.
We'd better zoom in so we can actually see that.
There we go: this waveform here has a frequency of 8 kHz.
We can't go any higher than that.
Fourier analysis simply means finding the coefficients of the basis functions.
We need somewhere to record the results of our analysis, so I've made some axes on the right.
This horizontal axis is going to be the frequency of the basis function.
Because we're going to go up to a basis function at 8000 Hz (that's 8 kHz), I'll give that units of kHz.
On the vertical axis, I'm going to write the value of the coefficient.
I'm going to call that magnitude.
Here's the lowest frequency basis function.
It's the one at 100 Hz.
So I'm going to plot on the right at 100 Hz (that's 0.1 kHz, of course) how much of this basis function we need to use to reconstruct our signal.
How do we actually work that amount out?
We're going to look at the similarity between the basis function and the signal being analysed.
That's a quantity known as correlation.
That's achieved simply by multiplying the two signals sample by sample.
So we multiply this sample by this sample and add it to this sample by this sample, and this sample by this sample, and so on and add all of that up.
That will give us a large value when the two signals are very similar.
In this example, if I do that for this lowest frequency basis function, I'm going to get a value of 0.1.
Let's put some scale on this.
Then I'll do that for the next basis function.
That's going to be at 0.2 kHz and I do the correlation and I find out that I need 0.15 of this one.
Then I do the next one, 0.3 kHz, and I find that I need 0.25 of this one.
Then the next one, 0.4 kHz, and I find that I need 0.2 of that one; and so on.
I've plotted a function on the right.
Let's just join the dots to make it easier to see.
This is called the spectrum.
It's the amount of energy in our original signal at each of the frequencies of the basis functions.
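Here is a sketch (my own) of exactly this procedure, using the coefficient values from the example (0.1, 0.15, 0.25, 0.2) to build a test signal and then recovering them by correlation. All the components are given zero phase, because phase is only dealt with in the next video; the factor of 2/N is a normalisation, since the sum of sin² over a whole frame is N/2.

```python
# Finding each coefficient by correlating the signal with the basis function.
import numpy as np

N = 160                                        # 0.01 s frame at 16 kHz
n = np.arange(N)

def basis(k):
    return np.sin(2 * np.pi * k * n / N)       # unit-amplitude sine, k cycles per frame

true_coeffs = {1: 0.1, 2: 0.15, 3: 0.25, 4: 0.2}
signal = sum(c * basis(k) for k, c in true_coeffs.items())

for k in range(1, 6):
    correlation = np.dot(signal, basis(k))     # multiply sample by sample and add up
    coefficient = 2 * correlation / N          # normalise (sum of sin^2 over a frame is N/2)
    print(f"{k * 100} Hz: {coefficient:.3f}")
# Recovers 0.1, 0.15, 0.25 and 0.2 at 100-400 Hz; the 500 Hz coefficient is (essentially) zero.
```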
We now need to talk about a technical but essential property of Fourier analysis, where the basis functions are sine waves (in other words pure tones).
They contain energy at one and only one frequency.
That means that any two sine waves in our series are orthogonal.
Let's see what that means.
Take a pair of basis functions: any pair.
I'll take the first two, and work out the correlation between these two signals.
So multiply them sample by sample: this one by this one, this one by this one, and so on, and work out that sum.
For this pair, it will always be zero.
We can see that simply by symmetry.
These two signals are orthogonal.
There is no energy at this frequency contained in this signal, and vice versa.
The same thing will happen for any pair.
The correlation between them is zero.
There is no energy at this frequency in this waveform, and vice versa.
This property of orthogonality between the basis functions means that when we decompose a signal into a weighted sum of these basis functions, there is a unique solution to that.
In other words, there is only one set of coefficients that works.
That uniqueness is very important.
It means that there's the same information in the set of coefficients as there is in the original signal.
It's also easy to invert this transform.
We could just take the weighted sum of basis functions and get back the original signal perfectly.
So Fourier analysis, then, is perfectly invertible, and gives us a unique solution.
We could go from the time domain to the frequency domain, and back to the time domain as many times as we like, and we lose no information.
We've covered, then, the essential properties of Fourier analysis.
It uses sine waves as the basis function.
There is a series of those from the lowest frequency one (and that frequency will be determined by the duration of the analysis window) up to the highest frequency one (and that will be determined by the Nyquist frequency).
We said Fourier 'analysis', but this conversion from time domain to frequency domain is often called a 'transform'.
So from now on we'll more likely say the 'Fourier transform'.
The Fourier transform is what's going to now get us into the frequency domain.
That's one of the most powerful and widely used transformations in speech processing.
We do a lot more processing in the frequency domain than we ever do in the time domain.

We complete our understanding of Fourier analysis with a look at the phase of the component sine waves, and the effect of changing the analysis frame duration.

You already know the essential features of Fourier analysis, but we've glossed over a little detail called phase.
So we need to now clarify that, as well as then using Fourier analysis to transform any time domain signal into its spectrum: more correctly, its magnitude spectrum as we're going to see.
From now on we'll be calling this the Fourier transform.
We take a time domain signal.
We break it into short analysis frames.
On each of those, we perform a series expansion.
The series of basis functions are made of sine waves of increasing frequency.
That is Fourier analysis and that gets us now into the frequency domain.
In the general case, we need to worry not just about the magnitude of each of the basis functions, but also something called phase.
Consider this example here.
Look at the waveform on the left.
Look at the basis functions on the right.
See if you can come up with a set of coefficients that would work.
Well, fairly obviously you cannot, because all the basis functions are zero at time zero.
We're trying to construct a non-zero value at time zero, and there are no weights in the world that will give us that.
So there's something missing here.
This diagram is currently untrue, but I can make it true very easily by just shifting the phase of the basis function on the right.
Now the diagram is true!
Phase is simply the point in the cycle where the waveform starts.
Another way to think about it is that we can slide the basis functions left and right in time.
So when we are performing Fourier analysis, we don't just need to calculate the magnitude of each of the basis functions, but also their phase.
But does phase matter?
I mean, what does phase mean?
Is it useful for anything?
Here I've summed together four sine waves with this set of coefficients to make a waveform that is speech-like.
There's one period of it on the right.
I'm going to play you a longer section of that signal.
OK, it's obviously not real speech!
I mean, I just made it by adding together these sine waves with those coefficients.
But it's got some of the essential properties of speech.
For example, it's got a perceptible pitch, and it's not a pure tone.
I'm going to use the same set of coefficients, but I'm going to change the phases of the basis functions.
So, exactly the same basis functions, they just start at different points in their cycle.
The resulting signal now looks very different to the original signal.
Do you think it's going to sound different?
Well, let's find out.
No, it sounds exactly the same to me.
Our hearing is simply not sensitive to this phase difference.
So for the time being, we're just going to say that phase is not particularly interesting.
Our analysis of speech will only need to worry about the magnitudes.
In other words, these are the important parts.
These phases - exactly where these waveforms start in their cycle - are a lot less important.
In fact, we're just going to neglect the phase from now on.
If we plot just those coefficients, we get the spectrum, and that's what I've done here.
On the left is the original signal, and its magnitude spectrum.
On the right is the signal with different phases, but the same magnitudes: its magnitude spectrum is identical.
We'll very often hear this called the spectrum, but more correctly we should always say the 'magnitude spectrum' to make it clear that we've discarded the phase information.
Something else that's very important that we can learn from this picture is that in the time domain two signals might look very different, but in the magnitude spectrum domain, they're the same.
Now that's telling us that the time domain might not be the best way to analyse speech signals.
The magnitude spectrum is the right place.
Because the amount of energy at different frequencies in speech can vary a lot - it's got a very wide range - the vertical axis of a magnitude spectrum is normally written on a log scale and we give it units of decibels.
This is a logarithmic scale.
But like the waveform, it's uncalibrated because, for example, we don't know how sensitive the microphone was.
It doesn't really matter because it's all about the relative amount of energy at each frequency, not the absolute value.
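Here is a minimal sketch (my own, with an assumed frame) of computing a magnitude spectrum on a decibel scale with NumPy's FFT; the small constant inside the log just avoids taking the log of zero.

```python
# The magnitude spectrum of one frame, on a decibel (logarithmic) scale.
import numpy as np

fs = 16000
frame = np.random.randn(400) * np.hamming(400)      # stand-in for one 25 ms frame

magnitude = np.abs(np.fft.rfft(frame))              # keep the magnitude, discard the phase
magnitude_db = 20 * np.log10(magnitude + 1e-10)     # decibels: a logarithmic scale
freqs = np.fft.rfftfreq(len(frame), d=1/fs)         # 0 Hz up to the Nyquist frequency

print(freqs[-1])                                    # 8000.0 Hz: the Nyquist frequency
```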
Back to the basis sine waves for a moment.
They start from the lowest possible frequency, with just one cycle fitting the analysis frame, and they go all the way up to the highest possible frequency, which is the Nyquist frequency.
They're spaced equally and the spacing is equal to the lowest frequency.
Here it's 100 Hz.
It's 100 Hz because the analysis frame is 1/100th of a second.
So what happens if we make the analysis frame longer?
Imagine we analyse a longer section of speech than 1/100th of a second.
Have a think about what happens to the set of basis functions.
Pause the video.
Well, if we've got a longer analysis window, that means that the lowest frequency sine wave that fits into it with exactly one cycle will be at a lower frequency, so this frequency will be lower.
We know that the series are equally spaced at that frequency, so they'll all go lower and they'll be more closely spaced.
But we also know that the highest frequency one is always at the Nyquist frequency.
So if the lowest frequency basis function is of a lower frequency and they're more closely spaced, then we'll just have more basis functions fitting into the range up to the Nyquist frequency.
A longer analysis frame means more basis functions.
This will be easier to understand if we see it working in practice.
Here I've got a relatively short analysis frame and on the right, I'm showing its magnitude spectrum: that's calculated automatically with the Fourier transform.
Let's see what happens as the analysis frame gets larger.
Can you see how a bit more detail appeared in the magnitude spectrum?
Let's zoom out some more, and even more detail appeared.
In fact there's so much detail, we can't really see it now.
So what I'm going to do is I'm actually going to just show you a part of the frequency axis: a zoomed-in part.
The spectrum still goes up to the Nyquist frequency, but I'm just going to show you the lower part of that, so we see more detail.
So there's the very short analysis frame and its magnitude spectrum.
Zoom out a bit and a bit more detail appears in the magnitude spectrum.
We make the analysis frame longer still, and we get a lot more detail in the magnitude spectrum.
So a longer analysis frame means that we have more components added together (more basis functions), therefore more coefficients.
Remember that the coefficients are just spaced evenly along the frequency axis up to the Nyquist frequency, and so we're just going to get them closer together as we make the analysis frame longer, so we see more and more detail in the magnitude spectrum.
Analysing more signal gives us more detail on the spectrum.
This sounds like a good thing, but of course that spectrum is for the entire analysis frame.
It's effectively the average composition of a signal within the frame.
So a larger analysis frame means we're able to be less precise about where in time that analysis applies, so we get lower time resolution.
So it's going to be a trade-off.
Like in all of engineering, we have to make a choice.
We have to choose our analysis frame to suit the purpose of the analysis.
It's a design decision.
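Here is a small sketch (my own, assuming a 16 kHz sampling rate) that makes the trade-off concrete: a longer analysis frame gives more, more closely spaced coefficients, but each spectrum then describes a longer stretch of time.

```python
# Longer analysis frames give more closely spaced coefficients (finer frequency detail).
import numpy as np

fs = 16000                                           # sampling rate in Hz (assumed)
for frame_ms in [5, 25, 100]:
    n_samples = int(fs * frame_ms / 1000)
    freqs = np.fft.rfftfreq(n_samples, d=1/fs)       # frequencies of the coefficients
    print(f"{frame_ms} ms frame: {len(freqs)} coefficients, spaced {freqs[1]:.0f} Hz apart")
# 5 ms frame: 41 coefficients, spaced 200 Hz apart
# 25 ms frame: 201 coefficients, spaced 40 Hz apart
# 100 ms frame: 801 coefficients, spaced 10 Hz apart
```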
The next steps involve finding, in the frequency domain, some evidence of the periodicity in the speech signal: the harmonics.
But that will only be half the story, because we haven't yet thought about what the vocal tract does to that sound source.
Our first clue to that will come from the spectral envelope.
So we're going to look at two different properties in the frequency domain.
We see both of them together in the magnitude spectrum, one being the spectral envelope, and the other being the harmonics.
