Digital signal

To do speech processing with a computer, we need to convert sound first to an analogue electrical signal, and then to a digital representation of that signal.

The very first step in processing speech is to capture the signal in the time domain and create a digital version of it in our computer.
Here's a microphone being used to measure a sound wave.
The output of the microphone is an analogue voltage.
We need to convert that to a digital signal so we can both store it and then manipulate it with a computer.
How do we convert an analogue signal to a digital one?
In other words, what is happening here?
In the analogue domain, things move smoothly and continuously.
Look at the hands on this analogue watch, for example.
It tells the time with infinite precision, not just to the nearest second.
But in stark contrast, in the digital domain, there are a fixed number of values that something can take: there is finite precision.
So this digital clock only tells the time to the nearest second.
It has made time discrete, and that's an approximation of reality.
So why can't computers store analogue values?
It's because computers only store binary numbers, nothing else.
Everything has to be represented as a binary number.
It has to be placed in the finite amount of storage available inside the computer.
So, for our waveform, there are two implications of that.
1) we have to represent the amplitude of the waveform with some fixed precision, because it's going to have to be a binary number.
2) we can only store that amplitude a finite number of times per second, otherwise we would need infinite storage.
Start by considering these binary numbers.
With one bit, we have two possible values.
With two bits, we get four values.
With three bits, we get eight values, and so on.
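To make that pattern concrete, here is a tiny Python sketch (just an illustration, not one of the lecture's own demos): every extra bit doubles the number of values available.

```python
# Each additional bit doubles the number of values a binary number can represent.
for bits in (1, 2, 3, 8, 16):
    print(f"{bits} bit(s) -> {2 ** bits} possible values")
```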
The amplitude of our waveform has to be stored as a binary number.
But let's first consider making time digital: making time discrete.
Let's zoom into this speech waveform.
It appears to be smooth and continuous, but zoom in some more, and keep zooming in, and eventually we'll see that this waveform has discrete samples.
The line joining up the points on this plot is just to make it pretty; it's to help you see the waveform.
In reality, the amplitude is only stored at these fixed time intervals.
Each point in this plot is a sample of the waveform.
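If you want to see those discrete samples for yourself, here is a minimal Python sketch (assuming scipy is installed; the file name speech.wav is just a placeholder for any mono WAV file) that reads a digital waveform and prints its sampling rate and the first few sample values.

```python
from scipy.io import wavfile

# 'speech.wav' is a hypothetical file name: substitute any mono WAV file.
sampling_rate, samples = wavfile.read("speech.wav")

print("sampling rate:", sampling_rate, "Hz")
print("number of samples:", len(samples))
print("first 10 samples:", samples[:10])   # the waveform is defined only at these points
print("duration:", len(samples) / sampling_rate, "seconds")
```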
Let's first decide how frequently we should sample the waveform.
I'm drawing this sine wave in the usual way with a line joining up the individual samples, and you can't see those samples, so I'll put a point on each sample.
This is sampled so frequently, we can barely see the individual points.
But let's reduce the sampling rate.
There are fewer samples per second, and now you can see the individual samples.
Remember, the line is just a visual aid: the waveform's value is defined only at those sample points.
Keep reducing the sampling rate, and that's as far as we can go.
If we go any lower than this, we won't be able to store the sine wave.
It won't go up and down once per cycle.
We have discovered that, to store a particular frequency, we need to have at least two samples per cycle of the waveform.
Another way of saying that is: the highest frequency that we can capture is half of the sampling frequency.
That's a very special value, so special it has a name, and it's called the Nyquist frequency.
A digital waveform cannot contain any frequencies above the Nyquist frequency, and the Nyquist frequency is just half the sampling frequency.
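As a quick worked example (a sketch, not part of the lecture's own demos): at a sampling rate of 16 kHz, the Nyquist frequency is 8 kHz, and the digital signal can contain nothing above that.

```python
def nyquist_frequency(sampling_rate_hz):
    # The Nyquist frequency is half the sampling frequency.
    return sampling_rate_hz / 2

for fs in (8000, 16000, 44100, 48000):
    print(f"sampling rate {fs} Hz -> Nyquist frequency {nyquist_frequency(fs):.0f} Hz")
```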
But what would happen then, if we did try to sample a signal whose frequency is higher than the Nyquist frequency?
Here's a sine wave, and let's sample it too infrequently: fewer than two samples per cycle, so its frequency is above the Nyquist frequency.
To make it easier to see what's happening, I'm going to draw a line between these points.
This waveform doesn't look anything like the original sine wave!
We've created a new signal that's definitely not a faithful representation of the sine wave.
This effect of creating a new frequency, which is related to the original signal and to the sampling frequency, is called aliasing.
It's something to be avoided!
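Here is a small numerical sketch of that effect, assuming an 8 kHz sampling rate and a 6 kHz cosine (both values chosen purely for illustration): the samples we record are exactly the samples of a 2 kHz cosine, the alias.

```python
import numpy as np

fs = 8000.0              # sampling rate: 8 kHz, so the Nyquist frequency is 4 kHz
f_signal = 6000.0        # a 6 kHz cosine, which is above the Nyquist frequency
t = np.arange(16) / fs   # a few sample times

sampled = np.cos(2 * np.pi * f_signal * t)

# The 6 kHz signal "folds" down to fs - f_signal = 2 kHz: that new frequency is the alias.
alias = np.cos(2 * np.pi * (fs - f_signal) * t)

print(np.allclose(sampled, alias))   # True: the two sets of samples are identical
```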
Whenever we sample an analogue signal, we must first remove all frequencies above the Nyquist frequency, otherwise we'll get aliasing.
We must also do that if we take a digital signal like this one and reduce its sampling frequency.
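One common way to do that in practice (a sketch, assuming scipy and a hypothetical 16 kHz file called speech_16k.wav) is to use a resampler that applies the anti-aliasing low-pass filter for you before discarding samples.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

fs_in, x = wavfile.read("speech_16k.wav")   # hypothetical 16 kHz mono file

# resample_poly applies a low-pass (anti-aliasing) filter before decimating,
# so everything above the new Nyquist frequency (4 kHz here) is removed first.
y = resample_poly(x.astype(np.float64), up=1, down=2)

wavfile.write("speech_8k.wav", fs_in // 2, y.astype(np.int16))
```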
Let's listen to the effect of sampling frequency.
These are all correctly-sampled signals.
We've removed everything above the Nyquist frequency before changing the sampling rate.
For speech, a sampling rate of 16 kHz is adequate, and that sounds fine.
Let's listen to reducing the sampling rate.
We've lost some of the high frequencies.
We've lost even more of the high frequencies.
And even more of them.
Even at this very low sampling rate of 4 kHz, speech is still intelligible.
We can still perceive pitch, but we've lost some of the sounds.
The fricatives are starting to go because they're at higher frequencies.
Hopefully, you've noticed that I've been using a few different terms interchangeably.
I've said 'sampling frequency', I've said 'sampling rate', or perhaps just 'sample rate'.
Those are all interchangeable terms that mean the same thing.
So we've dealt with making time discrete.
That's the most important decision: to choose the sampling rate.
For Automatic Speech Recognition, 16 kHz will be just fine, but for Speech Synthesis typically we'd use a higher sampling rate than that.
Let's turn now to making amplitude digital: making amplitude discrete.
Here's a waveform that I've sampled: I've chosen the sampling rate and we have samples evenly spaced in time.
Now I've got to write down the value of each sample, and I've got to write that down as a binary number, and that means I have to choose how many bits to use for that binary number.
Maybe I'll choose to use two bits, and that will give me four levels.
So each one of these samples would just have to be stored as the nearest available value: that's called quantisation.
We need to choose a bit depth, but there is a very common value, and that's 16 bits per sample, and that gives us 2 to the power 16 available discrete levels.
We have to use them to span both the negative and positive parts of the amplitude axis.
Sometimes, in some pieces of software, you might see the amplitude axis labelled with sample values.
That would go from -32,768 up to +32,767 because one of the values has to be zero.
The number of bits used to store each sample is called the bit depth.
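Here is a minimal sketch of that quantisation step, assuming samples already scaled to the range -1 to +1; the scaling convention used (multiplying by half the number of levels) is one common choice, not the only one.

```python
import numpy as np

def quantise(x, bit_depth):
    # Map samples in [-1.0, +1.0] to the nearest of 2**bit_depth integer levels.
    half = 2 ** (bit_depth - 1)                 # e.g. 32,768 for 16 bits
    codes = np.round(x * half).astype(np.int64)
    # One level must be zero, so the range is -half .. half - 1
    # (e.g. -32,768 .. +32,767 for 16 bits).
    return np.clip(codes, -half, half - 1)

x = np.sin(2 * np.pi * np.linspace(0, 1, 16))   # a toy waveform in [-1, +1]
print(quantise(x, 16))                           # 16-bit sample values
print(quantise(x, 2))                            # only four distinct values remain
```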
Let's listen to the effect of changing the bit depth, and in particular reducing it from this most common value of 16 bits to some smaller value.
That sounds absolutely fine.
That sounds pretty good.
Listen on headphones, and you might hear small differences.
That sounds pretty nasty.
Brace yourself: we're going down to two bits...
Pretty horrible!
It's quite different though to the effect of changing the sampling frequency.
Reducing the bit depth is like adding noise to the original signal.
In fact, it is adding noise to the original signal because each sample has to be moved up or down to the nearest possible available value.
With fewer bits, there are fewer values and therefore more noise is introduced, noise being the error between the quantised signal and the original.
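Here is a rough sketch of that idea, measuring the quantisation error for a test tone at a few bit depths (the quantiser is a simple round-to-the-nearest-level version, just for illustration): the signal-to-noise ratio falls by roughly 6 dB for every bit we remove.

```python
import numpy as np

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # one second of a 440 Hz tone

for bits in (16, 8, 4, 2):
    step = 2.0 / 2 ** bits                  # spacing between available levels in [-1, +1]
    quantised = np.round(x / step) * step   # move each sample to the nearest level
    noise = quantised - x                   # the error introduced by quantisation
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))
    print(f"{bits:2d} bits: signal-to-noise ratio ~ {snr_db:5.1f} dB")
```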
Very rarely do we bother reducing the bit depth, and we stick with 16 bits for almost everything.
With two bits, we can actually see those values.
If we look carefully at this waveform, we can see that it contains only four different values.
Those relate to the four possible values we get with two binary bits.
We started in the time domain, with an analogue signal provided by a microphone.
That's an analogue of the pressure variation measured by that microphone.
But now we have a digital version of that signal.
Going digital means that we can now do all sorts of sophisticated, exciting operations on the signal using a computer.
That's extremely convenient.
But you must always be aware that the digital signal has limitations.
We have made approximations.
The most important limitation to always bear in mind is the sampling frequency.
That's something we might want to vary, depending on our application.
Bit depth is also something to bear in mind, but in practical terms and for our purposes, we're just going to use a universal bit depth of 16.
That's plenty: the quantisation noise is negligible, and we won't generally be varying that value.
Now that we have a digital signal, we're ready to do some speech processing.
One of the most important processes is to use Fourier analysis to take us from the time domain to the frequency domain.
