Module 3 – Digital Speech Signals

What are spectrograms really? An introduction to Digital Signal Processing and the Discrete Fourier Transform


Remember to watch the module videos before the Thursday Lecture!

In this module we will look at how the speech produced by a physical system is captured as a digital signal, i.e., how it goes from a pressure wave to a sequence of 0s and 1s on your computer. We will see how engineering decisions affect what we can capture and what sort of analysis we can do. The most important constraint we will come up against is sampling rate. Given a digitized waveform, we then introduce the Discrete Fourier Transform (DFT) as a method of mapping from the time domain to the frequency domain. The DFT is what allows us to create spectrograms. We’ll see that the frequency domain is a much more convenient place to do speech processing than the time domain. But, again, the fact that we’re working with digital signals determines what we actually get out of a spectrogram.

The Discrete Fourier Transform (and signal processing in general) uses sine and cosine functions a lot. If you haven’t thought about sines and cosines for a while (SOH CAH TOA ring any bells?), you might also want to brush up on some trigonometry and vectors:

a great intro to vectors/linear algebra
a quick trig refresher
a nice video on radians
A great primer on sine waves, trigonometry and sampling and other concepts relating to Digital Signal Processing which this Module will touch on.

We won’t ask you to derive the DFT equation etc, but knowing a bit more maths will help develop your intuitive understanding of what’s going on here.

Here’s what you are going to learn in this module’s videos:

Lecture Slides

Lecture 3 slides (Google Slides) [updated 1/10/2024]

Total video to watch in this section: 37 minutes.

We’re moving to the Wayland (2019) textbook for modules 3 and 4, though the material will be presented in a somewhat different order. We’re focusing on digital signals this week, but you might also want to look at Chapter 6 “Basic Acoustics” (it’s also the module 4 essential reading, so if you read it this week you’ll be ahead for that; if you don’t get to it until next week, don’t worry!). You will have already got some of this from the previous module readings (e.g. the source-filter model), but we’ll go into more detail in the coming weeks.

The background reading on digital signals is fairly high level. If you want to see more maths, you can look at the extension materials in the labs or the references mentioned there.

Reading

Wayland (Phonetics) – Chapter 7 – Digital Signal Processing

An intuitive introduction to digital signal processing

Wayland (Phonetics) – Chapter 6 – Basic Acoustics

A concise introduction to acoustics: sounds, resonance, and the source-filter theory

Schaedler – Seeing Circles, Sines and Signals

A very nice concise primer on the basic components of digital signal processing with great visual demonstrations.

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including 'Fourier Analysis').

Lyons – Understanding Digital Signal Processing

A great introduction to digital signal processing, including the maths!

This SIGNALS lab is about taking signals from the time domain into the frequency domain and will focus on analysing digital signals with the Discrete Fourier Transform.

The signal processing labs will use Jupyter notebooks: a combination of Python code and notes that you access using a web browser.

You can find a github repository for the signals notebooks here: https://github.com/laic/uoe_speech_processing_course. You can find the notebooks for this module in that repository under signals/signals-lab-1.

Getting started with Jupyter Notebooks

Don’t worry if you don’t know any Python – this is not a formal requirement of the course, and you’ll learn what you need simply by doing the exercises. You can run the notebooks on your personal computer, but we’d suggest that you use the Edina Noteable service, which you can access from the Learn site for this course. You can also use direct login if you are already logged into Learn.

Use a web browser to open the instructions; read them through once before doing anything else.
The default would be to then follow “Run the notebooks online using Edina Noteable“.
- If you feel confident or already know Jupyter notebooks, you could also run the notebooks on your own computer: follow “Running Jupyter Notebooks on your computer“
- This involves installing Python 3.9 and Miniconda on your own computer, but these are things you will find useful more generally in the future too. This uses around 1 GB of disk space.
Once you have succeeded, finish this task by completing Section 4 of the Jupyter notebook sp-m0-how-to-start

You can always ask the tutors in the lab sessions for help with setting things up!

For further technical support in setting up Jupyter Notebooks, use this forum.

Please note, the material in the notebooks is there to support your learning. Nothing related to the code specifically is directly assessed (though the concepts in the notebooks marked essential may be). You don’t need to know Python to do this course, though MSc SLP students will need to know Python for many other courses, so getting a bit more practice/exposure definitely doesn’t hurt!

Notebooks for this module

First have a look at the guide to the signals notebooks. You can just look in the signals directory once you’ve downloaded your own copy of the notebook repository.

After that, there is one essential notebook to work through:

Digital Speech Signals

The lab is set up so the focus is mostly on changing small bits of existing code. However, if you are already experienced with Python and numpy/matplotlib/librosa, you may want to try coding up some of the steps we take yourself.

The notebooks are relatively light on the maths behind these technologies, but there are also some extension materials in the signals-lab-1/extension directory that go into more detail.

Answers/Notes for Module 3 lab

signals-1-digital-speech-signals-answers.ipynb

What does a microphone do?

A microphone converts variations in air pressure to an analogue electrical signal.

Describe what these two signals would sound like: 1) a sequence of impulses at 5 Hz; 2) a sequence of impulses at 200 Hz.

Signal 1 will be perceived as a rapid succession of individual click sounds, but signal 2 will be perceived as a continuous sound with a pitch.

An engineering term used to describe periodic signals is "deterministic". What is the corresponding term for aperiodic signals?

Stochastic.

What difference in the fundamental frequency of a signal is required to change its pitch by one octave?

The fundamental frequency needs to be doubled to raise the pitch by one octave, or halved to lower the pitch by one octave.

What is the difference between "sampling" and "quantisation"?

Sampling is the process of recording the amplitude of a signal only at specific moments in time (generally, that means a fixed number of times per second, evenly spaced in time). Each recorded value is called a sample. Quantisation is the process of storing the amplitude of each sample with a fixed precision (generally, that means as a binary number with a fixed number of bits).
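
The two steps can be sketched in a few lines of numpy. This is only an illustration and is not taken from the lab notebooks: the function names sample_sine and quantise, the 100 Hz sine, the 8 kHz sampling rate and the bit depths are all assumptions chosen for the example.

```python
import numpy as np

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sample a unit-amplitude sine wave at evenly spaced instants."""
    n_samples = int(round(sample_rate_hz * duration_s))
    t = np.arange(n_samples) / sample_rate_hz      # sample times in seconds
    return t, np.sin(2 * np.pi * freq_hz * t)

def quantise(x, n_bits):
    """Round amplitudes in [-1, 1] to one of 2**n_bits evenly spaced levels."""
    n_levels = 2 ** n_bits
    step = 2.0 / (n_levels - 1)                    # spacing between adjacent levels
    return np.round((x + 1.0) / step) * step - 1.0

# Sampling: 10 ms of a 100 Hz sine, recorded 8000 times per second
t, x = sample_sine(freq_hz=100, sample_rate_hz=8000, duration_s=0.01)

# Quantisation: store each sample with 3 bits or with 16 bits
x_3bit = quantise(x, n_bits=3)
x_16bit = quantise(x, n_bits=16)

print(len(x))                            # number of samples recorded
print(len(np.unique(x_3bit)))            # no more than 2**3 distinct amplitude values
print(np.max(np.abs(x - x_3bit)))        # largest rounding error with 3 bits
print(np.max(np.abs(x - x_16bit)))       # far smaller rounding error with 16 bits
```

Changing sample_rate_hz alters how many samples are taken per second, while changing n_bits alters how precisely each one is stored; the two choices are independent.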

When converting a waveform to a sequence of frames, why is the frame shift usually smaller than the frame duration?

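Because a tapered window is applied to each frame. Tapering attenuates the samples near the edges of a frame, so the frames are overlapped to make sure that every part of the waveform falls near the centre of at least one frame.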
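
One way to see this is to add up the window weights that each sample receives across all frames. The sketch below assumes a Hann window, a 25 ms frame duration and a 10 ms frame shift at 16 kHz; these are common choices used here for illustration, and window_coverage is a hypothetical helper rather than anything from the course code.

```python
import numpy as np

def window_coverage(n_samples, frame_len, frame_shift):
    """Add up the Hann windows of all frames to see how each sample is weighted."""
    window = np.hanning(frame_len)
    total = np.zeros(n_samples)
    for start in range(0, n_samples - frame_len + 1, frame_shift):
        total[start:start + frame_len] += window
    return total

sample_rate = 16000
n_samples = sample_rate                      # 1 s of signal
frame_len = sample_rate * 25 // 1000         # 25 ms frame duration = 400 samples
frame_shift = sample_rate * 10 // 1000       # 10 ms frame shift = 160 samples

interior = slice(frame_len, n_samples - frame_len)   # ignore the ends of the signal
overlapping = window_coverage(n_samples, frame_len, frame_shift)[interior]
back_to_back = window_coverage(n_samples, frame_len, frame_len)[interior]

print(overlapping.min())    # every sample keeps a substantial total weight
print(back_to_back.min())   # samples at frame edges get a weight of about zero
```

With a frame shift equal to the frame duration, whatever happens at the frame boundaries is almost invisible to the analysis; with a smaller shift, those samples sit near the middle of a neighbouring frame.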

How can series expansion be used to remove high-frequency noise from a waveform?

By truncating the series, which means setting to zero the coefficients of all the basis functions above a frequency of our choosing.
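
A minimal sketch of this idea, assuming numpy's real-input FFT as the series expansion: the function name lowpass_by_truncation, the 1 kHz cutoff and the synthetic test signal (a 200 Hz tone plus a 3 kHz tone standing in for noise) are illustrative assumptions. Real noise is usually spread over many frequencies, and practical filters are rarely this abrupt.

```python
import numpy as np

def lowpass_by_truncation(x, sample_rate, cutoff_hz):
    """Zero every Fourier coefficient above cutoff_hz, then resynthesise."""
    coeffs = np.fft.rfft(x)                              # one coefficient per basis function
    freqs = np.fft.rfftfreq(len(x), d=1 / sample_rate)   # frequency of each basis function
    coeffs[freqs > cutoff_hz] = 0                        # truncate the series
    return np.fft.irfft(coeffs, n=len(x))

sample_rate = 8000
t = np.arange(800) / sample_rate                 # 0.1 s of signal
clean = np.sin(2 * np.pi * 200 * t)              # 200 Hz component we want to keep
noise = 0.3 * np.sin(2 * np.pi * 3000 * t)       # 3 kHz component standing in for noise

filtered = lowpass_by_truncation(clean + noise, sample_rate, cutoff_hz=1000)
print(np.max(np.abs(filtered - clean)))          # very small: the 3 kHz component is gone
```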

In Fourier analysis, what are the frequencies of the lowest and highest basis functions?

The lowest frequency basis function has a fundamental period equal to the analysis frame duration. The highest frequency basis function is at the Nyquist frequency.
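
As a worked example, the basis function frequencies are integer multiples of the sampling rate divided by the number of samples in the frame. The sketch below assumes a 16 kHz sampling rate and a 400-sample (25 ms) frame; basis_frequencies is a hypothetical helper that computes the same values as numpy's np.fft.rfftfreq.

```python
import numpy as np

def basis_frequencies(n_samples, sample_rate):
    """Frequencies (Hz) of the DFT basis functions from 0 Hz up to Nyquist."""
    k = np.arange(n_samples // 2 + 1)
    return k * sample_rate / n_samples

sample_rate = 16000
n_samples = 400                                   # a 25 ms analysis frame
freqs = basis_frequencies(n_samples, sample_rate)

print(freqs[1])                                   # lowest non-zero frequency: 40.0 Hz
print(1 / freqs[1], n_samples / sample_rate)      # its period equals the frame duration
print(freqs[-1])                                  # highest frequency: 8000.0 Hz (Nyquist)
```

Here the first entry, 0 Hz, is the constant (DC) term, so the lowest frequency referred to in the answer is freqs[1].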

Why is a logarithmic scale usually used for magnitude when plotting the magnitude spectrum?

Because of the wide variation in the amount of energy at different frequencies. For example, the amount of energy in the lower frequency region around F0 is many times greater than the amount of energy above, say, 4 kHz. Try plotting the magnitude spectrum of a speech sound on a linear scale yourself, to illustrate this.
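
If you want to try this without a recording to hand, the sketch below uses a synthetic stand-in for speech: a strong 200 Hz component plus a 4 kHz component a thousand times weaker. The signal, the Hann window and the function name magnitude_spectrum_db are assumptions made for illustration.

```python
import numpy as np

def magnitude_spectrum_db(frame, floor=1e-10):
    """Magnitude spectrum of one frame, in decibels relative to its peak."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    magnitude = np.maximum(magnitude, floor)       # avoid taking the log of zero
    return 20 * np.log10(magnitude / magnitude.max())

sample_rate = 16000
t = np.arange(400) / sample_rate
# Synthetic stand-in for speech: strong low-frequency energy, weak energy at 4 kHz
frame = 1.0 * np.sin(2 * np.pi * 200 * t) + 0.001 * np.sin(2 * np.pi * 4000 * t)

spectrum_db = magnitude_spectrum_db(frame)
freqs = np.fft.rfftfreq(len(frame), d=1 / sample_rate)
bin_4k = np.argmin(np.abs(freqs - 4000))           # DFT output closest to 4 kHz

print(spectrum_db.max())       # the peak, at 0 dB by construction
print(spectrum_db[bin_4k])     # roughly 60 dB below the peak
```

On a linear axis the 4 kHz component would be about one thousandth of the height of the peak and effectively invisible; on a decibel axis it appears as a clear peak about 60 dB down.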

Module 3 gave an introduction to digital signal processing. We hope you can now see the connection between articulatory and acoustic phonetics, and how we might use our knowledge of this to start building computational models for the analysis of speech. Now is a good time to think about what computational models would need to capture in order to characterise different speech sounds, e.g. properties of vowels (like formants), and consonants (like frication or bursts).

The fact that we have to digitize the speech signal and do short term analysis on speech results in specific design decisions, e.g. the sampling rate, window size. We can attempt to get a ‘better’ view of the spectral characteristics of speech by engineering the size and type of windows we use as inputs to the Discrete Fourier Transform (DFT).

Both signal processing and acoustic phonetics are massive fields. We do not expect you to show mastery of both in only a few weeks! For the purposes of this course, you should be able to:

Explain how sampling rate determines which frequencies can be captured in a digital speech signal, i.e., What is aliasing? What is the Nyquist Frequency?
Explain how the DFT is used to generate a spectrogram
Describe what the output of the DFT is, what the magnitude and phase spectrums are, and why we only visualize the first half of the magnitude spectrum in a spectrogram.
Describe how input size and sampling rate determine which frequencies can be analysed in a spectrogram
Describe what spectral leakage is and when it occurs
Describe how window shape can affect the shape of the magnitude spectrum
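
The last two points can be explored with a short experiment. The sketch below is an illustration under assumed values (an 8 kHz sampling rate, 256-sample frames and a Hann window as the tapered window) and is not the lab's own code. It compares a sinusoid that completes a whole number of cycles in the frame with one that does not.

```python
import numpy as np

def magnitude_db(frame):
    """Magnitude spectrum in dB relative to the largest component."""
    mag = np.abs(np.fft.rfft(frame))
    return 20 * np.log10(np.maximum(mag, 1e-12) / mag.max())

sample_rate = 8000
n = 256
t = np.arange(n) / sample_rate
bin_width = sample_rate / n                            # 31.25 Hz between DFT outputs

on_bin = np.sin(2 * np.pi * (32 * bin_width) * t)      # whole number of cycles in the frame
off_bin = np.sin(2 * np.pi * (32.5 * bin_width) * t)   # the frame cuts a cycle short

far = slice(64, 128)                                   # DFT outputs well away from the sinusoid
leak_on = magnitude_db(on_bin)[far].max()
leak_rect = magnitude_db(off_bin)[far].max()
leak_hann = magnitude_db(off_bin * np.hanning(n))[far].max()

print(leak_on)      # essentially nothing at other frequencies
print(leak_rect)    # clear leakage when no tapering is applied
print(leak_hann)    # a tapered window pushes the leakage much lower
```

The trade-off, not shown by these numbers, is that the tapered window also widens the main peak around the sinusoid's own frequency.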

It’s out of scope for this class, but for real applications we also have to consider how fast our algorithms are. In general, you will be using the Fast Fourier Transform (FFT), an efficient algorithm for computing the DFT that exploits certain repetitions/overlaps in how the separate DFT outputs are calculated to get the results faster. The Python numpy FFT function is used in the module 3 lab, notebook 1, to show you what to expect if you try an off-the-shelf DFT implementation. The time gains aren’t really noticeable for the small examples used in the lab notebooks, but when dealing with real world data, the optimizations of the FFT make a big difference. In general, you will see that there are many ways to solve the same problem. We’ll come back to this later in the course, when we look at an algorithmic method called dynamic programming.
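
To make the relationship concrete, here is a minimal sketch that evaluates the DFT definition directly and checks it against numpy's FFT. The function naive_dft and the random test signal are illustrative assumptions; the lab notebook may set this up differently.

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of the DFT definition: N outputs, each a sum over N inputs."""
    n = len(x)
    k = np.arange(n).reshape(-1, 1)               # output (frequency) index
    m = np.arange(n).reshape(1, -1)               # input (time) index
    basis = np.exp(-2j * np.pi * k * m / n)       # N x N matrix of complex sinusoids
    return basis @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                      # arbitrary test signal

slow = naive_dft(x)                               # roughly N * N multiplications
fast = np.fft.fft(x)                              # same outputs via the FFT algorithm

print(np.allclose(slow, fast))                    # True: the results agree
```

The direct version does work proportional to N squared, whereas the FFT needs work proportional to N log N, which is why the difference only becomes noticeable for long inputs or many frames.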

What you should know

Some more detailed notes on what’s examinable from this module:

Digital Signals:

Explain how bit depth (i.e. quantisation) affects the quality of a digitized speech signal
Sampling rate: Explain how sampling rate determines which frequencies can be captured in a digital speech signal, i.e. how this relates to:
- Aliasing
- Nyquist Frequency
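
A quick way to convince yourself of the link between these two ideas is to sample a sinusoid that lies above the Nyquist frequency and see where it ends up in the spectrum. The sketch below assumes an 8 kHz sampling rate and test tones at 1 kHz and 7 kHz; dominant_frequency is a hypothetical helper.

```python
import numpy as np

def dominant_frequency(x, sample_rate):
    """Frequency (Hz) of the largest peak in the magnitude spectrum."""
    magnitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sample_rate)
    return freqs[np.argmax(magnitude)]

sample_rate = 8000                                # so the Nyquist frequency is 4000 Hz
t = np.arange(sample_rate) / sample_rate          # 1 s of sample times

below_nyquist = np.sin(2 * np.pi * 1000 * t)      # 1 kHz: captured correctly
above_nyquist = np.sin(2 * np.pi * 7000 * t)      # 7 kHz: too high for this sampling rate

print(dominant_frequency(below_nyquist, sample_rate))   # about 1000 Hz
print(dominant_frequency(above_nyquist, sample_rate))   # also about 1000 Hz: an alias
```

Once sampled, the 7 kHz tone is indistinguishable from a 1 kHz tone (8000 - 7000 = 1000), which is why frequencies above the Nyquist frequency have to be filtered out before sampling.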

Short Term analysis: Why do we do short term analysis on speech (i.e. windowing)?

Series expansion, Fourier analysis, frequency domain:

What do we use the Discrete Fourier Transform for?
- i.e. mapping from the time domain to the frequency domain
- Interpret the DFT as a series expansion of a complex waveform
Describe what the output of the DFT is:
- If the input is a sequence length N, how many outputs are there?
- What do the magnitude and phase spectrums represent? What’s on the x and y axes?
- Why do we only visualize the first half of the magnitude spectrum in a spectrogram? (i.e. link to aliasing)
Describe how input size and sampling rate determine which frequencies can be analysed in a spectrogram:
- Calculate what frequencies are represented in the DFT output
- When does spectral leakage occur? (see Lecture, lab notebook)
Describe how window shape can affect the shape of the magnitude spectrum, i.e. why do we want a tapered window?
What’s the relationship between the DFT and what we actually see in a spectrogram?
What’s the difference between a narrow versus a wide band spectrogram?
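
For the last few points, it can help to work out the numbers for two window lengths. In the sketch below, the 5 ms and 50 ms window durations and the 16 kHz sampling rate are illustrative assumptions (not values prescribed by the course), and analysis_resolution is a hypothetical helper.

```python
def analysis_resolution(window_ms, sample_rate):
    """Window length in samples and the spacing between DFT outputs in Hz."""
    n_samples = int(round(window_ms * sample_rate / 1000))
    bin_spacing_hz = sample_rate / n_samples
    return n_samples, bin_spacing_hz

sample_rate = 16000
settings = [("wide band (short window)", 5), ("narrow band (long window)", 50)]
for label, window_ms in settings:
    n_samples, bin_spacing_hz = analysis_resolution(window_ms, sample_rate)
    print(f"{label}: {n_samples} samples, DFT outputs every {bin_spacing_hz:.0f} Hz")
```

With DFT outputs only every 200 Hz, a short window cannot separate the individual harmonics of a typical voice but follows rapid changes in time well; with outputs every 20 Hz, a long window resolves the harmonics but smears events in time.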