Module 5 – speech synthesis – waveform generation

Manipulating recorded speech signals to create new utterances.

Module status: complete.

This module moves away temporarily from speech signals and is about the text processing required for Text-To-Speech (TTS). This part of a TTS system is often called the ‘front end’.

Here’s what you’re going to learn in the videos:

Total video to watch in this section: 23 minutes

There are some extra readings on signal processing in this module, if you’d like to revisit that material from another author’s perspective. Taylor’s book contains a large amount of further signal processing material, which is worth exploring.

Reading

Jurafsky & Martin – Section 8.4 – Diphone Waveform Synthesis

A simple way to generate a waveform is by concatenating speech units from a pre-recorded database. The database contains one recording of each required speech unit.
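To make the idea concrete, here is a minimal sketch in Python of concatenation from a database holding one recording per unit. It is an illustration only, not the method of any particular system: the function names, the toy database and its sample values, and the optional linear cross-fade at each join are all assumptions made for this example.

```python
def phones_to_diphones(phones):
    """Name the diphone spanning each phone-to-phone transition."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]


def concatenate_units(unit_names, database, overlap=0):
    """Join stored waveforms in order.

    If overlap > 0, the last `overlap` samples already in the output are
    linearly cross-faded with the first `overlap` samples of the next unit.
    A unit missing from the database raises KeyError.
    """
    output = []
    for name in unit_names:
        unit = database[name]
        if output and overlap > 0:
            n = min(overlap, len(output), len(unit))
            for k in range(n):
                fade_in = (k + 1) / (n + 1)
                idx = len(output) - n + k
                output[idx] = (1.0 - fade_in) * output[idx] + fade_in * unit[k]
            output.extend(unit[n:])
        else:
            output.extend(unit)
    return output


# Toy database: exactly one (made-up) waveform per diphone.
toy_database = {
    "sil-k": [0.0, 0.1, 0.2],
    "k-ae": [0.2, 0.5, 0.4, 0.3],
    "ae-t": [0.3, 0.1, 0.0],
    "t-sil": [0.0, -0.1, 0.0],
}

diphones = phones_to_diphones(["sil", "k", "ae", "t", "sil"])
waveform = concatenate_units(diphones, toy_database)
smoothed = concatenate_units(diphones, toy_database, overlap=1)
```

Because there is only one recording of each unit, there is nothing to choose between at synthesis time: the unit sequence fully determines the output, apart from any smoothing at the joins.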

Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis

A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).

Holmes & Holmes – Chapter 5 – Message synthesis from stored human speech components

Pitch-synchronous overlap-and-add (PSOLA) remains a key technique in speech signal processing.
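As a rough illustration of the overlap-add step at the heart of time-domain PSOLA, the Python sketch below cuts two-period, Hann-windowed frames centred on pitch marks and adds them back together at a new spacing. It rests on strong simplifying assumptions that real speech does not satisfy: the pitch period is constant, the pitch marks (epochs) are already known, and no frames are duplicated or deleted, so changing the spacing also changes the duration. The function names and the test signal are invented for this example.

```python
import math


def hann(length):
    """Symmetric Hann window (zero at both ends)."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def psola_resynthesise(signal, epochs, period, new_period):
    """Overlap-add two-period frames centred on `epochs` at a new spacing.

    `period` and `new_period` are in samples. Frame i is re-centred at
    period + i * new_period in the output.
    """
    window = hann(2 * period + 1)
    out_len = (len(epochs) - 1) * new_period + 2 * period + 1
    output = [0.0] * out_len
    for i, epoch in enumerate(epochs):
        new_centre = period + i * new_period
        for k in range(-period, period + 1):
            src = epoch + k
            if 0 <= src < len(signal):
                output[new_centre + k] += window[k + period] * signal[src]
    return output


# Synthetic "voiced" signal: a sine with a 50-sample period.
period = 50
signal = [math.sin(2.0 * math.pi * n / period) for n in range(500)]
epochs = list(range(period, 451, period))  # one pitch mark per period

same_pitch = psola_resynthesise(signal, epochs, period, period)
higher_pitch = psola_resynthesise(signal, epochs, period, 40)
```

With the spacing unchanged, the overlapping Hann windows sum to one, so the signal between the first and last pitch marks is reconstructed unchanged. Placing the frames closer together raises the fundamental frequency. A complete implementation would also repeat or drop frames to control duration independently.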

Taylor – Section 10.1 – Analogue signals

It's easier to start by understanding physical signals, which are analogue, before we approximate them digitally.

Taylor – Section 10.2 – Digital signals

Going digital involves approximations in the way an original analogue signal is represented.
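The two approximations can be illustrated with a short Python sketch: sampling takes values only at discrete times, and quantisation rounds each value to one of a finite set of levels. The sine wave standing in for the analogue signal, the function names, and the chosen sampling rate and bit depths are all assumptions for this example and are not taken from Taylor.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sample a unit-amplitude sine wave at discrete time instants."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def quantise(samples, bits):
    """Round each sample in [-1, 1] to the nearest representable level.

    Uses a symmetric range of -max_int..max_int, ignoring the one extra
    negative value that a signed integer of this width can hold.
    """
    max_int = 2 ** (bits - 1) - 1
    return [round(x * max_int) / max_int for x in samples]


samples = sample_sine(440.0, 16000, 0.01)  # 10 ms at 16 kHz
coarse = quantise(samples, 8)
fine = quantise(samples, 16)

max_error_8 = max(abs(a - b) for a, b in zip(samples, coarse))
max_error_16 = max(abs(a - b) for a, b in zip(samples, fine))
```

The quantisation error is bounded by half a step, so each extra bit halves the worst-case error. The sampling rate separately limits the highest frequency that can be represented (half the sampling rate).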

Taylor – Section 12.7 – Pitch and epoch detection

Only an outline of the main approaches, with little technical detail. Useful as a summary of why these tasks are harder than you might think.

Holmes & Holmes – Chapter 6 – Phonetic Synthesis by Rule

Mainly of historical interest.

Ladefoged (Elements) – Chapter 10 – Fourier analysis

An attempt to explain Fourier analysis. Although chapters 1-9 are great, I actually do not recommend chapter 10.

Ladefoged (Elements) – Chapter 11 – Digital filters and LPC analysis

A brave attempt to use 'long hand' to spell out how LPC analysis works, but not a recommended reading.

This is a SKILLS tutorial about writing up the first assignment.

Prepare for the tutorial session

There will be an exercise for which you need to bring a sample of your writing (150-200 words). Prepare this as a Word document (even if you’re writing up in LaTeX or another tool) named with your name and student number (e.g., s1234567_Jane.doc). This should be part of the draft of your Speech Processing coursework. Be ready to share it only with the tutor at the start of the session (you can send it via a direct message in Teams).

After the tutorial session

Each tutorial group will pool all their writing samples. Each student will select a writing sample from another student in the group. As far as possible, choose one written by someone who speaks a different first language to you. For example, native speakers should select one written by a non-native, a speaker of German might choose one written by a speaker of Chinese, and so on.

Below the original text that you receive, write a version that is shorter and clearer. Return it to the author.

Practical assignment

Continue the first assignment. Complete all the milestones to date. Use the forums to get help.