Taylor – Section 17.1 – Databases

Including the important issue of labelling the data

Hunt & Black: Unit selection in a concatenative speech synthesis system using a large speech database

The classic description of unit selection, described as a search through a network.

Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system

A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.

Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis

A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).

Taylor – Chapter 16 – Unit-selection synthesis

A substantial chapter covering target cost, join cost and search.

Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser

Discusses some of the design choices made when writing Festival’s unit selection engine (Multisyn) and the tools for building new voices.

Interactive toy demo

A short video demonstration of unit selection. You can find the actual interactive demo on this website. Have a play with it yourself!

Search

With multiple candidates available for each target position, a search must be performed.

Target cost and join cost

To choose between the many possible sequences of candidate units, we need to quantify how good each possible sequence will sound.

Target and candidate units

We use the linguistic specification from the front end to define a target unit sequence. Then, we find all potential candidate units in the database.

Key concepts

Linguistic context affects the acoustic realisation of speech sounds. But several different linguistic contexts can lead to almost the same sound. Unit selection takes advantage of this “interchangeability”.

Taylor – Chapter 3 – The text-to-speech problem

Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.