Including the important issue of labelling the data
Hunt & Black: Unit selection in a concatenative speech synthesis system using a large speech database
The classic description of unit selection, formulated as a search through a network.
Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system
A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.
Jurafsky & Martin – Section 8.5 – Unit Selection (Waveform) Synthesis
A brief explanation. Worth reading before tackling the more substantial chapter in Taylor (Speech Synthesis course only).
Taylor – Chapter 16 – Unit-selection synthesis
A substantial chapter covering target cost, join cost and search.
Clark et al: Festival 2 – build your own general purpose unit selection speech synthesiser
Discusses some of the design choices made when writing Festival’s unit selection engine (Multisyn) and the tools for building new voices.
Interactive toy demo
A short video demonstration of unit selection. You can find the actual interactive demo on this website. Have a play with it yourself!
Search
With multiple candidates available for each target position, a search must be performed.
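The search can be sketched as dynamic programming (Viterbi) over the lattice of candidates. This is a minimal illustration, not Festival's implementation: the function name `viterbi_select` and the two cost functions passed in as arguments are assumptions made for the example.

```python
def viterbi_select(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost).

    candidates[i] is the list of candidate units for targets[i].
    target_cost(target, unit) and join_cost(left_unit, right_unit)
    are caller-supplied functions (hypothetical for this sketch).
    """
    # Cheapest cost of any path ending at each candidate of position 0.
    best = [target_cost(targets[0], c) for c in candidates[0]]
    back = []  # back[i-1][k]: best predecessor index for candidate k at position i

    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for c in candidates[i]:
            # Cost of reaching c from each candidate at the previous position.
            costs = [best[j] + join_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(targets[i], c))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)], total
```

Because only the cheapest path into each candidate is kept at every position, the work grows with the number of positions times the square of the number of candidates per position, rather than with the number of complete sequences.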
Target cost and join cost
To choose between the many possible sequences of candidate units, we need to quantify how good each possible sequence will sound.
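One common way to quantify this is to sum a target cost for every unit and a join cost for every pair of adjacent units. The sketch below assumes that formulation; the feature names (`stress`, `phrase_final`, `f0_start`, `f0_end`) and the weights are hypothetical, chosen only to keep the example small.

```python
def target_cost(target, candidate, weights):
    """Weighted count of linguistic features on which the candidate
    differs from the target specification."""
    return sum(w for feature, w in weights.items()
               if target.get(feature) != candidate.get(feature))


def join_cost(left, right):
    """Toy join cost: the F0 discontinuity at the concatenation point.
    Real systems usually also compare spectral shape and energy."""
    return abs(left["f0_end"] - right["f0_start"])


def sequence_cost(targets, units, weights):
    """Total cost of one candidate sequence: all target costs plus
    the join costs between consecutive units."""
    t = sum(target_cost(tg, u, weights) for tg, u in zip(targets, units))
    j = sum(join_cost(a, b) for a, b in zip(units, units[1:]))
    return t + j
```

In practice the two kinds of cost are measured on different scales, so they are normally normalised and weighted against each other before being summed; that step is omitted here.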
Target and candidate units
We use the linguistic specification from the front end to define a target unit sequence. Then, we find all potential candidate units in the database.
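As a rough illustration of the lookup step, the database can be indexed by unit type (for example, a diphone name) so that every target position retrieves all stored units of the same type. The dictionary layout and the `type` field are assumptions made for this sketch, not the format used by any particular toolkit.

```python
from collections import defaultdict


def build_index(database):
    """Group the recorded units by their type so lookup is a dictionary access."""
    index = defaultdict(list)
    for unit in database:
        index[unit["type"]].append(unit)
    return index


def candidates_for(targets, index):
    """For each target unit, return every database unit of the same type.
    A type missing from the database yields an empty list."""
    return [index.get(target["type"], []) for target in targets]
```

An empty candidate list signals a unit type the database does not cover, which a real system has to handle with some form of backing off.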
Key concepts
Linguistic context affects the acoustic realisation of speech sounds. But several different linguistic contexts can lead to almost the same sound. Unit selection takes advantage of this “interchangeability”.
Taylor – Chapter 3 – The text-to-speech problem
Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.