Build the voice

The final stages of building the voice involve creating the information needed by the target and join costs, plus the representation of the speech needed for waveform generation.

We’re nearly there, and the remaining steps are almost entirely automatic.

Utterance structures
The target cost in Festival is computed using linguistic information, so we need to provide that information for all the candidate units in the database. This information is stored in utterance structures.
Pitch tracking
One component of the join cost is the fundamental frequency, F0. F0 is extracted in a separate step from pitch marking, although the two are closely related.
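To give a feel for what a pitch tracker estimates, here is a minimal sketch of F0 estimation for a single frame using autocorrelation. It is an illustration only, not the tool used when building the voice: the function name estimate_f0, the search range of 60 to 400 Hz, and the synthetic test signal are all assumptions made for this example, and a real pitch tracker would also make voicing decisions and smooth the contour over time.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (in Hz) of one frame by picking the autocorrelation peak.

    Hypothetical helper for illustration; returns 0.0 if no peak is found.
    """
    # Convert the F0 search range into a range of lags (in samples).
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else 0.0

# Synthetic test signal (an assumption for this sketch): 50 ms of a 100 Hz sine.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

The lag with the strongest self-similarity corresponds to one pitch period, which is why F0 and the pitch marks (the locations of individual periods) are so closely related.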
Join cost coefficients
The join cost measures potentially-audible mismatch at the points where candidate units from the database are joined. To make the runtime synthesis faster, we can precompute the acoustic features that are used by the join cost.
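The idea of precomputing join cost features can be sketched as follows. This is a simplified illustration under stated assumptions, not Festival's implementation: the feature layout (F0 followed by two spectral coefficients), the weights, and the function names precompute_edge_features and join_cost are all hypothetical.

```python
import math

def precompute_edge_features(frames, units):
    """Cache the feature vector at the left and right edge of every unit.

    frames: per-frame feature vectors (assumed layout: [f0, c1, c2])
    units:  (start_frame, end_frame) pairs, with end_frame inclusive
    """
    return [(frames[start], frames[end]) for start, end in units]

def join_cost(edges, left_unit, right_unit, weights):
    """Weighted Euclidean distance between the right edge of one unit
    and the left edge of the unit that would follow it."""
    right_edge_of_left = edges[left_unit][1]
    left_edge_of_right = edges[right_unit][0]
    return math.sqrt(sum(
        w * (a - b) ** 2
        for w, a, b in zip(weights, right_edge_of_left, left_edge_of_right)
    ))

# Toy database (made-up values): five frames and three candidate units.
frames = [
    [120.0, 1.0, 0.5],
    [122.0, 1.1, 0.4],
    [125.0, 1.2, 0.3],
    [180.0, 2.0, 1.5],
    [182.0, 2.1, 1.4],
]
units = [(0, 1), (2, 2), (3, 4)]

# Done once, offline, when the voice is built.
edges = precompute_edge_features(frames, units)

# Done many times at synthesis time, using only the cached edge vectors.
weights = [0.01, 1.0, 1.0]  # down-weight F0 because it is on a larger scale
cost_similar = join_cost(edges, 0, 1, weights)
cost_mismatched = join_cost(edges, 0, 2, weights)
print(cost_similar, cost_mismatched)
```

Because the edge vectors are cached when the voice is built, the search at synthesis time only needs a cheap distance calculation for each candidate join rather than any signal processing.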
Waveform representation
Although unit selection is essentially the concatenation of pre-recorded waveform fragments, we may store those waveforms in terms of source-filter model parameters.
