We already have Festival utterance structures from our unit selection voice. From these, we need to derive frame-level labels, time-aligned to the waveform. The procedure is quite similar to forced alignment for the unit selection voice, with two differences: we now attach all the linguistic contextual information to the phones (making them “full context labels”), and we want a state-level alignment, not just a phone-level one. We’ll use HMMs with 5 emitting states to obtain this alignment.
Convert utterance structures to full context labels
We will 'flatten' each utterance structure into a linear sequence of context-dependent phone labels.
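The result is typically one label per line in HTS style: a start time, an end time, and a long context string. As a sketch only (the label line below is a made-up example, and the exact context features depend on your configuration), the quinphone at the start of each full context label can be pulled apart like this:

```python
import re

# Hypothetical example of one HTS-style full context label line
# (times in HTK 100 ns units, then the context-dependent phone name).
line = "3050000 3400000 x^sil-dh+ax=k@1_2/A:0_0_0/B:rest-of-context"

def parse_quinphone(label):
    """Extract the quinphone (two left phones, current phone, two right
    phones) from the start of a full context label."""
    m = re.match(r"([^^]+)\^([^-]+)-([^+]+)\+([^=]+)=([^@/]+)", label)
    if m is None:
        raise ValueError("not a full context label: " + label)
    return m.groups()

start, end, name = line.split(None, 2)
print(parse_quinphone(name))  # ('x', 'sil', 'dh', 'ax', 'k')
```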
Forced alignment
The technique is similar to that used for the unit selection voice, except that we now align full context labels and ask the aligner for state-level timestamps.
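A state-level alignment produces one label line per emitting HMM state rather than per phone. The real state boundaries come from the aligner itself; purely as an illustration of the label format (not of alignment), this sketch splits each phone's span uniformly across the 5 emitting states, numbered [2] to [6] in the usual HTK convention:

```python
def to_state_level(start, end, label, n_states=5):
    """Split a phone's [start, end) span into n_states segments and
    return HTK-style state-level label lines, label[2] .. label[6].
    A real forced alignment would give unequal state durations;
    equal splitting here is only a stand-in for illustration."""
    step = (end - start) // n_states
    lines = []
    for i in range(n_states):
        s = start + i * step
        e = end if i == n_states - 1 else s + step  # last state absorbs rounding
        lines.append("%d %d %s[%d]" % (s, e, label, i + 2))
    return lines

for row in to_state_level(0, 500000, "x^sil-dh+ax=k"):
    print(row)
```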
Convert label files to numerical values
State-level full context labels are converted to frame-level numerical features.
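One common scheme answers a set of binary questions about each label (in the spirit of an HTS question file) and appends positional features such as the state index and the frame's position within the state. A minimal sketch, with a made-up three-question set standing in for a full question file:

```python
import re

# Hypothetical question set: a name plus a regex applied to the full
# context label (a tiny stand-in for a real HTS-style question file).
QUESTIONS = [
    ("C-Vowel",   re.compile(r"-(aa|ae|ah|ax|eh|ih|iy|uw)\+")),
    ("C-Nasal",   re.compile(r"-(m|n|ng)\+")),
    ("C-Silence", re.compile(r"-(sil|pau)\+")),
]

def label_to_vector(label, state_index, frame_in_state, frames_in_state):
    """One binary value per question, then the state index and the
    frame's fractional position within the state."""
    vec = [1.0 if q.search(label) else 0.0 for _, q in QUESTIONS]
    vec.append(float(state_index))
    vec.append(frame_in_state / frames_in_state)
    return vec

print(label_to_vector("x^sil-aa+t=k", 2, 0, 10))
```

Each frame within a state gets the same binary answers but a different positional value, which is what makes the features frame-level rather than state-level.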