Prepare the input labels

The DNN needs frame-level numerical labels, so we must derive those from the Festival utterance structures.

We already have Festival utterance structures from our unit selection voice. From these, we need to derive frame-level labels that are time-aligned to the waveform. The procedure is quite similar to forced alignment for the unit selection voice, with two differences: we now attach all the linguistic contextual information to the phones (making them “full context labels”), and we want a state-level alignment, not just a phone-level one. We’ll use HMMs with 5 emitting states to obtain this alignment.

Convert utterance structures to full context labels
We will 'flatten' the utterance structures into a linear sequence of context-dependent phone labels.
Forced alignment
The technique is similar to that for the unit selection voice, except that we have full context labels now.
Convert label files to numerical values
State-level full context labels are converted to frame-level numerical features.
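
To make the last step more concrete, here is a minimal sketch of how state-level aligned segments might be expanded into one numerical vector per frame. It is not the code used in this exercise. It assumes HTK-style label times (units of 100 ns) and a 5 ms frame shift, and the function name, the tuple layout and the three features (a phone index, a state index and the relative position within the state) are hypothetical stand-ins for the much richer linguistic features a real system would use.

```python
from typing import List, Tuple

# HTK label files give times in units of 100 ns, so 1 ms = 10,000 units.
HTK_UNITS_PER_MS = 10_000

# Hypothetical segment layout: (start, end, phone_index, state_index)
Segment = Tuple[int, int, int, int]


def expand_states_to_frames(segments: List[Segment],
                            frame_shift_ms: int = 5) -> List[List[float]]:
    """Expand state-level segments into one feature vector per frame.

    Each output vector is [phone_index, state_index, position_in_state],
    where position_in_state rises to 1.0 at the last frame of the state.
    The 5 ms frame shift is an assumption, not taken from the exercise.
    """
    frame_shift = frame_shift_ms * HTK_UNITS_PER_MS
    frames: List[List[float]] = []
    for start, end, phone_index, state_index in segments:
        # Number of whole frames covered by this state.
        n_frames = (end - start) // frame_shift
        for i in range(n_frames):
            position = (i + 1) / n_frames
            frames.append([float(phone_index), float(state_index), position])
    return frames


# Two consecutive states of the same phone: 10 ms, then 15 ms.
example = [(0, 100_000, 3, 1), (100_000, 250_000, 3, 2)]
frames = expand_states_to_frames(example)
```

With these inputs the first state yields 2 frames and the second yields 3, so the whole example expands to 5 frame-level vectors. The key point is that the same state-level label is repeated for every frame it spans, with only the within-state position changing.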
