Prepare the input labels

The DNN needs frame-level numerical labels, so we must derive those from the Festival utterance structures.

We already have Festival utterance structures from our unit selection voice. From these, we need to derive frame-level labels that are time-aligned to the waveform. The procedure is quite similar to the forced alignment we did for the unit selection voice, with two differences: we now attach all the linguistic contextual information to each phone (making them “full context labels”), and we want a state-level alignment, not just a phone-level one. We’ll use HMMs with 5 emitting states to obtain this alignment.
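To make the step from a state-level alignment to frame-level labels concrete, here is a minimal sketch in Python. It is not the script used in this exercise. It assumes HTK-style label lines (start time, end time, label, with times in units of 100 ns), a 5 ms frame shift, and a state index appended to each label in square brackets, numbered 2 to 6 for the 5 emitting states. The function name and the toy labels are hypothetical; real full context labels are much longer than "sil".

```python
import re

# Assumption: 5 ms frame shift, expressed in HTK time units of 100 ns.
FRAME_SHIFT_HTK = 50000


def state_labels_to_frames(lines, frame_shift=FRAME_SHIFT_HTK):
    """Expand state-level alignment lines into one entry per frame.

    Each returned tuple is (label, state, frame_index_in_state, frames_in_state).
    """
    frames = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start, end, label = line.split(None, 2)
        # Number of whole frames covered by this state.
        n_frames = (int(end) - int(start)) // frame_shift
        # Assumed convention: the state index is a trailing "[n]" on the label.
        match = re.search(r"\[(\d+)\]$", label)
        state = int(match.group(1)) if match else None
        phone_label = label[:match.start()] if match else label
        for i in range(n_frames):
            frames.append((phone_label, state, i, n_frames))
    return frames


# Toy state-level alignment for the first three states of one phone.
example = [
    "0 100000 sil[2]",
    "100000 150000 sil[3]",
    "150000 300000 sil[4]",
]
frames = state_labels_to_frames(example)
for frame in frames:
    print(frame)
```

Each frame keeps the label of the phone it belongs to, plus its state and its position within that state. Positional information of this kind is what a state-level alignment provides and a phone-level alignment does not.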