This exercise assumes that you have already built your own unit selection voice, and therefore have all the data you need.
Tools required
Only needed if you are setting this exercise up on your own. My Edinburgh students can skip this step.
Prepare the input labels
The DNN needs frame-level numerical labels, so we must derive those from the Festival utterance structures.
Convert utterance structures to full context labels
We will 'flatten' the utterance structures onto a linear sequence of context-dependent phone labels.
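As a rough illustration of what flattening means, the sketch below builds only the phone-context part of a label, using the HTS-style quinphone pattern `ll^l-c+r=rr`. The function name, the padding symbol and the example phone sequence are assumptions made for illustration. Real full context labels carry many more features (syllable, word and phrase context) taken from the utterance structure.

```python
def quinphone_labels(phones, pad="x"):
    """Return one context-dependent label per phone.

    Each label encodes the two previous phones, the current phone and
    the two following phones. Positions beyond the utterance edges are
    filled with the (assumed) padding symbol.
    """
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2:i + 3]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}")
    return labels


# Hypothetical phone sequence for a short utterance.
labels = quinphone_labels(["sil", "h", "@", "l", "ou", "sil"])
for label in labels:
    print(label)
```

The output is a linear sequence with one label per phone, which is the form that the later alignment and conversion steps work on.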
Forced alignment
The procedure is similar to the one used for the unit selection voice, except that we now align full context labels.
Convert label files to numerical values
State-level full context labels are converted to frame-level numerical features.
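The expansion from state level to frame level can be sketched as follows. This is a minimal sketch, not the toolkit's actual implementation: it assumes each aligned state comes with start and end times in HTS units of 100 ns, a 5 ms frame shift, and a feature vector that has already been derived from the label. The function name and the extra within-state position feature are illustrative assumptions.

```python
FRAME_SHIFT = 50_000  # assumed 5 ms frame shift, in HTS time units of 100 ns


def states_to_frames(states, frame_shift=FRAME_SHIFT):
    """Repeat each state's feature vector once per frame.

    `states` is a list of (start, end, features) tuples. One extra
    value is appended to every frame: its relative position within
    the state, which runs from 1/n up to 1.0 for a state of n frames.
    """
    frames = []
    for start, end, features in states:
        n_frames = (end - start) // frame_shift
        for i in range(n_frames):
            position = (i + 1) / n_frames
            frames.append(list(features) + [position])
    return frames


# Two hypothetical states lasting 10 ms and 15 ms.
states = [
    (0, 100_000, [1.0, 0.0]),
    (100_000, 250_000, [0.0, 1.0]),
]
frames = states_to_frames(states)
print(len(frames))
```

Each state's features are simply copied for every frame it spans, so the number of output rows matches the number of acoustic frames.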
Faster, faster
Disk access can be a limiting factor, so reading data from a local disk is usually much faster than reading it from a network disk.
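One way to do this is to copy the data to local scratch space once, then point later steps at the copy. The sketch below uses only the Python standard library. The function name and all paths are hypothetical, and the demo uses temporary directories to stand in for the network and local disks.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_local(source_dir, local_root):
    """Copy source_dir under local_root (if not already there) and return the copy."""
    source_dir = Path(source_dir)
    target = Path(local_root) / source_dir.name
    if not target.exists():
        shutil.copytree(source_dir, target)
    return target


# Demo with throwaway directories standing in for network and local disk.
network = Path(tempfile.mkdtemp()) / "label_data"
network.mkdir()
(network / "utt1.lab").write_text("demo")

local_scratch = tempfile.mkdtemp()
staged = stage_to_local(network, local_scratch)
print(staged)
```

Because the copy is skipped when the target already exists, the function can be called at the start of every run without repeating the transfer.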