Because DNN synthesis evolved from HMM synthesis, it is common to use HTS format label files to represent context-dependent phonemes.
After running dumpfeats, we need to convert its output into HTS format label files (which are an extended version of HTK labels). The scripts/utts_to_mlfs.sh script is provided for running dumpfeats and then performing all the subsequent steps. Take a few moments to understand what it is doing, then run it.
The script makes both monophone and full-context label files. We’ll use the monophone labels for doing forced alignment, and after that we will transfer the timestamps over to the full-context labels.
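The timestamp transfer can be sketched in a few lines of Python. This is only an illustration, not the code used by the provided scripts: the function name transfer_timestamps is hypothetical, the aligned monophone labels are assumed to be phone-level lines of the form "start end phone", and the full-context labels are assumed to place the current phone between "-" and "+", with one label per line in the same order as the monophone labels.

```python
import re

# Assumption: in a full-context label the current phone sits between '-' and '+'.
CENTRE_PHONE = re.compile(r"-(.+?)\+")


def transfer_timestamps(aligned_mono_lines, full_context_lines):
    """Copy start/end times from aligned monophone labels onto full-context labels.

    Hypothetical helper for illustration only. Both inputs are lists of strings,
    one label per string, describing the same utterance in the same order.
    """
    # Each aligned monophone line is assumed to be "start end phone".
    mono = [line.split() for line in aligned_mono_lines if line.strip()]
    # Keep only the last field of each full-context line, so any existing
    # placeholder times in front of the label are discarded.
    full = [line.split()[-1] for line in full_context_lines if line.strip()]

    if len(mono) != len(full):
        raise ValueError(
            f"label count mismatch: {len(mono)} monophone vs {len(full)} full-context"
        )

    merged = []
    for (start, end, phone), label in zip(mono, full):
        # Sanity check: the two label sequences must describe the same phones.
        match = CENTRE_PHONE.search(label)
        if match is None or match.group(1) != phone:
            raise ValueError(f"phone {phone!r} does not match label {label!r}")
        merged.append(f"{start} {end} {label}")
    return merged
```

Calling transfer_timestamps with the lines of an aligned monophone file and the lines of the corresponding full-context file returns the full-context labels with the aligned times in front. The phone check is there because the transfer is only valid when both files contain exactly the same phone sequence.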
Can you explain why we cannot simply use the full-context labels for forced alignment?