Festival provides the initial phone sequence for forced alignment by running the text of your script through its front end. If you are using a different dictionary, remember to replace unilex-rpx wherever it appears.
Creating the initial labels
bash$ festival $MBDIR/scm/build_unitsel.scm ./my_lexicon.scm
festival>(make_initial_phone_labs "utts.data" "utts.mlf" 'unilex-rpx)
This creates the output file utts.mlf, an HTK master label file (MLF) containing the phonetic transcription of all the utterances; the labels are not yet time-aligned with the waveforms.
Tip: if you want to design your own script later, the above command is the easiest way to convert text into a phone sequence, so that you can measure the coverage.
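As a rough illustration of measuring coverage, the sketch below counts the distinct phones and diphones in an MLF. It assumes the unaligned MLF has the usual HTK layout: a #!MLF!# header, a quoted "*/name.lab" line opening each utterance, one phone label per line, and a line containing only a period closing the utterance. The function name mlf_coverage is hypothetical and is not part of Festival or HTK.

```shell
# Count distinct phone and diphone types in an unaligned HTK MLF.
# Assumed layout (not checked): "#!MLF!#" header, a quoted "*/name.lab"
# line per utterance, one label per line, "." ending each utterance.
mlf_coverage() {
    awk '
        index($0, "#!MLF!#") == 1 { next }            # file header
        /^"/                      { prev = ""; next } # new utterance starts
        $1 == "."                 { prev = ""; next } # utterance ends
        NF == 0                   { next }            # skip blank lines
        {
            phones[$1] = 1
            # a diphone is a pair of adjacent phones within one utterance
            if (prev != "") diphones[prev "-" $1] = 1
            prev = $1
        }
        END {
            np = 0; for (p in phones) np++
            nd = 0; for (d in diphones) nd++
            printf "phone types: %d\ndiphone types: %d\n", np, nd
        }
    ' "$1"
}
```

Calling mlf_coverage utts.mlf prints the two counts. Comparing the diphone count across candidate scripts gives a simple, if crude, basis for choosing between them.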
Forced alignment involves training HMMs, just as in automatic speech recognition. Therefore, the speech has to be parameterised. The features we will use are MFCCs.
Extracting MFCCs
bash$ make_mfccs alignment wav/*.wav
Doing the alignment
bash$ cd alignment
bash$ make_mfcc_list ../mfcc ../utts.data train.scp
bash$ do_alignment .
(Notice the space and the period after the last command!)
The do_alignment command will take a while to run (20 minutes or more), depending on the speed of the machine you are using and the amount of speech you recorded. Monitor it for the first 5 minutes or so to make sure there are no early problems.
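One early problem that can be ruled out before starting do_alignment is a missing feature file. The sketch below prints any path listed in train.scp that does not exist. It assumes train.scp is a plain HTK script file with one MFCC file path per line, resolved relative to the directory you run it from; the function name missing_scp_files is hypothetical.

```shell
# Print every path listed in an HTK script file (.scp) that does not exist.
# Assumes one plain file path per line (no "logical=physical" entries).
missing_scp_files() {
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue       # skip blank lines
        [ -e "$path" ] || echo "$path"   # report paths with no file behind them
    done < "$1"
}
```

Run missing_scp_files train.scp from inside the alignment directory; no output means every listed file was found.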
Once the alignment has completed, you need to split the resulting MLF – which will now contain the correct time alignments for the labels – into individual label files that Festival can use.
Splitting the MLF file
bash$ cd ..
bash$ mkdir lab
bash$ break_mlf alignment/aligned.3.mlf lab
You can examine the label files at this point, but be careful not to change anything.
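A quick, read-only check at this stage is to confirm that every utterance has a label file. The sketch below makes two assumptions that are not stated above: that each line of utts.data has the form ( utt_id "text" ), and that the label files are named utt_id.lab. The function name missing_labels is hypothetical.

```shell
# List utterance IDs from utts.data that have no label file.
# Assumes lines like: ( utt_id "text of the utterance" )
# and label files named <utt_id>.lab (both are assumptions).
missing_labels() {
    # $1 = utts.data, $2 = directory holding the label files
    sed -n 's/^[[:space:]]*([[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" |
    while read -r id; do
        [ -n "$id" ] || continue
        [ -f "$2/$id.lab" ] || echo "$id"
    done
}
```

Run missing_labels utts.data lab from the directory containing utts.data. It only reads files, so it cannot alter the labels.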
Optional variations
Skip this part during your first voice build, and come back later, when you are ready to create variations on the basic voice.
Modify the do_alignment script
You can modify the do_alignment script; any changes will affect the quality of the forced alignment.