At this point you have made your utts.mlf file as well as the individual monophone and full-context label files. Although we could have recovered alignments from the unit selection voice, we won’t actually be using those here: we will re-align everything using HMMs with 5 emitting states (to provide finer sub-phonetic timing information to the DNN) and a different frame rate (to match that of the DNN).
Extracting MFCCs
This is exactly as for the unit selection voice, except that we’ll use a different configuration for HCopy that specifies a 5msec frame shift (instead of the 2msec shift used in the Multisyn voice build).
To do that, make sure your alignment/resources/CONFIG_for_coding sets TARGETRATE=50000.0.
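HTK expresses times in units of 100 ns, which is why a 5 msec frame shift is written as 50000.0. As a quick check, a sketch like the following converts milliseconds to HTK units and prints the value actually set in a config file. The helper names ms_to_htk_units and get_targetrate are made up for this illustration, and the parsing assumes a simple NAME = VALUE line.

```shell
# HTK times are in units of 100 ns, so 1 msec = 10000 units.
ms_to_htk_units() {
    echo $(( $1 * 10000 ))
}

# Hypothetical helper: print the TARGETRATE value found in an HTK config file.
# Assumes the setting is written on one line as TARGETRATE = VALUE.
get_targetrate() {
    awk -F= '/^[ \t]*TARGETRATE/ { v = $2; gsub(/[ \t]/, "", v); print v }' "$1"
}

echo "5 msec frame shift = $(ms_to_htk_units 5) HTK units"

# Only inspect the config if it exists relative to the current directory
config=alignment/resources/CONFIG_for_coding
if [ -f "$config" ]; then
    echo "TARGETRATE in $config: $(get_targetrate "$config")"
fi
```

If the printed value is not 50000.0, the MFCCs will not be extracted at the frame rate that the DNN expects.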
Make sure your scripts/make_mfccs includes the option -F 16000 to ch_wave so that the waveforms are downsampled before the MFCCs are extracted. Now run the script:
bash$ ./scripts/make_mfccs alignment data/wav/*.wav
Doing the alignment
First, make a list of all the MFCC files, then run the alignment script
bash$ cd alignment
bash$ find `pwd`/mfcc -name '*.mfcc' | sort > train.scp
bash$ ../scripts/do_alignment_dnn .
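Before starting the alignment, it can be worth confirming that train.scp lists the files you expect. The sketch below defines a hypothetical helper, check_scp (not part of the provided scripts), which reports any listed file that is missing or empty.

```shell
# Hypothetical helper: verify that every file listed in an scp file
# exists and is non-empty. Returns non-zero if any file fails the check.
check_scp() {
    scp_file="$1"
    total=0
    missing=0
    while IFS= read -r f; do
        total=$((total + 1))
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done < "$scp_file"
    echo "$total files listed, $missing missing or empty"
    [ "$missing" -eq 0 ]
}

# Run from inside the alignment directory, once train.scp has been made
if [ -f train.scp ]; then
    check_scp train.scp
fi
```

The number of files listed should match the number of waveforms you passed to make_mfccs.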
Transfer timestamps from monophones to full-context label files
At this point we have an MLF containing state-level alignments for monophone models (with 5 emitting states). We need to transfer the start and end times over to the full context label files that we made earlier. The scripts/transfer_times_to_full_context_labs.sh script does this for you. Read it and understand what it does, then run it:
bash$ cd ..
bash$ scripts/transfer_times_to_full_context_labs.sh
Sanity check the resulting full-context label files. They should look something like this:
0 50000 xx~xx-#+#=#:xx_xx/A/0_0_0/B/xx-xx-xx:xx-xx&xx-xx#xx-xx$xx-xx>xx-xxxx-xx xx-xx xx-xx xx-xx xx-xx 0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2[2]
#~#-oo+th=@:1_1/A/0_0_0/B/1-1-1:1-2&1-7#1-4$1-3>0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2
8650000 9550000 #~#-oo+th=@:1_1/A/0_0_0/B/1-1-1:1-2&1-7#1-4$1-3>0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2[3]
...etc
That's not a very human-friendly format, but you need to understand it. Go back and take another look at the scripts/utts_to_mlfs.sh script and examine each stage in the processing. At the end of the 'for' loop the intermediate files are deleted. Comment that line out so that you can inspect those files. Look at the output of dumpfeats and then try to understand how that is transformed by the awk scripts into the HTS format label files.
Tip: make the script run on a single file whilst you are trying to understand how it works.
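Part of the sanity check on the time-aligned label files can be automated. The timestamps are in HTK's 100 ns units, so with a 5 msec frame shift every start and end time should be a multiple of 50000, and each line should start where the previous one ended. The following is a rough sketch only: check_label_times is a hypothetical helper, not one of the provided scripts, and it assumes that each aligned line begins with a start time and an end time and that the alignment covers the utterance without gaps.

```shell
# Hypothetical helper: check the timestamps in one time-aligned label file.
# Usage: check_label_times LABEL_FILE [FRAME_SHIFT_IN_100NS_UNITS]
# Returns non-zero if any problem is found.
check_label_times() {
    awk -v frame="${2:-50000}" '
        # Only consider lines that begin with two integer timestamps
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            s = $1 + 0
            e = $2 + 0
            if (e <= s) {
                printf "%s:%d: end time is not after start time\n", FILENAME, FNR
                bad++
            }
            if (s % frame != 0 || e % frame != 0) {
                printf "%s:%d: time is not a multiple of the frame shift\n", FILENAME, FNR
                bad++
            }
            if (seen && s != prev_end) {
                printf "%s:%d: start time does not match previous end time\n", FILENAME, FNR
                bad++
            }
            prev_end = e
            seen = 1
        }
        END { exit (bad > 0) }
    ' "$1"
}

# Example (substitute the path to one of your own label files):
# check_label_times path/to/one_label_file && echo "timestamps look consistent"
```

A file that passes this check can still contain the wrong labels, so it complements reading a few files by eye rather than replacing it.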