Forced alignment

The technique is similar to that for the unit selection voice, except that we have full context labels now.

At this point you have made your utts.mlf file as well as the individual monophone and full-context label files. Although we could have recovered alignments from the unit selection voice, we won’t actually be using those here: we will re-align everything using HMMs with 5 emitting states (to provide finer sub-phonetic timing information to the DNN) and a different frame rate (to match that of the DNN).

Extracting MFCCs

This is exactly as for the unit selection voice, except that we’ll use a different configuration for HCopy that specifies a 5msec frame shift (instead of the 2msec shift used in the Multisyn voice build).

To do that, make sure your alignment/resources/CONFIG_for_coding sets TARGETRATE=50000.0. HTK measures time in units of 100 ns, so this value corresponds to a 5 ms frame shift.
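If you want to confirm the setting rather than read the file by eye, the conversion from HTK time units to milliseconds can be sketched as a small shell function. This is only an illustration: the function name frame_shift_ms is made up for this example, and it assumes the config file uses the usual HTK "NAME = value" layout with "#" for comments.

```shell
# Print the frame shift in milliseconds implied by TARGETRATE in an HTK config file.
# HTK times are in units of 100 ns, so milliseconds = TARGETRATE / 10000.
# frame_shift_ms is a hypothetical helper name, not part of HTK or the course scripts.
frame_shift_ms() {
    # $1: path to the HTK config file
    awk -F'=' '
        # Ignore comment lines
        /^[[:space:]]*#/ { next }
        # Match the TARGETRATE setting, with or without spaces around "="
        $1 ~ /^[[:space:]]*TARGETRATE[[:space:]]*$/ {
            gsub(/[[:space:]]/, "", $2)
            printf "%g\n", $2 / 10000
            found = 1
        }
        # Fail if the file never sets TARGETRATE
        END { if (!found) exit 1 }
    ' "$1"
}
```

After pasting the function into your shell, running frame_shift_ms alignment/resources/CONFIG_for_coding should print 5 when TARGETRATE is 50000.0.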

Make sure your scripts/make_mfccs includes the option -F 16000 to ch_wave so that the waveforms are downsampled to 16 kHz before the MFCCs are extracted. Now run the script:

bash$ ./scripts/make_mfccs alignment data/wav/*.wav

Doing the alignment

First, make a list of all the MFCC files, then run the alignment script:

bash$ cd alignment
bash$ find `pwd`/mfcc -name '*.mfcc' | sort > train.scp
bash$ ../scripts/do_alignment_dnn .
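Before running the alignment script, it can be worth checking that every file listed in train.scp actually exists and is not empty, since a failed MFCC extraction is easier to spot here than in the alignment logs. The following is a minimal sketch: check_scp is a hypothetical helper name, and it assumes the list contains one file path per line, as produced by the find command above.

```shell
# Count the paths listed in a script (.scp) file that are missing or empty.
# check_scp is a hypothetical helper name, not part of the course scripts.
check_scp() {
    # $1: path to the .scp file, one feature file path per line
    local missing=0 path
    while IFS= read -r path; do
        # Skip blank lines
        [ -z "$path" ] && continue
        # -s is true only for files that exist and have a size greater than zero
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$1"
    # Print the number of problem files; succeed only if there are none
    echo "$missing"
    [ "$missing" -eq 0 ]
}
```

From inside the alignment directory, check_scp train.scp prints the number of problem files and lists each one on standard error.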

Transfer timestamps from monophones to full-context label files

At this point we have an MLF containing state-level alignments for monophone models (with 5 emitting states). We need to transfer the start and end times over to the full-context label files that we made earlier. The scripts/transfer_times_to_full_context_labs.sh script does this for you. Read it and understand what it does, then run it:

bash$ cd ..
bash$ scripts/transfer_times_to_full_context_labs.sh

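Sanity check the resulting full-context label files. Each line should give a start time and an end time (in HTK units of 100 ns), followed by a full-context label ending in a state index in square brackets. They should look something like this: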

0 50000 xx~xx-#+#=#:xx_xx/A/0_0_0/B/xx-xx-xx:xx-xx&xx-xx#xx-xx$xx-xx>xx-xxxx-xxxx-xxxx-xxxx-xxxx-xx0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2[2] #~#-oo+th=@:1_1/A/0_0_0/B/1-1-1:1-2&1-7#1-4$1-3>0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2
8650000 9550000 #~#-oo+th=@:1_1/A/0_0_0/B/1-1-1:1-2&1-7#1-4$1-3>0-2<0-4|oo/C/0+0+2/D/0_0/E/content+2:1+5&1+2#0+3/F/in_1/G/0_0/H/7=5:1=2&L-L%/I/7_3/J/14+8-2[3]
...etc
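Part of this sanity check can be automated. The sketch below checks only the timing columns, under assumptions that you should verify against your own files: each line starts with a start time and an end time followed by the label, times are multiples of the 5 ms frame shift (50000 in HTK units), and each line starts where the previous one ended. The function name check_lab is made up for this example.

```shell
# Check the timing columns of an aligned label file.
# Assumed line format: "start end label ...", with times in HTK units (100 ns).
# check_lab is a hypothetical helper name, not part of the course scripts.
check_lab() {
    # $1: label file; $2: frame shift in HTK units (default 50000 = 5 ms)
    awk -v step="${2:-50000}" '
        # Every line needs at least a start time, an end time and a label
        NF < 3 { printf "line %d: expected start, end and label\n", NR; bad++; next }
        # Each segment must have positive duration
        $1 >= $2 { printf "line %d: start is not before end\n", NR; bad++ }
        # Times should fall on frame boundaries
        $1 % step || $2 % step { printf "line %d: time is not a multiple of %d\n", NR, step; bad++ }
        # Consecutive segments should be contiguous
        have_prev && $1 != prev_end { printf "line %d: gap or overlap with previous line\n", NR; bad++ }
        { prev_end = $2; have_prev = 1 }
        # Exit status 1 if any problem was reported, 0 otherwise
        END { exit (bad > 0) }
    ' "$1"
}
```

Run it on one label file at a time. It prints nothing and returns success when the timings look consistent, and it does not inspect the context features themselves, so you still need to read those by eye.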

That's not a very human-friendly format, but you need to understand it. Go back and take another look at the scripts/utts_to_mlfs.sh script and examine each stage in the processing. At the end of the 'for' loop the intermediate files are deleted. Comment that line out so that you can inspect those files. Look at the output of dumpfeats and then try to understand how that is transformed by the awk scripts into the HTS format label files.

Tip: make the script run on a single file whilst you are trying to understand how it works.
