Prepare the recordings

Move your recordings into the workspace, convert the waveforms to the right format, and do some sanity checking.

Studio recordings use a high sampling rate (48 kHz or 96 kHz); this is unnecessarily high for the purposes of this exercise. 16 kHz will suffice and will make the files smaller and more manageable. Keep backups of the original recordings somewhere safe, in case you make a mistake.

Start by copying the original waveforms from the studio into the ‘recordings’ folder.

Choose amongst the multiple takes

In the studio, you probably made multiple attempts at a few of the more difficult sentences. It’s likely that the last take is the one you want (your engineer will have kept notes to help you), so you can move the other takes to somewhere other than the ‘recordings’ folder.

Sanity check

Before proceeding, make sure that you have exactly one wav file per line in your utts.data file. Listen to all the files (Mac tip: use the Finder, navigate with the cursor keys, and use the spacebar to play each file). If you find any mismatches with the text (e.g., a substituted word), the expedient solution is to edit the text rather than re-record the speech. Also make sure that the file naming exactly matches utts.data.
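The one-file-per-line check can be automated. The following is only a minimal sketch under stated assumptions: it assumes each line of utts.data has the form ( arctic_a0001 "Some text." ), so that the utterance identifier is the second whitespace-separated field, and the function name check_recordings is made up for illustration.

```shell
#!/bin/bash
# Sketch: compare utterance identifiers in utts.data with wav basenames.
# ASSUMPTION: each utts.data line looks like: ( arctic_a0001 "Some text." )
# so the identifier is the second whitespace-separated field.
check_recordings() {
    local utts="$1" wavdir="$2" status=0 open id rest f base
    # Every identifier in utts.data should have a matching wav file.
    while read -r open id rest || [ -n "$id" ]; do
        [ -n "$id" ] || continue
        if [ ! -f "$wavdir/$id.wav" ]; then
            echo "missing: $id"
            status=1
        fi
    done < "$utts"
    # Every wav file should correspond to an identifier in utts.data.
    for f in "$wavdir"/*.wav; do
        [ -e "$f" ] || continue
        base=$(basename "$f" .wav)
        if ! awk -v id="$base" '$2 == id { found = 1 } END { exit !found }' "$utts"; then
            echo "extra: $base"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_recordings utts.data recordings prints one line per missing or unexpected file and returns a non-zero status if it finds any. It only compares names, so it is no substitute for listening to the files.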

The SpeechRecorder tool adds a suffix to the file basenames to indicate the ‘take’. You need to remove this so that the basenames exactly match the utterance identifiers in utts.data. Write a script to remove these suffixes (noting that the suffix might vary: “_1”, “_2”, etc. depending on which take you selected for each prompt).
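One possible shape for such a script is sketched below. It assumes that the take suffix is always an underscore followed only by digits, that no utterance identifier itself ends that way, and that the files have a .wav extension; the function name strip_take_suffixes is hypothetical. Try it on a copy of your data first.

```shell
#!/bin/bash
# Sketch: strip the take suffix ("_1", "_2", ...) from wav basenames.
# ASSUMPTION: the suffix is an underscore followed only by digits, and
# no utterance identifier itself ends in that pattern.
strip_take_suffixes() {
    local dir="$1" f base stem suffix
    for f in "$dir"/*.wav; do
        [ -e "$f" ] || continue
        base=$(basename "$f" .wav)
        stem=${base%_*}       # everything before the last underscore
        suffix=${base##*_}    # everything after the last underscore
        # Skip names with no underscore or a non-numeric final part.
        [ "$stem" != "$base" ] || continue
        case "$suffix" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ -e "$dir/$stem.wav" ]; then
            # Another file already has this name; do not overwrite it.
            echo "skipping $f: $dir/$stem.wav already exists" >&2
        else
            mv -- "$f" "$dir/$stem.wav"
        fi
    done
}
```

Calling strip_take_suffixes recordings would rename recordings/arctic_a0001_2.wav to recordings/arctic_a0001.wav. If two takes of the same prompt are still in the folder, the second one is left untouched and reported, so that nothing is silently overwritten.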

In general, you should not re-record any utterances. A few missing utterances are not a major problem for the purposes of this exercise.

Downsample

Here’s how to downsample a single file, and save it in the required RIFF format:

bash$ ch_wave -otype riff -F 16000 -o wav/arctic_a0001.wav recordings/arctic_a0001.wav

You will need to write a short shell script that does this for every file in your ‘recordings’ folder. If you happened to record your data at 24 bits instead of 16 bits, you’ll need to use sox to change the bit depth, and you can downsample at the same time using sox instead of ch_wave:

bash$ sox recordings/arctic_a0001.wav -b16 -r 16k wav/arctic_a0001.wav
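The loop over all files might look something like the sketch below. It wraps the single-file ch_wave command from above in a function; the function name downsample_all is made up for illustration, and the sketch assumes that the input files end in .wav and that ch_wave is on your PATH.

```shell
#!/bin/bash
# Sketch: downsample every wav file in one folder into another folder.
# ASSUMPTION: input files end in .wav and ch_wave is on the PATH.
downsample_all() {
    local indir="$1" outdir="$2" f
    mkdir -p "$outdir"
    for f in "$indir"/*.wav; do
        [ -e "$f" ] || continue
        # Same ch_wave options as the single-file command.
        ch_wave -otype riff -F 16000 -o "$outdir/$(basename "$f")" "$f" \
            || echo "failed: $f" >&2
    done
}
```

Calling downsample_all recordings wav processes the whole folder and reports any file that ch_wave could not convert. For 24-bit recordings, the ch_wave line could be replaced by the sox command in the same loop.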

Now listen to a few of the files after downsampling, to check everything worked correctly.

  • Endpointing

    If you have excessive silences at the start or end of many of your recordings, you might want to endpoint them. Only try this if your forced alignment does not give good results.
