Studio recordings use a high sampling rate, perhaps as high as 96 kHz, which is unnecessarily high for the purposes of this exercise. Your recordings might also be at a high bit depth of 24 bits.
Before starting any processing, keep a backup of the original recordings somewhere safe, in case you make a mistake.
Place a copy of the recordings from the studio somewhere you can listen to them conveniently (e.g., your own laptop or a PPLS lab computer) and where you have the necessary tools available (e.g., sox).
Choose amongst the multiple takes
In the studio, you probably made multiple attempts at a few of the more difficult sentences. The last take is usually the one you want (your engineer will have kept notes to help you), so you can simply delete all earlier takes of each sentence. If necessary, listen to the multiple takes and select the best one. Delete all unwanted takes.
The SpeechRecorder tool adds a suffix to the file basenames to indicate the ‘take’. You need to remove this so that the basenames exactly match the utterance identifiers in utts.data. Write a shell script to remove these suffixes (noting that the suffix might vary: “_1”, “_2”, etc. depending on which take you selected for each prompt).
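A minimal sketch of such a script, assuming the selected takes live in a `recordings/` directory and the take suffix is a single digit (adjust the path and pattern to match your setup):

```shell
#!/bin/sh
# Strip the take suffix (_1, _2, ...) from each file so that basenames
# match the utterance identifiers in utts.data.
for f in recordings/*_[0-9].wav; do
  [ -e "$f" ] || continue        # skip if the glob matched nothing
  mv "$f" "${f%_[0-9].wav}.wav"  # arctic_a0001_2.wav -> arctic_a0001.wav
done
```

The `${f%pattern}` parameter expansion removes the shortest matching suffix, so only the trailing take number is stripped.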
Check
At this stage, you should have one wav file per line in your utts.data file. Listen to all the files (Mac tip: use the Finder, navigate with the cursor keys, and use the spacebar to play each file). If you find any mismatches with the text (e.g., a substituted word), then an expedient solution is to edit the text (and not to re-record the speech – do not be a perfectionist!). Ensure that the file naming exactly matches utts.data.
In general, do not re-record any utterances. A few missing utterances is not a major problem. Don’t be a perfectionist!
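One way to check the naming automatically is to compare the two lists (a sketch, assuming utts.data is in the usual Festival format, where each line looks like `( arctic_a0001 "..." )`, and the wav files sit in a `wav/` directory):

```shell
#!/bin/sh
# List utterance ids that appear in utts.data but have no wav file,
# and wav files that have no line in utts.data.
awk '{print $2}' utts.data | sort > /tmp/ids.txt
ls wav/*.wav | xargs -n1 basename | sed 's/\.wav$//' | sort > /tmp/wavs.txt
# diff prints nothing when the two lists agree exactly
diff /tmp/ids.txt /tmp/wavs.txt
```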
Downsample
Write a little shell script that downsamples all your recordings to 22.05 kHz and (if necessary) reduces bit depth to 16 bits. Here is how to do that with sox, for a single file:
bash$ sox recordings/arctic_a0001.wav -b16 -r 22050 wav/arctic_a0001.wav
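Looping that command over a whole directory might look like this (a sketch, assuming the originals are in `recordings/` and the output goes to `wav/`; requires sox to be installed):

```shell
#!/bin/sh
# Downsample every recording to 22.05 kHz, 16-bit, into wav/.
mkdir -p wav
for f in recordings/*.wav; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  sox "$f" -b 16 -r 22050 "wav/$(basename "$f")"
done
```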
Listen to a few files after downsampling, to check everything worked correctly.
Create a dataset
All you need to do now is create a dataset from your recordings. This simply comprises all the wav files and the utts.data file – copy these to ECDF so you can train a model on them.