Collect and label the data (optional)

Supervised machine learning needs labelled data. The task of collecting and labelling this data is often overlooked in textbooks. Performing this step yourself is OPTIONAL, but you still need to understand the process.

This year (2023-24), the data collection step is OPTIONAL, to reduce workload. You still need to understand how the data are created, because you will need to describe this in your report. If you decide to create your own test set for certain experiments, come back to this section later.

You need data to train the HMMs, and some test data to measure the performance. Be very careful to keep these two data sets separate! Record the ten digits, and make sure you pronounce “0” as “zero”. Please record 7 repetitions of each digit for training, and a further 3 repetitions of each digit for testing.

Record all the training data into one long audio file, and all the test data into another. You should randomise the order in which you record the data (the reasons for doing this with the test data will become clear in the connected digits part, later). Remember to say the digits as isolated words with silence in between. I find it’s easier to write down some prompts on paper before recording.
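One way to produce the randomised prompts is sketched below. This is only an illustration: the function name make_prompts is made up for this example, and the shuffle relies on awk's clock-based seed, so two runs within the same second give the same order.

```shell
# Print a shuffled prompt list: each of the ten digits repeated $1 times.
# The shuffle prefixes every line with a random key, sorts on that key,
# then removes the key again.
make_prompts() {
    reps=$1
    for digit in zero one two three four five six seven eight nine; do
        i=0
        while [ "$i" -lt "$reps" ]; do
            echo "$digit"
            i=$((i + 1))
        done
    done |
        awk 'BEGIN { srand() } { printf "%.6f\t%s\n", rand(), $0 }' |
        sort -n |
        cut -f2-
}

# Example use (output file names are hypothetical):
#   make_prompts 7 > train_prompts.txt
#   make_prompts 3 > test_prompts.txt
```

Keep the prompt lists you read from: the test list is a useful cross-check when you later write down the transcription of the test data.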

Make a note of the device you used and edit the make_mfccs script to include this information. You should take a look at the existing info.txt file and use compatible notation:

  1. username
  2. gender (m or f)
  3. mobile phone make and model written in lowercase without spaces (e.g., apple_iphone_8, or samsung_galaxy_a10)
  4. your accent (native speakers should specify their accent, non-natives should specify “NN” plus their native country)

For example:

INFO=${USER}" m apple_iphone_8 NN France"

This allows your data to be combined with other people’s later on in the practical.

Your recording will probably be in mp3 format, so you will need to convert it to wav format before proceeding. The program sox can do this – ask a tutor for help in a lab session. Name the output files something like s1234567_train.wav and s1234567_test.wav.
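As a rough sketch, the conversion can be a single sox call per file, because sox works out the input and output formats from the file extensions. The wrapper name to_wav and the input file names below are invented for this example, and the sketch assumes your sox installation was built with mp3 support (if not, it will report an error for the mp3 file).

```shell
# Convert one recording to wav format.
# $1 = input recording (e.g. an mp3 file), $2 = output wav file
to_wav() {
    sox "$1" "$2"
}

# Example use (the input file names are hypothetical):
#   to_wav my_training_recording.mp3 s1234567_train.wav
#   to_wav my_test_recording.mp3     s1234567_test.wav
```

Running soxi on the resulting wav file prints its sampling rate, number of channels and duration, which is a quick way to check that the conversion worked.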

Label your training data

You need to label the training data. Since the training data are in a single long wav file, there will be a single label file to go with it. There will be labels at the start and end of every recorded word. You don’t need to be very precise with these labels, but do be careful to include the entire word. It’s OK to include a little bit of preceding/following silence. It’s not OK to cut off the start or end of the word.

The video demonstrates the following steps:

  1. Start wavesurfer and load the file you wish to label (e.g., s1234567_train.wav) and select transcription as the display method. Create a waveform pane.
  2. You now have to label the speech. The convention is that labels are placed at the end of the corresponding region of speech. So, end each word with a label naming the digit (use lowercase spelling: zero, one, two, …), and label anything else, including silence, as junk. You must label the silence because these labels are used to determine the start times of the digit tokens. So, make sure that there is a junk label at the start of each word (the very first label in the file will therefore be junk).
  3. Right-click in the labelling window and select properties. In the window that opens, change the label file format to WAVES and click ok.
  4. Right-click in the labelling window again and select save transcription as to save your labels into a file called <username>_train.lab in the lab directory. During labelling, use save all transcriptions to save your work regularly.

NOTE: You can do this labelling in Praat instead, but you’ll have to convert the TextGrid to xwaves .lab format and also extract the individual wav files of the digits. See the .lab files in the assignment data.
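Before moving on, it can help to check a finished label file automatically. The sketch below is not part of the assignment scripts. It ASSUMES that the label file has a header ending in a line containing only "#", and that every following line has three whitespace-separated fields (end time, a colour code, the label); compare this with the .lab files in the assignment data before relying on it. The function name check_labels is made up.

```shell
# Check a label file against the labelling rules:
#   - every label is "junk" or a lowercase digit name
#   - every digit label is immediately preceded by a junk label
# Prints how many times each digit was labelled; exit status is 1 on any error.
# ASSUMPTION: labels start after a line containing only "#", one per line,
# as "end_time colour label".
check_labels() {
    awk '
        BEGIN {
            n_words = split("zero one two three four five six seven eight nine", word, " ")
            for (k = 1; k <= n_words; k++) is_digit[word[k]] = 1
            bad = 0
        }
        !in_body { if ($1 == "#") in_body = 1; next }   # skip the header
        NF < 3   { next }                               # skip blank or short lines
        {
            label = $3
            if (label != "junk" && !(label in is_digit)) {
                print "unknown label: " label; bad = 1
            }
            if (label != "junk" && prev != "junk") {
                print "no junk label before: " label " (ends at " $1 ")"; bad = 1
            }
            if (label in is_digit) count[label]++
            prev = label
        }
        END {
            for (k = 1; k <= n_words; k++) printf "%s %d\n", word[k], count[word[k]]
            exit bad
        }
    ' "$1"
}

# Example use:
#   check_labels lab/s1234567_train.lab
```

For the training data recorded as described above, each digit should be reported 7 times.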

Prepare your test data

We need isolated digit test data (each digit in a separate file) and can use Wavesurfer to split the waveform file into individual files, based on labels.

  1. Use wavesurfer to create labels as before. However, do not label the speech with the digit being spoken but instead use the labels <username>_test01, <username>_test02, <username>_test03 and so on. Place one of these labels at the end of each spoken digit and place a label called junk at the start of every digit. As for the training data, leave a small amount of silence around each spoken digit.
  2. To split the file into individual files, right-click and select split sound on labels.
  3. After this, you should have lots of files called <username>_test01.wav, <username>_test02.wav, <username>_test03.wav and so on, in the wav directory. You will also have one or more files with junk in their names. You don’t need those and can delete them.

Some versions of Wavesurfer create a subdirectory containing the split files, and add a numerical prefix to them. If that happens, rename your files (e.g., manually, or using a script if you know how) so they are called something like s1234567_test01.wav and move them up to the wav directory.
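If you prefer a script for the renaming, something along the following lines may work. It is a sketch under assumptions: the exact prefix and subdirectory name depend on your version of Wavesurfer, so the code only assumes that each wanted file name ends in <username>_test, some digits, and .wav. The function name tidy_split_files and the directory name in the example are hypothetical.

```shell
# Move split test files into the destination directory, removing whatever
# precedes "<username>_testNN.wav" in each file name.
# $1 = subdirectory created by Wavesurfer, $2 = username, $3 = destination
# ASSUMPTION: wanted files end in "<username>_test" + digits + ".wav".
# Files with junk in their names do not match and are left where they are.
tidy_split_files() {
    src=$1
    user=$2
    dest=$3
    for f in "$src"/*"${user}"_test[0-9]*.wav; do
        [ -e "$f" ] || continue                  # the pattern matched nothing
        base=${f##*/}                            # file name without directory
        suffix=${base##*"${user}"_test}          # e.g. "01.wav"
        mv -i "$f" "$dest/${user}_test${suffix}" # -i asks before overwriting
    done
}

# Example use (the subdirectory name is hypothetical):
#   tidy_split_files wav/split_output s1234567 wav
```

Afterwards, list the wav directory and check that you have one file per recorded test digit (30 if you recorded 3 repetitions of each digit).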

Listen to some or all of the files to check they are OK.

Create a master label file for the test data

Although all our test data are now in individual files, for convenience we can use a single MLF (master label file) for the test data. This label file contains the correct transcription of the test data, which we will need when we come to measure the accuracy of the recogniser. You will need to listen to the test wav files in order to write this label file.

Use your favourite plain text editor, such as Atom, but not a word processor, to create this file.

It needs to look something like this (note the magic first line, and the full stop (period) marking the end of the labels for each file; make sure there is a newline after the final full stop in the file):

#!MLF!#
"*/s1234567_test01.lab"
zero
.
"*/s1234567_test02.lab"
nine
.

Save the file as lab/<username>_test.mlf

The format of the MLF file is as follows: The first line is a header: #!MLF!#. Following this are a series of entries of filenames and the corresponding labels. "*/s1234567_test01.lab" is the filename: make sure you include the quote marks and the * because these are important. After the filename follows a list of labels (there is only one label per file in our case) and then a line with just a full stop, followed by a newline, which denotes the end of the labels for that file.
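Typing the MLF by hand is fine, but it is easy to forget a quote mark or a full stop. As an alternative, the following sketch prints an MLF in the format just described, given your username and the words in file order (test01, test02, …). The function name make_mlf is invented for this example, and you still need to listen to the files to be sure of the word order you pass in.

```shell
# Print an MLF for the test files.
# $1 = username; remaining arguments = the word spoken in test01, test02, ...
make_mlf() {
    user=$1
    shift
    echo '#!MLF!#'                      # the required header line
    n=1
    for word in "$@"; do
        # quoted file name pattern, the label, then a full stop on its own line
        printf '"*/%s_test%02d.lab"\n%s\n.\n' "$user" "$n" "$word"
        n=$((n + 1))
    done
}

# Example use:
#   make_mlf s1234567 zero nine > lab/s1234567_test.mlf
```

Because printf ends every entry with a newline, the output finishes with the newline after the final full stop that the format needs.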