Data

You can use data from other students in the class, and from previous years, to conduct your experiments.

Data from over 400 students is available. You can find it on the PPLS AT Lab servers in the directory: /Volumes/Network/courses/sp/data

In that directory, you will find a file called info.txt that describes the available data.

If you record your own data this year (which was optional), we can also (eventually) add that to the collection, and to info.txt. But you don’t need to wait: all the data from previous years is already available.

The format of info.txt is quite simple. If you already have some coding experience, you’ll probably want to parse this file automatically and pull out subsets of speakers with the characteristics you want for each experiment. If you don’t have coding experience, try doing this in a spreadsheet.
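
Selecting a subset of speakers could be sketched in Python roughly as follows. The real layout of info.txt is described in the file itself, so everything about the format here is an assumption: the sketch supposes one speaker per line with whitespace-separated fields. The column names (speaker, gender, accent), the example records, and the helper names parse_info and select_speakers are all hypothetical. Adjust them to match what you actually find in info.txt.

```python
def parse_info(lines, columns):
    """Parse whitespace-separated records into a list of dicts.

    `columns` names the fields in order. This layout is an assumption,
    so check it against the real info.txt before relying on it.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip lines that do not fit the assumed layout
        records.append(dict(zip(columns, fields)))
    return records


def select_speakers(records, **wanted):
    """Return speaker IDs whose fields match every key=value in `wanted`."""
    return [
        r["speaker"]
        for r in records
        if all(r.get(key) == value for key, value in wanted.items())
    ]


# Made-up example data, not taken from the real info.txt.
example_lines = [
    "# speaker gender accent",
    "s0001 f scottish",
    "s0002 m american",
    "s0003 f american",
]
COLUMNS = ["speaker", "gender", "accent"]  # hypothetical column names
records = parse_info(example_lines, COLUMNS)
female_speakers = select_speakers(records, gender="f")
print(female_speakers)
```

With the real file, you would pass the lines of info.txt (read from the data directory) in place of example_lines, and replace COLUMNS with the fields the file actually contains.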

The data are not perfect and you might want to think about how to detect bad data, so you can exclude those speakers from your experiments.
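
One very basic check, offered only as a starting point, is to flag files that are missing or empty. The sketch below assumes you already have a list of file paths for a speaker. The function name find_suspect_files and the size threshold are hypothetical, and other kinds of problem in the data would need different checks.

```python
import os


def find_suspect_files(paths, min_bytes=1):
    """Return the paths that are missing or smaller than `min_bytes`.

    The threshold is an arbitrary illustration, not a recommended value.
    """
    suspect = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            suspect.append(path)
    return suspect
```

A speaker with any suspect files could then be left out of the subset you use for an experiment.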

Many HTK tools accept the -S option, which lets you provide a script file: a text file containing a list of file names (one per line). For example, using a script file allows you to train your models on data from multiple speakers. See this thread for an example:

See this thread for an example of how to get test results from multiple speakers:
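
A script file for the -S option can also be generated with a few lines of code. This is a minimal sketch in Python, assuming you already have a list of file paths for the speakers you selected. The function name write_script_file is hypothetical.

```python
def write_script_file(paths, script_path):
    """Write one file name per line, as described for script files above."""
    with open(script_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
```

You could then call write_script_file(selected_paths, "train.scp"), where selected_paths and the output name train.scp are placeholders of your choosing, and pass the resulting file to an HTK tool with -S.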

You can also read Atli Sigurgeirsson’s extremely helpful tutor notes for this part of the assignment.