Read all the instructions right the way through before you start: from the introduction right through to “digit sequences”.
Remember that the milestones will help you stay on track with this assignment.
- Introduction: An overview of this assignment
Speech Processing Assignment 2
Due date (2024-25 session): Thursday 5 December 2024, 12 noon. Submit your report on Learn.
What you will do
The goal of this exercise is to learn how an HMM-based Automatic Speech Recognition (ASR) system works, by building one. You will go through the entire process, starting with data wrangling, then model training, and finishing with calculating the Word Error Rate (WER) of the system(s) you have built.
Initially, you’ll build the simplest possible speaker-dependent system using data from just one person. In the first instance, this will be from a speaker in our existing database, and later you can (optionally) try building one based on your own voice. In doing this, you will get a basic understanding of each step in the process.
Then you’ll build speaker-independent systems that would actually have real-world applications. In this part of the exercise, you will design and execute a number of your own experiments, to deepen your understanding of ASR generally, and more specifically of what factors affect the WER.
While this assignment will primarily focus on isolated digit recognition, you can further your understanding by extending your system to recognise digit sequences by writing a new language model.
What you will submit
Your assignment submission will be a lab report that introduces your experiments, establishes background on HMM-based ASR, sets out hypotheses, and tests those hypotheses in experiments. You should also discuss the overall findings and implications of your results and draw appropriate conclusions. See the write-up instructions for more details.
The word limit for the report is 3000 words.
Practical Tasks
- Get started with the assignment materials and access to HTK on the PPLS AT lab computers (watch the overview video!)
- Build a speaker-dependent digit recognition system using a series of bash shell scripts
- See lab materials from the “Intermission” (week 7) for pointers on shell scripting
- Build a speaker-independent digit recognition system, extending the scripts you used for the speaker-dependent system
- Develop hypotheses about what training and testing configurations may improve or harm ASR performance for speaker-independent ASR
- These could be due to issues with the data or the model setup, or both.
- There are some experiments suggested in the assignment webpages, but you don’t have to stick to those!
- Design experiments to test these hypotheses using an existing ‘messy’ data set of previous students’ recordings
- Build and test a digit sequence recogniser (optional/bonus marks)
- Write up a lab report of your experiments
We will go over the theory behind HMM-based ASR in modules 7-10 of the course. So, you may find yourself running scripts for components you don’t fully understand yet when you start the assignment. This is ok! In the first labs, the main focus is on using scripting to build an ASR pipeline. We’ll go over more details over the following weeks.
How to submit your assignment
You should submit a single written document in pdf format for this assignment – see the pages on writing up for detailed instructions on what to include and how your report should be formatted.
Please be sure to read submission guidance on the PPLS hub.
You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).
You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version! If in doubt, please ask the teaching office and again, check the guidance on the PPLS hub.
Extensions and Late submissions
Extensions and special circumstances are handled by a centralised university system. Please see more details here: Extensions and Exceptional Circumstance Guidance. This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this please get in touch with the appropriate teaching office (UG, PG) and/or your student adviser.
How will this assignment be marked?
For this assignment, we are looking to assess your understanding of how HMM-based ASR works. The report should not be a discursive essay, nor merely documentation of the HTK functions that were used.
You will get marks for:
- Completing all parts of the practical, and demonstrating this by completing all requested sections of the report
- Writing a clear and concise background section (using figures as appropriate) that shows that you understand the differences between HMM-based ASR in theory and the actual implementation used in practice for this assignment.
- Establishing clear and testable hypotheses, explaining why (or why not) you think these hypotheses will hold with specific system configurations and datasets.
- You should use citations to support your claims from the course materials (and potentially your own literature review).
- You should draw on what you’ve learned about phonetics and signal processing to understand why different data subsets may produce different results.
- Designing and implementing experiments that test those hypotheses, and make the best use of the (limited, messy) data available to you.
- Clearly presenting the results of the experiments (using tables, figures and text descriptions) and linking those results back to your hypotheses.
- Discussing the implications and limitations of your experiments.
- Drawing conclusions with an appropriate level of certainty. Can you be sure that your conclusions are correct? Do you need to do more experiments? It is better to explain your uncertainty than falsely project confidence!
- You will also be marked based on the overall coherence and the strength of your argumentation – are your claims well supported by the results, and by citations where appropriate?
If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).
You can get higher marks by adding some more depth to your analysis in each section, but particularly the experimental sections. Can you do additional experiments that shed further light on your results and provide further evidence for (or against) your hypotheses?
As usual, we have a positive marking scheme. You will get marks for doing things correctly. If you write something incorrect, it won’t count towards your mark total but you won’t be marked down (in effect, it will be ignored). You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!
Tips for success
Read all the instructions right the way through before you start. You are given all the scripts for the basic part of the practical, so your work is mainly extending existing scripts, preparing data and running experiments. For the speaker-independent and digit sequence recognisers, relatively small changes will be required, starting from copies of the basic scripts. Don’t underestimate how long data cleaning/wrangling can take for your actual experiments.
We suggest that you use a log book to take notes: record each step you perform, so that you have a record for writing your report. Record the experimental conditions and the results carefully. If you modify scripts, record what you did (preferably, make a new script with a meaningful name each time). Be careful and methodical – in the later stages, pay particular attention to which speakers you are training and testing on. Make sure that every result is easily reproducible, simply by running the appropriate script(s).
Working with other students (and machines)
We encourage you to work with each other in the lab, to solve problems you encounter during the exercise – this is a positive thing for both the person doing the helping and the person being helped. Do not cross the line between positive, helpful interaction and cheating.
Note: It’s ok for you to work in pairs and help each other with coding but you must write up your reports independently.
If you do work with someone else, either in scripting or in designing experiments you must note that (including their exam number) in the introduction of your report.
It’s good to recognize that collaboration is an essential part of speech and language technology nowadays. Nobody does everything by themselves. However, if you let someone else do a lot of the technical work now, you may find it puts you back a step later, especially if you are intending to do more study or work in speech and language technology.
You can only really learn to do this sort of practical work by doing it.
If you do work with others, spread the load between you in terms of coding so you all get a chance to practice!
These things are definitely ok and, in fact, are encouraged:
- teaching someone how to write shell scripts, in general terms
- helping to spot a minor error in someone else’s script
- explaining what an HTK program does
- discussion of the theory
There’s an old saying that you never really understand something until you teach it. Try that with your fellow students.
These are some of the things you should not be doing:
- writing a complete shell script for someone else when they have no idea how it works
- copying parts of someone else’s script without understanding what it does
- presenting results in your report that you did not help to obtain yourself
Again, you must write up your report independently. So, the following are not ok:
- helping someone with the content of the written report
- reading anyone else’s written report before submission
- showing your written report to anyone else before submission
- working with someone else, but not mentioning that you did so in your report
Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you). If you want to try to use ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Please see the course guidance on the use of AI in assignments. If you use generative AI tools in your work, you must declare it!

How to get help
The first port of call for help is the course labs. The labs for the rest of the semester (modules 7-10) will be devoted to this assignment. It’s generally easier to troubleshoot issues with HTK and shell scripting in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!) but you will find that the 9am lab is quieter so you will get more attention than in the 4pm lab.
Teaching staff will give overviews or demonstrations at the beginning of each lab, so come on time.
You can also ask questions on the speech.zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class. You will need to be logged into speech.zone in order to see the assignment-specific forums.
You can also come to office hours or book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.
- HTK essentials: This is a widely-used toolkit in automatic speech recognition research.
We will be assuming you have version 3 of HTK, although everything should work with any recent version.
All HTK commands start with `H`; for example, `HList` is a tool to list the contents of data files (waveforms, MFCCs, etc.). The tools use command line flags and arguments extensively, as well as configuration files. Command line flags in capital letters are common across tools, such as:
- `-T 1` specifies basic tracing
- `-C resources/CONFIG` tells the tool to load the config file called `CONFIG` in the directory `resources`
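To make that concrete, here is a hypothetical invocation (the file name is illustrative; use any coded data file you have):

```bash
# List the contents of an MFCC file, with basic tracing enabled.
# The file name here is illustrative.
HList -T 1 mfcc/s1234567_test01.mfcc
```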
Simple scripts are provided for building a basic speaker-dependent digit recogniser from your own data. You will need to modify them slightly to make the more advanced system later. You will need to refer to the HTK manual.
HTK is not open source, so the only way to obtain the manual for HTK is to register on the HTK website. Do that now. You don’t need to read the manual yet, but it will be useful later.
To modify the scripts you’ll need to use a text editor, such as Atom or emacs. Never use a word processor!
Everywhere in this handout that you see `<username>`, you should replace it with your username. For example, `lab/<username>_test.mlf` would become `lab/s1234567_test.mlf`.

File formats can quickly get confusing. There are dozens of waveform formats, and various formats for model parameters, MFCCs and so on. We will record the speech data into Microsoft Wav (RIFF) format waveforms, which will have the extension .wav.
Please stick to the recommended names of files and directories – it will make it easier for me to help you (Edinburgh students only).
- Getting Started: Here are the files you need to get started and tips for how/where to do the assignment.
Overview video
The handout mentioned in the video has been replaced by these web pages. The videos are made on a Mac, but you will be working on Linux.
In this video:
- Read all the instructions (on these webpages) before you start
- The folder structure that is provided as a starting point
- The ‘recipe’ for creating an ASR system
- collect training and testing data (optional)
- label the data (optional)
- parameterise the waveform files as MFCCs (optional)
- initialise and train one HMM for each digit class
- make a simple grammar
- test your system
In this assignment you will work with the HTK hidden Markov model toolkit. This is a powerful and general purpose toolkit from Cambridge University for speech recognition research, but we won’t worry about many of the more advanced features in this exercise.
How to access HTK and the assignment materials
In the PPLS AT lab or via remote desktop
If you are working in the PPLS AT lab, you can use the version of HTK that is already installed. All you need to do is copy the provided scripts to your own working directory, so that you can edit and extend them yourself (See ‘Getting the scripts’ below)
Connecting via ssh
If you’re finding the remote desktop slow and you’re comfortable using the terminal, you can do everything via `ssh`, as you don’t have to listen to anything through HTK or use the GUI. Using the terminal (or PowerShell/PuTTY/Ubuntu terminal on Windows), you can connect using:

```
ssh your_uun@[AT lab machine name].ppls.ed.ac.uk
```
where you replace [AT lab machine name] with the name of one of the remotely accessible machines listed here.
You can open multiple terminals and ssh instances at once. This can be helpful if you are editing code in one terminal and running it in another.
Installing HTK on your own computer
If you want to try installing HTK on your own computer, there are some tips on this page, with instructions for downloading the data. You can also look on the assignment forums, but our ability to support self-installs is limited – success has been variable, especially with newer laptops. If you have a good internet connection, there isn’t that much to gain from installing HTK yourself versus working on the PPLS AT lab servers remotely (e.g., through an ssh connection).
NOTE: From now on, the instructions generally assume you are using the PPLS AT lab computers in-person or using the remote desktop.
Getting the scripts
Everyone will need to get a copy of the assignment scripts. If you are using the PPLS AT lab computers (remote or in-person), you can get the assignment scripts by running the following commands in the terminal (assuming you’ve already made the folder ~/Documents/sp):
```
cd ~/Documents/sp
## Don't forget the dot "." at the end of the next line!
cp -R /Volumes/Network/courses/sp/digit_recogniser .
cd digit_recogniser
```
You’ve now copied the scripts you need for the assignment. The data that these scripts use is on the servers in `/Volumes/Network/courses/sp/data`.
If you are working on the PPLS AT lab computers, you don’t have to copy over the data (e.g. feature and label files), so there’s nothing else you need. If you install HTK yourself you will need to download the data (see the tips below).
Now, you can go on to building your first speaker dependent ASR system!
- Installing HTK: Some tips for installing HTK
Installing HTK on your own computer
We really do not recommend installing HTK yourself unless you’re familiar with compiling code. It’s best to use the lab computers (remote or in person) or ssh. The code is quite old and not maintained, so you’re likely to run into dependency issues. The install instructions haven’t been tested recently.
If you do want to try installing HTK on your own computer, here are some instructions:
- Instructions for installing HTK on a MacBook
- Instructions for installing HTK on Ubuntu Linux
- Instructions for installing HTK on Ubuntu WSL
Those gists also include info on how to download the assignment data. I was only able to test these on somewhat older machines, so there’s a high probability that newer machines like the Mac M1 will run into problems.
Downloading the scripts and data to your computer
In general, you can download the data from the PPLS AT lab servers using `rsync`, as follows. You could also use a file transfer app like FileZilla or WinSCP; you just need to navigate to the corresponding directories.

You don’t need to download any data if you are working on the PPLS AT lab machines (in-person or remote)!
On your own computer, navigate to the directory you want to download into. The following assumes you’ll use the same directory structure as the instructions for the PPLS AT lab remote desktops (replacing `YOUR_UUN` with your actual UUN):

```
mkdir -p ~/Documents/sp
rsync -avz YOUR_UUN@scp1.ppls.ed.ac.uk:/Volumes/Network/courses/sp/digit_recogniser ~/Documents/sp
```
This will copy the scripts in the `digit_recogniser` directory on the AT lab servers to the directory `Documents/sp` in your home directory (remember, `~` is a shortcut for your home directory on the unix command line).

Now let’s get the data directory (e.g., features and labels from recordings):
```
mkdir -p ~/Documents/sp
rsync --exclude 'wav' -avz YOUR_UUN@scp1.ppls.ed.ac.uk:/Volumes/Network/courses/sp/data ~/Documents/sp
```
This will exclude the `wav` directory which, as you might expect, includes the recordings that the features were extracted from. It’s around 4 GB. If you want to download this data too, you can just remove the `--exclude 'wav'` flag.
Assuming you downloaded the data into `~/Documents/sp`, you can go to the directory with the digit recogniser scripts from your home directory:

```
cd ~/Documents/sp/digit_recogniser
```

To go to the directory with the data from your home directory:

```
cd ~/Documents/sp/data
```
After you have completed the assignment, you must delete any data that you downloaded. You only have permission to use it for the Speech Processing course, and no other purpose.
- Speaker-dependent system: The main task here is gathering the data. After that, just run the provided scripts.
- Collect and label the data (optional): Supervised machine learning needs labelled data. The task of collecting and labelling this data is often overlooked in textbooks. Performing this step yourself is OPTIONAL, but you still need to understand the process.
This year (2024-25), the data collection step is OPTIONAL to reduce workload. But you still need to understand how the data are created, and you will need to describe this in your report, so make sure you understand how you would collect data. You may want to create your own test set for certain experiments, so you will need to come back to this later if you decide to do that. Note, you won’t get more marks for simply recording and labelling the data. Instead, think about how you can use any new recordings to extend and deepen your experiments and analysis.
You need data to train the HMMs, and some test data to measure the performance. Be very careful to keep these two data sets separate! Record the ten digits, and make sure you pronounce “0” as “zero”. Please record 7 repetitions of each digit for training, and a further 3 repetitions of each digit for testing.
Record all the training data into one long audio file, and all the test data into another. You should randomise the order in which you record the data (the reasons for doing this with the test data will become clear in the connected digits part, later). Remember to say the digits as isolated words with silence in between. I find it’s easier to write down some prompts on paper before recording.
Make a note of the device you used and edit the `make_mfccs` script to include this information. You should take a look at the existing info.txt file and use compatible notation:
- username
- gender (m or f)
- mobile phone make and model written in lowercase without spaces (e.g., apple_iphone_8, or samsung_galaxy_a10)
- your accent: native speakers should specify their accent; non-natives should specify “NN” plus their native country
For example:

```
INFO=${USER}" m apple_iphone_8 NN France"
```
This allows your data to be combined with other people’s later on in the practical.
Your recording will probably be in mp3 format, so you will need to convert it to wav format before proceeding. The program `sox` can do this – ask a tutor for help in a lab session. Name the output files something like s1234567_train.wav and s1234567_test.wav.
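If you want to try the conversion yourself, here is a minimal sketch. The mono 16 kHz target is an assumption: check the sampling rate expected by `resources/CONFIG_for_coding`, and note that your `sox` build needs mp3 support:

```bash
# Hedged sketch: convert an mp3 recording to a mono 16 kHz wav file.
# (The rate is assumed; confirm it against the CONFIG_for_coding settings.)
sox s1234567_train.mp3 s1234567_train.wav channels 1 rate 16000
```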
Label your training data
You need to label the training data. Since the training data are in a single long wav file, there will be a single label file to go with it. There will be labels at the start and end of every recorded word. You don’t need to be very precise with these labels, but do be careful to include the entire word. It’s OK to include a little bit of preceding/following silence. It’s not OK to cut off the start or end of the word.
In this video:
- Start Wavesurfer and load the file you wish to label (e.g., s1234567_train.wav), selecting `transcription` as the display method. Create a waveform pane.
- You now have to label the speech. The convention is that labels are placed at the end of the corresponding region of speech. So, end all words with the name of the digit (use lowercase spelling: `zero`, `one`, `two`, …), and label anything else, including silence, as `junk`. You must label the silence because these labels are used to determine the start times of the digit tokens. So, make sure that there is a `junk` label at the start of each word (the very first label in the file will therefore be junk).
- Right-click in the labelling window and select `properties` in the window that opens: change the label file format to `WAVES` and click `ok`.
- Right-click in the labelling window again and select `save transcription as` to save your labels into a file called `<username>_train.lab` in the lab directory. During labelling, use `save all transcriptions` to save your work regularly.
NOTE: You can do this labelling in Praat instead, but you’ll have to convert the TextGrid to xwaves .lab format and also extract the individual wav files of the digits. See the .lab files in the assignment data.
Prepare your test data
We need isolated digit test data (each digit in a separate file) and can use Wavesurfer to split the waveform file into individual files, based on labels.
- Use Wavesurfer to create labels as before. However, do not label the speech with the digit being spoken; instead use the labels `<username>_test01`, `<username>_test02`, `<username>_test03` and so on. Place one of these labels at the end of each spoken digit and place a label called `junk` at the start of every digit. As for the training data, leave a small amount of silence around each spoken digit.
- To split the file into individual files, right-click and select `split sound on labels`.
- After this, you should have lots of files called `<username>_test01.wav`, `<username>_test02.wav`, `<username>_test03.wav` and so on, in the `wav` directory. You will also have one or more files with `junk` in their names. You don’t need those and can delete them.
Some versions of Wavesurfer create a subdirectory containing the split files, and add a numerical prefix to them. If that happens, rename your files (e.g., manually, or using a script if you know how) so they are called something like `s1234567_test01.wav` and move them up to the `wav` directory.

Listen to some or all of the files to check they are OK.
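If you want to script the renaming, here is a minimal sketch, assuming the split files ended up in a subdirectory with a prefix of the form `NN.` – check what your Wavesurfer version actually produced before running anything like this:

```bash
# Hedged sketch: move split files up into wav/ and strip an assumed
# numerical prefix such as "01." (adjust the pattern to what you see).
cd wav/split_files            # hypothetical subdirectory name
for f in *.wav; do
    mv "$f" "../${f#*.}"      # delete everything up to the first "."
done
```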
Create a master label file for the test data
Although all our test data are now in individual files, for convenience we can use a single MLF (master label file) for the test data. This label file contains the correct transcription of the test data, which we will need when we come to measure the accuracy of the recogniser. You will need to listen to the test wav files in order to write this label file.
Use your favourite plain text editor, such as Atom, but not a word processor, to create this file.
It needs to look something like this (note the magic first line, and the full stop (period) marking the end of the labels for each file; make sure there is a newline after the final full stop in the file):
#!MLF!# "*/s1234567_test01.lab" zero . "*/s1234567_test02.lab" nine .
and save the file as `lab/<username>_test.mlf`.

The format of the MLF file is as follows. The first line is a header: `#!MLF!#`. Following this are a series of entries of filenames and the corresponding labels. `"*/s1234567_test01.lab"` is the filename: make sure you include the quote marks and the `*`, because these are important. After the filename follows a list of labels (there is only one label per file in our case) and then a line with just a full stop, followed by a newline, which denotes the end of the labels for that file.

- Parameterise the data (optional): Our HMMs do not work directly with waveforms, but rather with features extracted from those waveforms. Performing this step yourself is OPTIONAL, but you still need to understand the process.
This year (2024-25), the data collection and feature extraction steps are OPTIONAL. Do not execute any of the commands in this section unless you have recorded your own data. However, you still need to understand the feature extraction process, so you can describe it in your report.
Each waveform file must be parameterised as MFCCs. This is done using the HTK command `HCopy`. The file called `CONFIG_for_coding`, which is provided, specifies the settings used for the MFCC analysis. You should keep these settings for this assignment.

Do not run `HCopy` by hand – instead, you should use the `make_mfccs` script. The script runs `HCopy` like this:

```
HCopy -T 1 -C resources/CONFIG_for_coding wav/file.wav mfcc/file.mfcc
```
Run the `make_mfccs` script as soon as you have finished preparing and checking the data – it will copy your data and labels into a shared directory, making them available to the rest of the class. If you make any changes to your data (e.g., correcting a label) then you must run the script again.
Make sure you are in the `digit_recogniser` subdirectory, then run:

```
./scripts/make_mfccs
```
Scroll back up through the output from this command to see if there were any errors. If there were, correct the problems (e.g., move files to the correct places) and run the command again.
Sharing your data
The `make_mfccs` script copies your data to a shared folder for use by other students. If you discover and fix any errors in your data, you should re-run this script.
- Train the acoustic models: We will use supervised machine learning (including the Baum-Welch algorithm) to train models on labelled data.
This year (2024-25), the preceding data collection steps are optional and should be skipped initially. Start here, at this step, to build your first speaker-dependent digit recogniser.
The training algorithms for HMMs require a model to start with. In HTK this is called the prototype model. Its actual parameter values will be learned during training, but the topology must be specified by you, the engineer. This is a classic situation in machine learning, where the design or choice of model type is made using expert knowledge (or perhaps intuition).
Select your prototype HMMs
In this video:
- We need a model to start with – in HTK this is called a “prototype model”
- A prototype model defines the:
- dimensionality of the observation vectors
- type of observation vector (e.g., MFCCs)
- form of the covariance matrix (e.g., diagonal)
- number of states
- parameters (mean and variance) of the Gaussian probability density function (pdf) in each state
- topology of the model, using a transition matrix in which zero probabilities indicate transitions that are not allowed and never will be, even after training (see the sketch below)
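To make that last point concrete, here is a sketch of what the transition matrix of a 5-state left-to-right prototype might look like in HTK’s model definition language. The probability values here are purely illustrative, not the provided ones; look in `models/proto` for the real definitions:

```
<TransP> 5
 0.0 1.0 0.0 0.0 0.0
 0.0 0.6 0.4 0.0 0.0
 0.0 0.0 0.6 0.4 0.0
 0.0 0.0 0.0 0.7 0.3
 0.0 0.0 0.0 0.0 0.0
```

Row i holds the probabilities of transitions out of state i. The zero entries define the topology: training only re-estimates the non-zero entries, so a transition that starts at zero stays at zero.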
You can experiment with different topologies – although for isolated digits, the only sensible thing to vary is the number of states. Models with varying numbers of states are provided in `models/proto` (remember, a 5-state model in HTK actually has only 3 emitting states). In your later experiments, modify the `initialise_models` script to try different prototype models. You might even want to create additional prototype models.

Now train the models
In this video:
- Training consists of two stages that we’ll look at in more detail in modules 9 and 10:
  - the model parameters are initialised using `HInit`, which performs a simple alignment of observations and states (uniform segmentation), followed by Viterbi training
  - then `HRest` performs Baum-Welch re-estimation
- A close look at the `initialise_models` script
In the simple scripts that you have been given as a starting point, each of the two stages of training is performed using a separate script. You can run them now:
```
./scripts/initialise_models
./scripts/train_models
```
Note: If you haven’t recorded your own data, running these scripts will throw an error. Can you see why from the error message? You can fix this by editing the scripts to use data from the user `simonk` rather than calling the command `whoami`. We’ll go over this in the lab, but there’s an example of this in Atli Sigurgeirsson’s extremely helpful tutor notes for this part of the assignment.

In your later experiments, you will want to automate things as much as possible. You could combine these two steps into a single script, or call them in sequence from a master script.
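For example, a minimal master script might look like this (a sketch; the script names match the provided ones, but you may want to add arguments or logging of your own):

```bash
#!/bin/bash
# Minimal master script: run both training stages in sequence,
# stopping immediately if either stage fails.
set -e
./scripts/initialise_models
./scripts/train_models
```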
- Language model: Even for isolated digits, we need a language model to compute the term P(W).
We’ll see in modules 9-10 that a finite state language model is very easy to combine with HMMs. In your report, explain why this is.
The language model computes the term P(W) – the probability of the word sequence W. We’re actually going to use what appears to be a non-probabilistic model, in the form of a simple grammar. To think of this in probabilistic terms, we can say that it assigns a uniform probability to all allowed word sequences, and zero probability to all other sequences.
Why do we need a model of word sequences when we are doing isolated digit recognition?
HTK has tools to help write grammars manually, then convert them into finite state models. A grammar for isolated digits is provided for you in the `resources` directory, in the file called `grammar`; have a look at it. The corresponding finite state model is in `grammar_as_network`.

The HTK manual contains all the information you need to understand the simple isolated word grammar, and how to extend that to connected digits.
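For orientation, an isolated-digit grammar in HTK’s grammar notation might look something like the following sketch (this is a guess at the provided file, not a copy of it; `|` separates alternatives, and the parentheses enclose what one utterance may contain):

```
$digit = zero | one | two | three | four | five | six | seven | eight | nine;
( $digit )
```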
Later, if you do the connected digits experiment, you will write a new grammar, from which you can automatically create a finite state network version using the `HParse` tool.

- Evaluation: By comparing the recogniser's output with the hand-labelled test data, we can compute the Word Error Rate (WER).
Now we are ready to run the recogniser on some test data. You should run this on some existing isolated test digits (one digit per wav file) and the MFCC files derived from them. We run the Viterbi decoder using `HVite`; the script `recognise_test_data` does this. The output is stored in the `rec` directory. Look at the recogniser output and compare it to the correct answers. Calculate the Word Error Rate (WER) using the `results` script.

Again, you’ll need to edit the scripts to use a specific user (e.g., `simonk`) and the full `data` directory, rather than calling `whoami` and using the `data_upload` directory.
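For reference, the Word Error Rate combines the three types of recognition error – substitutions (S), deletions (D) and insertions (I) – relative to the number of words (N) in the reference transcription:

WER = 100 × (S + D + I) / N

For isolated digit recognition, where the grammar allows exactly one word per utterance, insertions and deletions cannot occur, so WER reduces to the percentage of misrecognised digits.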
- Speaker-independent systems: It's time to make the system more useful for a real application.
- Data: You can use data from other students in the class, and from previous years, to conduct your experiments.
The data from over 400 students is available for you to use in your experiments. You can find it on the PPLS AT Lab servers in the directory:

```
/Volumes/Network/courses/sp/data
```
In that directory, you will find a file called `info.txt` that describes the available data.

If you record your own data this year (which was optional), we can also (eventually) add that to the collection, and to `info.txt`. But you don’t need to wait: all the data from previous years is already available.

The format of `info.txt` is quite simple. If you already have some coding experience, then you’ll probably want to automatically parse this file and pull out subsets of speakers with the characteristics you want for each experiment. If you don’t have coding experience, try doing this in a spreadsheet.

The data are not perfect, and you might want to think about how to detect bad data, so you can exclude those speakers from your experiments.
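As a sketch of the parsing idea, something like `awk` can pull out a speaker subset. The field layout assumed below (username, gender, microphone, accent, matching the `INFO=` example from the data collection section) is an assumption – check the real file first:

```bash
# Hedged sketch: print the usernames of female speakers recorded on an
# iPhone (assumed columns: username gender microphone accent).
awk '$2 == "f" && $3 ~ /iphone/ { print $1 }' \
    /Volumes/Network/courses/sp/data/info.txt > my_speaker_subset.txt
```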
Many HTK tools accept the `-S` option, which allows you to input a script file: a text file that contains a list of file names (one per line). For example, using a script file will allow you to train your models on data from multiple speakers. See this thread for an example:

See this thread for an example of how to get test results from multiple speakers:
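As a sketch of the mechanism, the following builds a script file of training features for several speakers and passes it to an HTK tool via `-S`. The directory layout, file naming, choice of tool and its other options are all assumptions here; adapt them to what the provided scripts actually do:

```bash
# Hedged sketch: collect training MFCC files for a list of speakers
# (paths and naming are illustrative) into a script file...
for spk in s1111111 s2222222 s3333333; do
    ls /Volumes/Network/courses/sp/data/mfcc/${spk}*train*.mfcc
done > train_files.scp

# ...then hand the whole list to an HTK tool with -S, e.g. re-estimating
# the model for "one" with HRest (label-handling options omitted here).
HRest -T 1 -S train_files.scp -M models/trained models/initialised/one
```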
You can also read Atli Sigurgeirsson’s extremely helpful tutor notes for this part of the assignment.
- Skills you will need: To perform your more advanced experiments, it's worth mastering a few new skills. They will help you run more, and better, experiments.
Experimental design
You need to think carefully about each experiment, and what it is designed to discover. Do not simply run random, ill-thought-out experiments and then try to make sense of the results. Instead, form a hypothesis, and express it as a falsifiable statement, like this:
When training models only on speaker A, the word error rate will be lower for speaker A’s test data than for test data from speaker B.
and then design an experiment to test whether that is true. In the report you should also take care to explain why you formulated the hypothesis the way you did (i.e., why did you think it might be true? Do you have evidence from the literature? Observations of other speakers?)
The key to a good experimental design here is to control any factors that you are not interested in, and only investigate a single factor (e.g., speakers’ gender) per experiment.
Shell scripting
A few fairly simple shell scripting techniques will help you enormously here. You can completely automate your experiments. This not only makes them more efficient for you, it also makes it easier to reproduce your results.
There are several resources in the “Intermission” lab tab if you need some help getting started with the shell, particularly this intro to shell scripting. You may also find this bash primer by Jason Fong helpful!
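To illustrate the payoff, here is a sketch of a fully automated experiment loop. The condition names, script-file arguments and results paths are all hypothetical; the provided scripts will need some adaptation before they can be driven like this:

```bash
#!/bin/bash
# Hedged sketch: run the whole train/test pipeline for several conditions
# and keep one results file per condition for the report.
set -e
for expt in female_headset male_headset female_phone; do
    ./scripts/initialise_models   "scp/${expt}_train.scp"  # hypothetical argument
    ./scripts/train_models        "scp/${expt}_train.scp"
    ./scripts/recognise_test_data "scp/${expt}_test.scp"
    ./scripts/results > "results/${expt}.txt"
done
```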
- Suggested experiments: These example experiments are just to get you thinking. You should devise interesting experiments of your own.
To get started, use the already-trained models from the speaker-dependent experiment.
If you use models trained only on your own speech to recognise the Test set of another speaker, what Word Error Rate do you expect?
That will give you some clues, but isn’t a very sophisticated experiment: no real-world system would attempt to do that. Your main experiments should investigate speaker-independent systems trained and tested on larger numbers of speakers.
The testing speakers must be distinct from the training speakers. You need to control all factors that are not of interest, including accent, gender, microphone type, and amount of training data. Here are some possible experiment designs:
- The effect of gender, with simplistic control over accent and microphone
- Training set: the training data of 20 female UK English speakers using headset microphones
- Test set A: the test data of 20 female UK English speakers not in the training set, also using headset microphones
- Test set B: the test data of 20 male UK English speakers (obviously not in the training set), also using headset microphones
- The effect of gender, with more sophisticated control over accent and microphone (version 1)
- Training set A: the training data of 50 female speakers, with a mixture of accents and microphones
- Training set B: the training data of 50 male speakers, with a mixture of accents and microphones in the same proportions as training set A
- Test set: the test data of 50 female speakers not in training set A, with a mixture of accents and microphones in the same proportions as training set A
- The effect of gender, with more sophisticated control over accent and microphone (version 2)
- Training set: the training data of 50 female speakers with a mixture of accents and microphones
- Test set A: the test data of 50 female speakers not in the training set, with a mixture of accents and microphones in the same proportions as the training set
- Test set B: the test data of 50 male speakers, with a mixture of accents and microphones in the same proportions as the training set
Some of these designs are better than others. Can you work out the pros and cons of each design?
What effect does microphone type have?
Design an experiment to discover whether the microphone type is important. This might involve discovering if some microphones give lower Word Error Rate than others, or finding out the effect of mismatches between the Training and Test sets. Remember to control all the other factors.
You can perform equivalent experiments to investigate the gender and accent factors too.
What effect does the amount of training data have?
In machine learning, it’s often said that more training data is better. But is that always the case? Design some experiments to explore this. Include cases where the Training and Test sets are well-matched (e.g., in gender and/or accent, etc) and cases where there is mismatch. What is more important: matched training data, or just more data?
These questions are very important in commercial systems: it costs a lot of money to obtain the training data, so we want to collect the most useful data we can.
Use the forums to discuss your ideas for more advanced experiments. You’ll need to log in to access the forum for this exercise.
- Digit sequences (optional): This is not as hard as you might think. If you attempt this, restrict your experiments to a speaker-dependent system using either your own data (if you collected some) or that from a single existing speaker in the database.
This is a bit harder than isolated digits, but not much. The key is to realise that there may be silence between the words, so you will need an acoustic model (i.e., an HMM) for that. Hint: you labelled the silence earlier as `junk`.

You’ll also need to write a new language model in the form of an HTK grammar (hint: see the HTK manual) and then convert it to the finite-state format that `HVite` needs, like this:

```
$ HParse resources/digit_sequence_grammar resources/digit_sequence_grammar_as_network
```
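As a starting point, a digit-sequence grammar might look something like this sketch, in HTK’s grammar notation where `< >` means one or more repetitions and `[ ]` means optional. How best to place the `junk` (silence) model is deliberately left for you to work out:

```
$digit = zero | one | two | three | four | five | six | seven | eight | nine;
( [junk] < $digit [junk] > )
```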
Evaluating the output of the recogniser is no longer so easy – it might insert or delete digits, as well as substitute incorrect digits. You can use your existing training data, but you’ll need to make a different test set, containing digit sequences, and a corresponding label file.
How to do this is described in the HTK manual (3.4.1), Chapter 12: Networks, Dictionaries and Language Models. Specifically, have a look at section 12.3.
You can also look at the HTK manual entry for `HResults` – there are some useful options for showing more detail of the scoring procedure, such as the flag `-t`.
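A hedged example of what a scoring command might look like (the file names are illustrative; check the `results` script for how `HResults` is actually invoked in this assignment):

```bash
# Score the .rec output against a reference MLF, printing more detail
# of the scoring procedure with -t (file names are illustrative).
HResults -t -I lab/s1234567_seqtest.mlf resources/wordlist rec/*.rec
```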
Some helpful forum posts:
- https://speech.zone/forums/topic/finding-users-with-digit-sequences
- https://speech.zone/forums/topic/how-do-i-get-the-results-for-multiple-speakers-at-once/
- https://speech.zone/forums/topic/recognising-digit-sequences/
Remember you don’t want to count `junk` labels when scoring! (See the last forum post above.)

- Writing up: You're going to write up a lab report that describes how HMM-based ASR works, and reports your experiments.
- Lab report: Write up your findings from the exercise in a lab report, to show that you can make connections between theory and practice.
You should write a lab report about this speech recognition practical. Keep it concise and to the point, but make sure you detail your findings from using HTK, covering both the theory and how it is implemented in practice. You should also report your experimental work, clearly explaining your experimental design and your results. Use tables and graphs to present the results. Do not cut and paste Terminal output!
What exactly is meant by “lab report”?
It is not a discursive essay. It is not merely documentation of commands that you ran and what output you got. It is a factual report of what you did in the lab that demonstrates what you learnt and how you can relate that to the theory from lectures. You will get marks for:
- completing all parts of the practical, and demonstrating this in the report
- a clear demonstration that you understand what each HTK tool used does, and that you can link that to the underlying theory
- clear and concise writing, and effective use of diagrams, tables and graphs
How much background material should there be?
Do not spend too long simply restating material from lectures or textbooks without telling the reader why you are doing this.
Do provide enough background to demonstrate your understanding of the theory, and to support your explanations of how HTK works and your experiments. Use specific and carefully chosen citations. Cite textbooks and papers. You will get more marks if you cite better and more varied sources (e.g., going beyond the essential course readings). If you only cite from the main textbook, this will not get you top marks. Avoid citing lecture slides or videos, unless you really cannot find any other source (which is unlikely). Make sure everything you cite is correctly listed in the bibliography.
Writing style
The writing style of the report should be similar to that of a journal paper. Don’t list every command you typed! You do not need to include your shell scripts. Use diagrams to illustrate your report, and tables and/or graphs to summarise your results. Do not include any verbatim output copied from the Terminal: you will not receive any marks for this.
Additional tips
You do not need to list the individual speakers in each of your data sets, but do carefully describe the data (e.g., “20 male native speakers of Scottish English using laptop microphones”). You might use tables to present this in a compact form, and perhaps give short names or acronyms to each set, such as “20-male-SC-laptop”.
- Formatting instructions: Specification of word limits and other rules that you must follow, plus the structured marking scheme.
Please be sure to check general turnitin submission guidance on the PPLS hub.
You must:
- submit a single document in PDF format. When submitting to Learn, the electronic submission title must be in the format “exam number_wordcount” and nothing else (e.g., “B012345_2864”)
- state your exam number at the top of every page of the document
- state the word count below the title on the first page in the document (e.g., “word count: 2864”)
- use a line spacing of 1.5 and a minimum font size of 11pt (and that applies to all text, including within figures, tables, and captions)
- If you work with a partner, you should also note their exam number either where you list your exam number in the title, or in a footnote from the introduction.
Marking is strictly anonymous. Do not include your name or your student number – only provide your exam number!
Structure
Length limits
- Word limit: 3000 words, excluding bibliography and text within figures and tables, but including all other text (such as headings, footnotes, captions, examples). Numerical data does not count as text.
- The word limit is a hard limit: there is no +10% allowance.
- Text in figures and tables doesn’t contribute to the word count but, again, use this wisely! Don’t just shove things into tables because they don’t fit in your text!
- Note: You should assume that the markers will only read up to the word limit (i.e. 3000 words). We have had to enforce this because some people have submitted assignments that were hugely over the word limit and it’s not fair on the markers to ask them to read so much over the word count.
- Page limit: no limit, but avoid blank pages
- Figures & tables: no limit on number
Sections and headings
You must use the following structure and section numbering for your report. It corresponds to the structured marking scheme and will make your marker’s life so much easier!
- 1 Introduction
- 2 Theory
- 2.1 Data collection and acoustic features
- 2.2 Training HMMs
- 2.3 Language modelling
- 2.4 Recognition using HMMs
- 3 Experiments
- 3.1 [Insert your first experiment title here]
- 3.2 [Insert your second experiment title here]
- 3.3 etc, for as many experiments as you wish
- 4 Discussion and overall conclusion
You can focus on your speaker-dependent isolated digit system in Sections 2.1 to 2.4 to illustrate the differences between theory and practice. Alternatively, in 2.1-2.4, you may instead go for a more general description of your speaker-independent system. As you don’t have to record and label your own data this year, you can describe this conceptually. If you want to use a concrete example to help illustrate, you can reference a speaker-dependent system built using an existing speaker’s data set (e.g., ‘simonk’). Although data collection is optional this year, you still need to describe what data are required, how they need to be labelled and so on – the marks for this are under “2.1 Data collection and acoustic features”.
You are advised to structure each experimental section (3.1, 3.2, etc) by introducing and motivating a hypothesis, describing your experimental design, reporting the results, and drawing a conclusion. You should aim to write up 3-4 experiments. You may be able to fit more in, but don’t forget you need to motivate each experiment and analyse the results. It’s fine to design your experiments based on the suggestions given in the instructions (e.g. gender, microphone type, amount of training data), but you may find there are more interesting things to explore. Experimental designs that allow you to draw conclusions across different experiments are not required, but they are definitely appreciated.
Figure, graphs and tables
You should ensure that figures and graphs are large enough to read easily and are of high-quality (with a very strong preference for vector graphics, and failing that high-resolution images). You are strongly advised to draw your own figures which will generally attract a higher mark than a figure quoted from another source.
There is no page limit, so there is no reason to have very small figures.
Your work may be marked electronically or we may print hardcopies on A4 paper, so it must be legible in both formats. In particular, do not assume the markers can “zoom in” to make the figures larger.
References
We generally prefer APA style references (i.e. Author, Year). Other citation styles are also ok as long as they are implemented correctly and consistently.
Declaration of use of AI
Please briefly describe any external Artificial Intelligence (AI) related tools you used in doing this assignment. This includes grammar checking and use of generative AI based chat apps to investigate the topic. If you used tools based on large language models (e.g. ChatGPT), describe the prompts that you used in doing this work. If you did not use these tools in your work, you can simply write “none”.
Text in this section does not count towards the 3000 word limit.
Remember, you should not use generative AI to generate text for your submitted report. Please see the course AI policy here: Speech Processing AI policy. Please be honest! You will not be penalised for use of these tools as long as you stick to this policy.
Marking scheme
You are strongly advised to read the marking scheme because it will help you focus your effort and decide how much to write in each section of your report.
Here is the structured marking sheet for this assignment (this is the version for 2023-24, which is the same as previous years). Although data collection is optional this year, you still need to describe what data are required, how they need to be labelled and so on – the marks for this are under “2.1 Data collection and acoustic features”. If you did collect your own data, you may obtain extra marks under the “Additional” section.
You should also check out the writing tips provided for the Festival exercise – they apply to this one too.