Build your own digit recogniser

A simple but functional digit recogniser built from scratch: record and label data, train HMMs, create a language model, and recognise the test data. Extend to other speakers & digit sequences.

 

Read all the instructions right the way through before you start: from the introduction right through to “digit sequences”.

Remember that the milestones will help you stay on track with this assignment.

  1. Introduction
    An overview of this assignment

    Speech Processing Assignment 2

    Due date (2025-26 session): Thursday 4 December  2025, 12 noon.  Submit your report on Learn. 

    What you will do

    The goal of this exercise is to learn how an HMM-based Automatic Speech Recognition (ASR) system works, by building one. You will go through the entire process, starting with data wrangling, moving on to model training, and finishing with calculating the Word Error Rate (WER) of the system(s) you have built.

    Initially, you’ll build the simplest possible speaker-dependent system using data from just one person. In the first instance, this will be from a speaker in our existing database, and later you can (optionally) try building one based on your own voice. In doing this, you will get a basic understanding of each step in the process.

    Then you’ll build speaker-independent systems that would actually have real-world applications. In this part of the exercise, you will design and execute a number of your own experiments, to deepen your understanding of ASR generally, and more specifically of what factors affect the WER.

    While this assignment will primarily focus on isolated digit recognition, you can further your understanding by extending your system to recognise digit sequences by writing a new language model.

    What you will submit

    Your assignment submission will be a lab report that introduces your experiments, establishes background on HMM-based ASR, establishes hypotheses, and tests these hypotheses in experiments. You should also discuss the overall findings and implications of your results and draw appropriate conclusions. See the write-up instructions for more details.

    The word limit for the report is 3000 words.

    Practical Tasks

    1. Get started with the assignment materials and access to HTK on the PPLS AT lab computers (watch the overview video!)
    2. Build a speaker-dependent digit recognition system using a series of bash shell scripts
      1. See lab materials from the “Intermission” (week 7) for pointers on shell scripting
    3. Build a speaker-independent digit recognition system, extending the scripts you used for the speaker-dependent system
    4. Develop hypotheses about what training and testing configurations may improve or harm ASR performance for speaker-independent ASR
      1. These could be due to issues with the data or the model setup, or both.
      2. There are some experiments suggested in the assignment webpages, but you don’t have to stick to those!
    5. Design experiments to test these hypotheses using an existing ‘messy’ data set of previous students’ recordings
    6. Build and test a digit sequence recogniser (optional/bonus marks)
    7. Write up a lab report of your experiments

    We will go over the theory behind HMM-based ASR in modules 7-10 of the course.  So, you may find yourself running scripts for components you don’t fully understand yet when you start the assignment.  This is ok!  In the first labs, the main focus is on using scripting to build an ASR pipeline.  We’ll go over more details through the next weeks.

    How to submit your assignment

    You should submit a single written document in pdf format for this assignment – see the pages on writing up for detailed instructions on what to include and how your report should be formatted.

    Please be sure to read submission guidance on the PPLS hub

    You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).

    You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version!  If in doubt, please ask the teaching office and again, check the guidance on the PPLS hub.

    Extensions and Late submissions

    Extensions and special circumstances are handled by a centralised university system. Please see more details here: Extensions and Exceptional Circumstance Guidance.  This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this please get in touch with the appropriate teaching office (UG, PG) and/or your student adviser.

    How will this assignment be marked?

    For this assignment, we are looking to assess your understanding of how HMM-based ASR works. The report should not be a discursive essay, nor merely documentation of the HTK functions that were used.

    You will get marks for:

    • Completing all parts of the practical, and demonstrating this by completing all requested sections of the report
    • Writing a clear and concise background section (using figures as appropriate) that shows that you understand the differences between HMM-based ASR in theory and the actual implementation used in practice for this assignment.
    • Establishing clear and testable hypotheses, explaining why (or why not) you think these hypotheses will hold with specific system configurations and datasets.
      • You should use citations to support your claims from the course materials (and potentially your own literature review).
      • You should draw on what you’ve learned about phonetics and signal processing to understand why different data subsets may produce different results.
    • Designing and implementing experiments that test those hypotheses, and make the best use of the (limited, messy) data available to you.
    • Clearly presenting the results of the experiments (using tables, figures and text descriptions) and linking those results back to your hypotheses.
    • Discussing the implications and limitations of your experiments.
    • Drawing conclusions with an appropriate level of certainty. Can you be sure that your conclusions are correct? Do you need to do more experiments? It is better to explain your uncertainty than falsely project confidence!
    • You will also be marked on the overall coherence and the strength of your argumentation – are your claims well supported by the results, and by citations where appropriate?

    If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

    You can get higher marks by adding some more depth to your analysis in each section, but particularly the experimental sections. Can you do additional experiments that shed further light on your results and provide further evidence for (or against) your hypotheses?

    As usual, we have a positive marking scheme.  You will get marks for doing things correctly.  If you write something incorrect, it won’t count towards your mark total but you won’t be marked down (in effect, it will be ignored).  You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!

    Tips for success

    Read all the instructions right the way through before you start. You are given all the scripts for the basic part of the practical, so your work is mainly extending existing scripts, preparing data and running experiments. For the speaker-independent and digit sequence recognisers, relatively small changes will be required, starting from copies of the basic scripts.  Don’t underestimate how long data cleaning/wrangling can take for your actual experiments.

    We suggest that you use a log book to take notes: record each step you perform, so that you have a record for writing your report. Record the experimental conditions and the results carefully. If you modify scripts, record what you did (preferably, make a new script with a meaningful name each time). Be careful and methodical – in the later stages, pay particular attention to which speakers you are training and testing on. Make sure that every result is easily reproducible, simply by running the appropriate script(s).

    Working with other students (and machines)

    We encourage you to work with each other in the lab, to solve problems you encounter during the exercise – this is a positive thing for both the person doing the helping and the person being helped. Do not cross the line between positive, helpful interaction and cheating.

    Note: It’s ok for you to work in pairs and help each other with coding but you must write up your reports independently.

    If you do work with someone else, either in scripting or in designing experiments you must note that (including their exam number) in the introduction of your report.

    It’s good to recognise that collaboration is an essential part of speech and language technology nowadays. Nobody does everything by themselves. However, if you let someone else do a lot of the technical work now, you may find it puts you back a step later, especially if you are intending to do more study or work in speech and language technology.

    You can only really learn to do this sort of practical work by doing it.

    If you do work with others, spread the load between you in terms of coding so you all get a chance to practice!

    These things are definitely ok and, in fact, are encouraged:

    • teaching someone how to write shell scripts, in general terms
    • helping to spot a minor error in someone else’s script
    • explaining what an HTK program does
    • discussion of the theory

    There’s an old saying that you never really understand something until you teach it. Try that with your fellow students.

    These are some of the things you should not be doing:

    • writing a complete shell script for someone else when they have no idea how it works
    • copying parts of someone else’s script without understanding what it does
    • presenting results in your report that you did not help to obtain yourself

    Again, you must write up your reports independently. So, the following are not ok:

    • helping someone with the content of the written report
    • reading anyone else’s written report before submission
    • showing your written report to anyone else before submission
    • working with someone else, but not mentioning that you did so in your report

    Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you). If you want to try to use ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Please see the course guidance on the use of AI in assignments. If you use generative AI tools in your work, you must declare it!

    How to get help

    The first port of call for help is the course labs. The labs for the rest of the semester (modules 7-10)  will be devoted to this assignment. It’s generally easier to troubleshoot issues with HTK and shell scripting in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!) but you will find that the 9am lab is quieter so you will get more attention than in the 4pm lab.

    Teaching staff will give overviews or demonstrations at the beginning of each lab, so come on time.

    You can also ask questions on the speech.zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class.  You will need to be logged into speech.zone in order to see the assignment-specific forums.

    You can also come to office hours or book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.

     



    1. HTK essentials
      This is a widely-used toolkit in automatic speech recognition research.

      We will be assuming you have version 3 of HTK, although everything should work with any recent version.

      All HTK commands start with H; for example, HList is a tool to list the contents of data files (waveforms, MFCCs, etc.). The tools make extensive use of command line flags and arguments, as well as configuration files. Command line flags in capital letters are common across tools, such as:

      • -T 1 specifies basic tracing
      • -C resources/CONFIG tells the tool to load the config file called CONFIG in the directory resources

      Simple scripts are provided for building a basic speaker-dependent digit recogniser from your own data. You will need to modify them slightly to make the more advanced system later. You will need to refer to the HTK manual.

      HTK is not open source; the usual way to obtain the manual is to register on the HTK website. However, you can also browse the manual on the web: HTK book as webpage. If you are logged into speech.zone you can also find a copy in the assignment forums. You don’t need to read the manual yet, but it will be useful later.

      To modify the scripts you’ll need to use a text editor, such as Atom or emacs. Never use a word processor!

      Everywhere in this handout that you see <username>, you should replace it with your username. For example lab/<username>_test.mlf would become lab/s1234567_test.mlf.

      File formats can quickly get confusing. There are dozens of waveform formats, and various formats for model parameters, MFCCs and so on. We will record the speech data into Microsoft Wav (RIFF) format waveforms, which will have the extension .wav, for example.

      Please stick to the recommended names of files and directories – it will make it easier for me to help you (Edinburgh students only).


  2. Getting Started
    Here are the files you need to get started and tips for how/where to do the assignment

    Overview video

    The handout mentioned in the video has been replaced by these web pages. The videos are made on a Mac, but you will be working on Linux.

    In this video:

    • Read all the instructions (on these webpages) before you start
    • The folder structure that is provided as a starting point
    • The ‘recipe’ for creating an ASR system
      1. collect training and testing data (optional)
      2. label the data (optional)
      3. parameterise the waveform files as MFCCs (optional)
      4. initialise and train one HMM for each digit class
      5. make a simple grammar
      6. test your system

    In this assignment you will work with the HTK hidden Markov model toolkit. This is a powerful and general purpose toolkit from Cambridge University for speech recognition research, but we won’t worry about many of the more advanced features in this exercise.

    How to access HTK and the assignment materials

    In the PPLS AT lab or via remote desktop

    If you are working in the PPLS AT lab, you can use the version of HTK that is already installed.  All you need to do is copy the provided scripts to your own working directory, so that you can edit and extend them yourself (See ‘Getting the scripts’ below)

    Connecting via ssh

    If you’re finding the remote desktop slow and you’re comfortable using the terminal, you can do everything via ssh, since nothing in this assignment requires you to listen to audio through HTK or to use the GUI. Using the terminal (or PowerShell/PuTTY/Ubuntu terminal on Windows), you can connect using:

    ssh your_uun@[AT lab machine name].ppls.ed.ac.uk

    where you replace [AT lab machine name] with the name of one of the remotely accessible machines listed here.

    You can open multiple terminals and ssh instances at once. This can be helpful if you are editing code in one terminal and running it in another.

    Alternative: Installing HTK on your own computer (not recommended)

    Please note: As of 5 Nov 2025, the actual HTK website is not working, so we strongly recommend you use the installation on the lab machines instead. 

    If you want to try installing HTK on your own computer, there are some tips on this page with instructions for downloading the data. You can also look on the assignment forums, but our ability to support self-installs is limited – success has been variable, especially with newer laptops. If you have a good internet connection, there isn’t much to gain from installing HTK yourself versus working on the PPLS AT lab servers remotely (e.g. through an ssh connection).

    NOTE: From now on, the instructions generally assume you are using the PPLS AT lab computers in-person or using the remote desktop. 

    Getting the scripts

    Everyone will need to get a copy of the assignment scripts. If you are using the PPLS AT lab computers (remote or in-person), you can get the assignment scripts by running the following commands in the terminal (assuming you’ve already made the folder ~/Documents/sp)

    cd ~/Documents/sp
    ## Don't forget the dot "." at the end of the next line!
    cp -R /Volumes/Network/courses/sp/digit_recogniser .
    cd digit_recogniser
    

    You’ve now copied the scripts you need for the assignment.  The data that these scripts use is on the servers in /Volumes/Network/courses/sp/data

    If you are working on the PPLS AT lab computers, you don’t have to copy over the data (e.g. feature and label files), so there’s nothing else you need.  If you install HTK yourself you will need to download the data (see the tips below).

    Now, you can go on to building your first speaker dependent ASR system! 

  3. Speaker-dependent system
    The main task here is gathering the data. After that, just run the provided scripts.
    1. Collect and label the data (optional)
      Supervised machine learning needs labelled data. The task of collecting and labelling this data is often overlooked in textbooks. Performing this step yourself is OPTIONAL, but you still need to understand the process.

      The data collection step is OPTIONAL, to reduce workload. But you still need to understand how the data are created, and you will need to describe this in your report, so make sure you understand how you would collect data. You may want to create your own test set for certain experiments, so you will need to come back to this later if you decide to do that. Note: you won’t get more marks for simply recording and labelling the data. Instead, think about how you can use any new recordings to extend and deepen your experiments and analysis.

      You need data to train the HMMs, and some test data to measure the performance. Be very careful to keep these two data sets separate! Record the ten digits, and make sure you pronounce “0” as “zero”. Please record 7 repetitions of each digit for training, and a further 3 repetitions of each digit for testing.

      Record all the training data into one long audio file, and all the test data into another. You should randomise the order in which you record the data (the reasons for doing this with the test data will become clear in the connected digits part, later). Remember to say the digits as isolated words with silence in between. I find it’s easier to write down some prompts on paper before recording.
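      If you script your prompts, randomising the order is easy. A minimal sketch, assuming GNU coreutils (`seq` and `shuf`) are available:

```shell
# Generate a randomised prompt sheet: 7 training tokens of each digit,
# in shuffled order. Assumes GNU coreutils (seq, shuf) are available.
digits="zero one two three four five six seven eight nine"
for d in $digits; do
  for i in $(seq 1 7); do echo "$d"; done
done | shuf > train_prompts.txt

wc -l < train_prompts.txt   # 70 lines = 10 digits x 7 repetitions
```

      Print the list out (or read it from the screen) while recording, and make a similar shuffled list of 3 repetitions per digit for the test set.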

      Make a note of the device you used and edit the make_mfccs script to include this information. You should take a look at the existing info.txt file and use compatible notation:

      1. username
      2. gender (m or f)
      3. mobile phone make and model written in lowercase without spaces (e.g., apple_iphone_8, or samsung_galaxy_a10)
      4. your accent: native speakers should specify their accent, non-natives should specify “NN” plus their native country

      For example:

      INFO=${USER}" m apple_iphone_8 NN France"
      

      This allows your data to be combined with other people’s later on in the practical.

      Your recording will probably be in mp3 format, so you will need to convert it to wav format before proceeding. The program sox can do this – ask a tutor for help in a lab session. Name the output files something like s1234567_train.wav and s1234567_test.wav .

      Label your training data

      You need to label the training data. Since the training data are in a single long wav file, there will be a single label file to go with it. There will be labels at the start and end of every recorded word. You don’t need to be very precise with these labels, but do be careful to include the entire word. It’s OK to include a little bit of preceding/following silence. It’s not OK to cut off the start or end of the word.

      In this video:

      1. Start wavesurfer and load the file you wish to label (e.g., s1234567_train.wav) and select transcription as the display method. Create a waveform pane.
      2. You now have to label the speech. The convention is that labels are placed at the end of the corresponding region of speech. So, end all words with the name of the digit (use lowercase spelling: zero, one, two, …), and label anything else, including silence, as junk. You must label the silence because these labels are used to determine the start times of the digit tokens. So, make sure that there is a junk label at the start of each word (the very first label in the file will therefore be junk).
      3. Right click in the labeling window and select properties in the window that opens: change the label file format to WAVES and click ok
      4. Right click in the labelling window again and select save transcription as to save your labels into a file called <username>_train.lab in the lab directory. During labelling, use save all transcriptions to save your work regularly.

      NOTE: You can do this labelling in Praat instead, but you’ll have to convert the TextGrid to xwaves .lab format and also extract the individual wav files of the digits. See the .lab files in the assignment data.

      Prepare your test data

      We need isolated digit test data (each digit in a separate file) and can use Wavesurfer to split the waveform file into individual files, based on labels.

      1. Use wavesurfer to create labels as before. However, do not label the speech with the digit being spoken but instead use the labels <username>_test01 , <username>_test02, <username>_test03 and so on. Place one of these labels at the end of each spoken digit and place a label called junk at the start of every digit. As for the training data, leave a small amount of silence around each spoken digit.
      2. To split the file into individual files, right-click and select split sound on labels.
      3. After this, you should have lots of files called <username>_test01.wav, <username>_test02.wav, <username>_test03.wav and so on, in the wav directory. You will also have one or more files with junk in their names. You don’t need those and can delete them.

      Some versions of Wavesurfer create a subdirectory containing the split files, and add a numerical prefix to them. If that happens, rename your files (e.g., manually, or using a script if you know how) so they are called something like s1234567_test01.wav and move them up to the wav directory.
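      If you want to script the renaming, something like the sketch below works. Note that the prefix format used here (`0001.`) is made up for illustration – check what your version of Wavesurfer actually produced before adapting it:

```shell
# Move Wavesurfer's split files up into wav/ and strip the numeric prefix.
# The prefix format ("0001.") is hypothetical -- check yours first.
mkdir -p wav/splits
touch wav/splits/0001.s1234567_test01.wav wav/splits/0002.s1234567_test02.wav  # demo files

for f in wav/splits/*.wav; do
  base=$(basename "$f")
  mv "$f" "wav/${base#*.}"   # strip everything up to and including the first dot
done
ls wav
```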

      Listen to some or all of the files to check they are OK.

      Create a master label file for the test data

      Although all our test data are now in individual files, for convenience we can use a single MLF (master label file) for the test data. This label file contains the correct transcription of the test data, which we will need when we come to measure the accuracy of the recogniser. You will need to listen to the test wav files in order to write this label file.

      Use your favourite plain text editor, such as Atom, but not a word processor, to create this file.

      It needs to look something like this (note the magic first line, and the full stop (period) marking the end of the labels for each file; make sure there is a newline after the final full stop in the file):

      #!MLF!#
      "*/s1234567_test01.lab"
      zero
      .
      "*/s1234567_test02.lab"
      nine
      .
      

      and save the file as lab/<username>_test.mlf

      The format of the MLF file is as follows: The first line is a header: #!MLF!#. Following this are a series of entries of filenames and the corresponding labels. "*/s1234567_test01.lab" is the filename: make sure you include the quote marks and the * because these are important. After the filename follows a list of labels (there is only one label per file in our case) and then a line with just a full stop, followed by a newline, which denotes the end of the labels for that file.
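      If you’d rather not type the boilerplate by hand, you can generate a skeleton MLF from the test file names and then fill in the transcriptions. A sketch – the FIXME placeholders must be replaced by hand, because you still have to listen to each file to know which digit was spoken:

```shell
# Sketch: generate a skeleton MLF from the test wav filenames.
# Replace each FIXME by listening to the corresponding file.
mkdir -p wav lab
touch wav/s1234567_test01.wav wav/s1234567_test02.wav   # stand-ins for real recordings

{
  echo '#!MLF!#'
  for f in wav/s1234567_test*.wav; do
    base=$(basename "$f" .wav)
    echo "\"*/$base.lab\""
    echo "FIXME"
    echo "."
  done
} > lab/s1234567_test.mlf
cat lab/s1234567_test.mlf
```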

    2. Parameterise the data (optional)
      Our HMMs do not work directly with waveforms, but rather with features extracted from those waveforms. Performing this step yourself is OPTIONAL, but you still need to understand the process.

      The data collection and feature extraction steps are OPTIONAL. Do not execute any of the commands in this section unless you have recorded your own data. However, you still need to understand the feature extraction process, so you can describe it in your report.

      Each waveform file must be parameterised as MFCCs. This is done using the HTK command HCopy. The provided file CONFIG_for_coding specifies the settings used for the MFCC analysis. You should keep these settings for this assignment.

      Do not run HCopy by hand – instead, you should use the make_mfccs script. The script runs HCopy like this:

      HCopy -T 1 -C resources/CONFIG_for_coding wav/file.wav mfcc/file.mfcc
      

      Run the make_mfccs script as soon as you have finished preparing and checking the data – it will copy your data and labels into a shared directory, making it available to the rest of the class.

      If you make any changes to your data (e.g., correcting a label) then you must run the script again.

      Make sure you are in the digit_recogniser subdirectory, then run:

      ./scripts/make_mfccs
      

      Scroll back up through the output from this command to see if there were any errors. If there were, correct the problems (e.g., move files to the correct places) and run the command again.
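      One sanity check you might script yourself is whether every wav file ended up with a matching mfcc file. A sketch (with demo files created here so the loop has something to inspect):

```shell
# Quick check: does every wav file have a matching mfcc file?
# Demo files only -- point the loop at your real directories.
mkdir -p wav mfcc
touch wav/one.wav wav/two.wav mfcc/one.mfcc   # two.mfcc deliberately missing

for f in wav/*.wav; do
  m="mfcc/$(basename "$f" .wav).mfcc"
  [ -f "$m" ] || echo "missing: $m"
done > missing.txt
cat missing.txt
```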

      Sharing your data

      The make_mfccs script copies your data to a shared folder for use by other students. If you discover and fix any errors in your data, you should re-run this script.

    3. Train the acoustic models
      We will use supervised machine learning (including the Baum-Welch algorithm) to train models on labelled data.

      Start here, at this step, to build your first speaker-dependent digit recogniser.

      The training algorithms for HMMs require a model to start with. In HTK this is called the prototype model. Its actual parameter values will be learned during training, but the topology must be specified by you, the engineer. This is a classic situation in machine learning, where the design or choice of model type is made using expert knowledge (or perhaps intuition).

      Select your prototype HMMs

      In this video:

      1. We need a model to start with – in HTK this is called a “prototype model”
      2. A prototype model defines the:
        • dimensionality of the observation vectors
        • type of observation vector (e.g., MFCCs)
        • form of the covariance matrix (e.g., diagonal)
        • number of states
        • parameters (mean and variance) of the Gaussian probability density function (pdf) in each state
        • topology of the model, using a transition matrix in which zero probabilities indicate transitions that are not allowed and never will be, even after training

      You can experiment with different topologies – although for isolated digits, the only sensible thing to vary is the number of states. Models with varying numbers of states are provided in models/proto (remember, a 5 state model in HTK actually has only 3 emitting states). In your later experiments, modify the initialise_models script to try different prototype models. You might even want to create additional prototype models.
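      For orientation, an HTK prototype definition looks roughly like the following sketch. It is abbreviated and illustrative only – the “…” stands for the remaining values, the parameter kind depends on CONFIG_for_coding, and the files in models/proto are the authority on the exact format. Notice how the zeros in the transition matrix (<TransP>) encode the left-to-right topology, and that states 1 and 5 are the non-emitting entry and exit states.

```
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 ... (39 values)
    <Variance> 39
      1.0 1.0 ... (39 values)
  (states 3 and 4 similar)
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```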

      Now train the models

      In this video:

      1. Training consists of two stages that we’ll look at in more detail in modules 9 and 10:
        • the model parameters are initialised using HInit which performs a simple alignment of observations and states (uniform segmentation), followed by Viterbi training
        • then HRest performs Baum-Welch re-estimation
      2. A close look at the initialise_models script

      In the simple scripts that you have been given as a starting point, each of the two stages of training is performed using a separate script. You can run them now:

      ./scripts/initialise_models
      ./scripts/train_models
      

      Note: If you haven’t recorded your own data, running these scripts is going to throw an error. Can you see why from the error message? You can fix this by editing the scripts to use data from the user simonk rather than calling the command `whoami`.

      In your later experiments, you will want to automate things as much as possible. You could combine these two steps into a single script, or call them in sequence from a master script.
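      For example, a wrapper script along these lines (the name run_training is made up for illustration) runs both stages in sequence and stops at the first failure:

```shell
# Create a master script that chains the two training stages.
# The name scripts/run_training is our own invention -- pick any
# meaningful name you like.
mkdir -p scripts
printf '%s\n' \
  '#!/bin/sh' \
  'set -e                      # stop at the first failing stage' \
  './scripts/initialise_models' \
  './scripts/train_models' > scripts/run_training
chmod +x scripts/run_training
cat scripts/run_training
```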

    4. Language model
      Even for isolated digits, we need a language model to compute the term P(W).

      We’ll see in modules 9-10 that a finite state language model is very easy to combine with HMMs. In your report, explain why this is.

      The language model computes the term P(W) – the probability of the word sequence W. We’re actually going to use what appears to be a non-probabilistic model, in the form of a simple grammar. To think of this in probabilistic terms, we can say that it assigns a uniform probability to all allowed word sequences, and zero probability to all other sequences.

      Why do we need a model of word sequences when we are doing isolated digit recognition?

      HTK has tools to help write grammars manually, then convert them into finite state models. A grammar for isolated digits is provided for you in the resources directory, in the file called grammar; have a look at it. The corresponding finite state model is in grammar_as_network.

      The HTK manual contains all the information you need to understand the simple isolated word grammar, and how to extend that to connected digits.

      Later, if you do the connected digits experiment, you will write a new grammar, from which you can automatically create a finite state network version using the HParse tool.
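      For orientation, a grammar for isolated digits in HTK’s grammar notation looks something like this sketch (the provided grammar file may differ in details, e.g. in how silence is handled):

```
$digit = zero | one | two | three | four |
         five | six | seven | eight | nine;
( $digit )
```

      In this notation, angle brackets denote one or more repetitions, so changing the final line to ( < $digit > ) would accept digit sequences of any length – which is the kind of change the connected digits extension needs. HParse compiles a grammar like this into its finite state network form.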

    5. Recognition and Evaluation
      By comparing the recogniser's output with the hand-labelled test data, we can compute the Word Error Rate (WER).

      Now we are ready to run the recogniser on some test data. You should run this on some existing isolated test digits (one digit per wav file) and the MFCC files derived from them. We run the Viterbi decoder using HVite. The script recognise_test_data does this. HVite makes use of your trained HMMs (one HMM per digit from the training step) and your language model. In the first instance, your language model is a very simple grammar that ensures the recogniser outputs just one digit (i.e., word) per recording. Later you can extend this grammar to recognise arbitrary length digit sequences.

      The output from recognise_test_data is stored in the rec directory. Look at the recogniser output and compare it to the correct answers. To calculate the Word Error Rate (WER), we use the results script.
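      For reference, the WER combines three error types measured against the reference labels: substitutions (S), deletions (D) and insertions (I), out of N reference words:

      ```
      WER = 100% × (S + D + I) / N
      ```

      For isolated digits, the grammar forces exactly one output word per file, so insertions and deletions cannot occur, and the WER is simply the percentage of misrecognised digits (e.g. 6 wrong out of 100 test digits gives 6% WER). All three error types come into play if you later move to digit sequences.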

      Again, you’ll need to edit the scripts to use a specific user (e.g. simonk) and the full data directory, rather than `whoami` and the data_upload directory.

  4. Speaker-independent systems
    It's time to make the system more useful for a real application. You should focus on speaker-independent systems for your report.
    1. Data
      You can use data from other students in the class, and from previous years, to conduct your experiments.

      The data from over 400 students is available for you to use in your experiments. You can find it on the PPLS AT Lab servers in the directory: /Volumes/Network/courses/sp/data

      In that directory, you will find a file called info.txt that describes the available data.

      If you recorded your own data this year (which was optional), we can also (eventually) add it to the collection, and to info.txt. But you don’t need to wait: all the data from previous years is already available.

      The format of info.txt is quite simple. If you already have some coding experience, then you’ll probably want to automatically parse this file and automatically pull out subsets of speakers with the characteristics you want for each experiment. If you don’t have coding experience, try doing this in a spreadsheet.
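      As a sketch of the scripted approach, suppose info.txt were comma-separated with columns for username, gender, accent and microphone (this format is an assumption: check the real file’s header before relying on column positions). Then awk can pull out a subset of speakers:

      ```shell
      # Toy info.txt so the example is self-contained; the real file's
      # format may differ -- inspect it first!
      cat > info.txt <<'EOF'
      username,gender,accent,microphone
      alice,F,UK,headset
      bob,M,UK,laptop
      carol,F,US,headset
      EOF

      # Select all female UK English speakers who used a headset microphone:
      awk -F, '$2 == "F" && $3 == "UK" && $4 == "headset" { print $1 }' \
          info.txt > female_uk_headset.txt

      cat female_uk_headset.txt
      ```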

      The data are not perfect and you might want to think about how to detect bad data, so you can exclude those speakers from your experiments.

      Many HTK tools accept the -S option, which lets you input a script file: a text file containing a list of file names (one per line). For example, using a script file will allow you to train your models on data from multiple speakers, and to get test results from multiple speakers. See the course forum for worked examples of both.
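      For example, a script file for training on several speakers can be built with ordinary shell commands. The directory layout below (data/speaker/train/*.mfcc) is an assumption for illustration; adapt the paths to the real data directory:

      ```shell
      # Toy layout so the example runs anywhere:
      mkdir -p data/alice/train data/bob/train
      touch data/alice/train/one.mfcc data/bob/train/two.mfcc

      # One file name per line, which is what HTK's -S option expects:
      for spk in alice bob; do
          ls data/$spk/train/*.mfcc
      done > train.scp

      cat train.scp
      # The resulting train.scp is then passed to the training tools
      # via the -S option (see the provided training scripts).
      ```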

    2. Skills you will need
      To perform your more advanced experiments, it's worth mastering a few new skills. They will help you run more, and better, experiments.

      Experimental design

      You need to think carefully about each experiment, and what it is designed to discover. Do not simply run random, ill-thought-out experiments and then try to make sense of the results. Instead, form a hypothesis, and express it as a falsifiable statement, like this:

      When training models only on speaker A, the word error rate will be lower for speaker A’s test data than for test data from speaker B.

      and then design an experiment to test whether that is true.  In the report you should also take care to explain why you formulated the hypothesis the way you did (i.e., why did you think it might be true?  Do you have evidence from the literature? Observations of other speakers?)

      The key to a good experimental design here is to control any factors that you are not interested in, and only investigate a single factor (e.g., speakers’ gender) per experiment. However, in some cases you might find you don’t have exactly the data you need to completely balance every factor. That is ok – the data definitely isn’t perfect for every possible experiment! Try your best, explain why you made the decisions you did, and say what you could do to improve the design (e.g., what data would you need? how confident are you in your results, given the limitations of the experimental design?)

      Shell scripting

      A few fairly simple shell scripting techniques will help you enormously here. You can completely automate your experiments. This not only makes them more efficient for you, it also makes it easier to reproduce your results.

      There are several resources in the “Intermission” lab tab if you need some help getting started with the shell, particularly this intro to shell scripting. You may also find this bash primer by Jason Fong helpful!
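      For instance, a whole set of conditions can be run from one loop. The script names in the comments are the ones provided with the exercise, but exactly how you parameterise them (by script file, speaker list, etc.) is up to you; this is only a skeleton:

      ```shell
      # Run the same pipeline for each experimental condition and keep a log.
      for condition in female_headset male_headset; do
          echo "=== condition: $condition ==="
          # 1. build train_$condition.scp and test_$condition.scp
          #    (e.g. by filtering info.txt, then listing the MFCC files)
          # 2. train:      ./train_models with train_$condition.scp
          # 3. recognise:  ./recognise_test_data with test_$condition.scp
          # 4. score:      ./results > results_$condition.txt
      done > experiments.log

      cat experiments.log
      ```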

       

    3. Experimental design suggestions
      These example experiments are just to get you thinking. You should devise interesting experiments of your own.

      You should focus on experiments using speaker-independent systems for your report.

      To see why, consider using the already-trained models from the speaker-dependent experiment.

      If you use models trained only on your own speech to recognise the Test set of another speaker, what Word Error Rate do you expect?

      The WER is probably going to be high unless the other speaker’s recordings sound pretty much like yours. So, this isn’t going to tell you anything very informative about what makes ASR better or worse. Instead, your main experiments should investigate speaker-independent systems trained and tested on larger numbers of speakers, so that your findings potentially have wider applicability.

      When designing your experiments, the testing speakers must be distinct from the training speakers. You also need to control all factors that are not of interest (including accent, gender, microphone type, and amount of training data).

      Below are some possible experiment designs for experiments looking at the effect of gender in training and testing the digit recogniser. Some of these designs are better than others. Can you work out the pros and cons of each design?

      1. The effect of gender, with simplistic control over accent and microphone
        • Training set: the training data of 20 female UK English speakers using headset microphones
        • Test set A: the test data of 20 female UK English speakers not in the training set, also using headset microphones
        • Test set B: the test data of 20 male UK English speakers (obviously not in the training set), also using headset microphones
      2. The effect of gender, with more sophisticated control over accent and microphone (version 1)
        • Training set A: the training data of 50 female speakers, with a mixture of accents and microphones
        • Training set B: the training data of 50 male speakers, with a mixture of accents and microphones in the same proportions as training set A
        • Test set: the test data of 50 female speakers not in training set A, with a mixture of accents and microphones in the same proportions as training set A
      3. The effect of gender, with more sophisticated control over accent and microphone (version 2)
        • Training set: the training data of 50 female speakers with a mixture of accents and microphones
        • Test set A: the test data of 50 female speakers not in the training set, with a mixture of accents and microphones in the same proportions as  the training set
        • Test set B: the test data of 50 male speakers, with a mixture of accents and microphones in the same proportions as the training set

      In general, aim for more sophisticated designs that allow you to use more data, as this potentially allows you to generalise your findings better. However, there are always tradeoffs in deciding on an experimental design. One issue might be that you simply cannot find the same proportions for the factors that you want to control. You’ll need to weigh the pros and cons of a not-quite-perfectly-balanced design that covers more test speakers per condition of interest (e.g. gender in the examples above) against a perfectly balanced design that has only a few test speakers per condition. If your test set only has 5 speakers per condition, do you think the results would be very different if you picked 5 other speakers for that condition?

      Some possible experiments to try for this data set

      What effect does microphone type have?

      Design an experiment to discover whether the microphone type is important. This might involve discovering if some microphones give lower Word Error Rate than others, or finding out the effect of mismatches between the Training and Test sets. Remember to control all the other factors.

      You can perform equivalent experiments to investigate the gender and accent factors too.

      What effect does the amount of training data have?

      In machine learning, it’s often said that more training data is better. But is that always the case? Design some experiments to explore this. Include cases where the Training and Test sets are well-matched (e.g., in gender and/or accent, etc) and cases where there is mismatch. What is more important: matched training data, or just more data?

      These questions are very important in commercial systems: it costs a lot of money to obtain the training data, so we want to collect the most useful data we can.

      There are many other experiments you can think about!  You don’t have to restrict yourselves to the ones listed here! 

    Use the forums to discuss your ideas for more advanced experiments. You’ll need to log in to access the forum for this exercise.

  5. Digit sequences (optional)
    This is not as hard as you might think. If you attempt this, restrict your experiments to a speaker-dependent system using either your own data (if you collected some) or that from a single existing speaker in the database.

    This is a bit harder than isolated digits, but not much. The key is to realise that there may be silence between the words, so you will need an acoustic model (i.e., an HMM) for that. Hint: silence was labelled as junk in the data collection, so you could consider that as another word.

    You’ll also need to write a new language model in the form of an HTK grammar (hint: see the HTK manual) and then convert it to the finite-state format that HVite needs, like this:

    $ HParse resources/digit_sequence_grammar resources/digit_sequence_grammar_as_network
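
    As a starting point, a connected-digit grammar might look something like the sketch below. This follows the notation in the HTK manual (angle brackets mean one or more repetitions; square brackets mean optional); the word names, including JUNK for silence, are assumptions that must match your own labels and dictionary:

    ```
    $digit = ONE | TWO | THREE | FOUR | FIVE |
             SIX | SEVEN | EIGHT | NINE | ZERO;

    ( [JUNK] < $digit [JUNK] > )
    ```

    This allows optional silence at the start, then one or more digits, each optionally followed by silence.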
    

    Evaluating the output of the recogniser is no longer so easy – it might insert or delete digits, as well as substitute incorrect digits. You can use your existing training data, but you’ll need to make a different test set, containing digit sequences, and a corresponding label file.

    How to do this is described in the HTK manual (3.4.1), Chapter 12: Networks, Dictionaries and Language Models. Specifically, have a look at section 12.3.

    You can also look at the HTK manual for HResults – there are some useful options for showing more detail of the scoring procedure, such as the -t flag.

    Remember that you don’t want to count junk labels when scoring! There are helpful posts about this on the course forum.

  6. Writing up
    You're going to write up a lab report, that describes how HMM-based ASR works, and reports your experiments.
    1. Lab report: Structure and Tips
      Write up your findings from the exercise in a lab report, to show that you can make connections between theory and practice. This page describes the required lab report structure.

      You should write a lab report about this speech recognition practical. Keep it concise and to the point, but make sure to explain what the digit recogniser is in theory and how it is implemented and used in practice. You should also report your experimental work, clearly explaining your experimental design and your results, and link your results to your research questions and hypotheses through your conclusions.

      Report Structure

      Title/Author/Word count info

      Make sure you include a title, your  exam number, and your word count at the beginning of your report.

      • Choose an informative title that tells the reader a bit about what the report is about. For example, “SP Assignment 2” is not very informative.
      • State your exam number (on your student card, starts with a B)
      • State the word count (see formatting guidance)

      The following gives an outline of the sections and basic information you need to include. We are, as always, following the University of Edinburgh common marking scheme. So, to get highest marks, start with the basics and  try to add some extra analysis, contextualisation or critique beyond the basic information.

      1 Introduction

      [5 marks]

      This section should give an overview of your report.

      • You should briefly introduce:
        • The task you are focused on (i.e.,  digit recognition as a whole word ASR problem) and the goals of your study
        • The motivation for the specific experiments that you focus on later in the report
      • You can also highlight any key findings you made and/or implications you have drawn from your experiments

      2 Data

      [5 marks]

      Describe the dataset you use in your experiments:

      • Who were the speakers?
      • When did they record the data?
      • How did they record the data? What were the recording conditions?
      • What other metadata is available about the recordings?

      3 Method

      In this section, you should describe the digit recogniser itself, following the subsection headings listed below. You should explain what its components are, how it is trained, how it is used for speech recognition, and how you will evaluate its performance. You should contrast ASR theory with how this specific assignment setup does ASR in practice.

      3.1 Feature Extraction

      [10 marks]

      This section relates to Module 8: Speech Recognition – Feature Engineering.

      • What are MFCCs? What do they capture?
      • Why are MFCCs used in this digit recogniser setup?
      • How were the MFCCs extracted?

      3.2 Training

      [10 marks]

      This section relates to Module 9: Speech Recognition – the Hidden Markov Model, and Module 10: Speech Recognition – Connected Speech and HMM training.

      • How are HMMs used in the digit recogniser?
      • How are the HMMs trained?
        • What parameters are “learned” during training?
        • What methods are used for training?
          • There are two scripts used for training the digit recogniser. What do they do differently?
          • How and when are uniform segmentation, Viterbi training, and the Baum-Welch algorithm used?
          • What’s the relationship between the Viterbi algorithm and Viterbi training?
        • Why do we have several training steps?
        • What do each of the steps do?
        • How does each step make use of the extracted features?

      3.3 Recognition

      [10 marks] 

      This section relates to Module 9: Speech Recognition – the Hidden Markov Model.

      • What methods/components are used for digit recognition?
        • How is the Viterbi algorithm related to recognition?
        • What is token passing and how is it used?
      • What’s the role of the acoustic model and how does this relate to HMM training for this specific digit recogniser setup?
      • What’s the role of the language model? What is the specific language model used for the assignment digit recogniser? What’s the difference in language model between the single digit recogniser and the digit sequence recogniser?
      • How are the acoustic model and language model combined to perform recognition?

      3.4 Evaluation

      [5 marks]

      This section is related to Module 10 (see readings).

      • What metric do we use to evaluate how well a specific digit recogniser setup did? How is it calculated?
      • You can also note any other considerations around evaluation of the digit recogniser in this section.
      • You aren’t required to do statistical tests for this assignment, but if you have experience in this area you can apply them in this assignment if you want to and if it makes your argument stronger (you won’t get marks for simply applying a lot of tests without reasoning). If you do use statistical tests, explain what you are using and why.

      4 Experiments

      Having established the data you will use and the digit recogniser (as method in theory and practice), it’s now time to write up your specific experiments.

      Use your time in the labs to discuss your experiments and troubleshoot your scripts with the tutors.

      You should use a separate subsection to describe each of your main experiments (these may consist of further sub-experiments addressing the section’s research question/hypothesis).

      Your mark will reflect your explanation of the research/questions, experimental design, presentation of results, and conclusions across all your experiments. So within each subsection, you should make sure the following are clear:

      • [10 marks] Research questions and hypotheses
        • What are you trying to find out from your experiment?
        • What hypotheses can you make based on your knowledge of speech and ASR?
        • You should explain what evidence you have for your hypotheses. This can come from citations, but also reasoning based on what we have covered in class. Use citations and references where you can to strengthen your argument.
      • [10 marks] Experimental design
        • Explain your experiment setup and how it relates to your hypotheses
        • What specific data did you use?
        • If you alter the model, what did you change?
        • In general, aim for reproducibility so that another student could implement the same experiment. You don’t have to include detailed speaker lists, but do describe how you selected speakers for training and testing sets
      • [10 marks] Results
        • Present your results clearly with tables and/or graphs
          • Aim for readability with tables and figures. A reader should be able to get the idea of whether the results support your hypothesis from a quick glance. That means using appropriate captions, labelling axes and giving table rows and columns informative names.
      • [10 marks] Conclusions
        • Discuss whether your results support your hypothesis (or not)
        • Discuss limitations of your experiments and how certain you are of your findings.
          • Do you think your results would generalise to other speakers/recordings with the same characteristics? How about speakers/recordings with some different characteristics?
          • Are there other experiments you could do (if you had more time) that would help understand whether the results would generalise?
        • Extra: if you can connect your results to other findings in the literature: what does your experiment contribute to the bigger picture on ASR performance?

      5 Discussion and Overall Conclusion

      [5 marks]

      In this section you should briefly summarise and discuss your findings across your experiments and their implications: what do you know now about ASR that you didn’t know before?

      • What were your main findings?
      • Discuss the implications of your findings: Can you make any connections between the experiments in terms of what factors affect the digit recogniser performance the most?
      • Are there any general conclusions you can make about the digit recogniser setup? If not, could you modify your experiments so that you could make more general conclusions?

      Range of experiments

      [10 marks]

      In addition to the report sections listed above, you will also receive some marks for the range of experiments that you write up (you don’t need to write a specific section for this, it should be evidenced from section 4). At a minimum you should report on at least 2 experiments to pass.  Historically, most people can do well writing up 3-4 experiments.

      You will get more marks for exploring more aspects of the digit recogniser, but you should also ensure your experiments actually help you get stronger conclusions about your research questions/hypotheses. So, it can be beneficial to design follow-up experiments that shed more light on a specific research question. This helps us to see your depth of understanding.

      In general, exploration of 3 research questions done in depth with well-considered experimental designs (possibly reporting on sub-experiments) will get you a better mark than many unconnected experiments with simplistic experimental designs.

      You don’t necessarily have to design your experiments such that they all relate to one another (i.e., build towards one big research question for the whole report). But if you can do this, and so support any overall conclusions you make, that would be seen as a good thing!

      More writing advice

      What exactly is meant by “lab report”?

      It is not a discursive essay. It is also not merely documentation of commands that you ran and what output you got. It is a factual report of what you did in the lab that demonstrates what you learned and how you can relate that to the theory from lectures. You will get marks for:

      • completing all parts of the practical, and demonstrating this in the report
      • a clear demonstration that you understand what each part of the digit recogniser does, how this relates to ASR theory and how this relates to the HTK tools we use in practice
      • clear and concise writing, and effective use of diagrams, tables and graphs

      How much background material should there be?

      Do not spend too long simply restating material from lectures or textbooks without telling the reader why you are doing this.

      Do provide enough background to demonstrate your understanding of the theory, and to support your explanations of how HTK works and your experiments. Use specific and carefully chosen citations. Cite textbooks and papers. You will get more marks if you cite better and more varied sources (e.g., going beyond the essential course readings). If you only cite from the main textbook, this will not get you top marks. Avoid citing lecture slides or videos, unless you really cannot find any other source (which is unlikely). Make sure everything you cite is correctly listed in the bibliography.

      Writing style

      The writing style of the report should be similar to that of a journal paper. Don’t list every command you typed! You do not need to include your shell scripts. Use diagrams to illustrate your report, and tables and/or graphs to summarise your results. Do not include any verbatim output copied from the Terminal: you will not receive any marks for this.

      We won’t take into consideration any appendices.

      Other tips

      You do not need to list the individual speakers in each of your data sets, but do carefully describe the data (e.g., “20 male native speakers of Scottish English using laptop microphones”). You might use tables to present this in a compact form, and perhaps gives short names or acronyms to each set, such as “20-male-SC-laptop”.

    2. Formatting instructions
      Specification of word limits and other rules that you must follow, plus the structured marking scheme.

      Please be sure to check general turnitin submission guidance on the PPLS hub

      You must:

      • submit a single document in PDF format.
        • The filename must be in the format “exam number_wordcount.pdf” (e.g., “B012345_2864.pdf”)
        • When submitting to Learn, the electronic submission title must be in the format “exam number_wordcount” and nothing else (e.g., “B012345_2864”)
      • state your exam number at the top of every page of the document
      • state the word count below the title on the first page in the document (e.g., “word count: 2864”)
      • use a line spacing of 1.5 and a minimum font size of 11pt (and that applies to all text, including within figures, tables, and captions)
      • If you work with a partner, you should also note their exam number, either where you list your own exam number or in a footnote in the introduction.

      Marking is strictly anonymous. Do not include your name or your student number – only provide your exam number!

      Structure

      Length limits

      • Word limit: 3000 words maximum, excluding bibliography and text within figures and tables, but including all other text (such as headings, footnotes, captions, and examples). Numerical data does not count as text.
        • The word limit is a hard limit: there is no +10% allowance.
        • Text in figures and tables doesn’t contribute to the word count but, again, use this wisely! Don’t just shove things into tables because they don’t fit in your text!
        • Note: You should assume that the markers will only read up to the word limit (i.e. 3000 words).  We have had to enforce this because some people have submitted assignments that were hugely over the word limit and it’s not fair on the markers to ask them to read so much over the word count.
      • Page limit: no limit, but avoid blank pages
      • Figures & tables: no limit on number

      Sections and headings

      You must use the following structure and section numbering for your report. It corresponds to the structured marking scheme and will make your marker’s life so much easier!

      • 1 Introduction
      • 2 Data
      • 3 Methods
        • 3.1 Feature extraction
        • 3.2 Training
        • 3.3 Recognition
        • 3.4 Evaluation
      • 4 Experiments
        • 4.1 [Insert your first experiment title here]
        • 4.2 [Insert your second experiment title here]
        • 4.3 etc, for as many experiments as you wish
      • 5 Discussion and overall conclusion

      You should focus on the speaker-independent system in Section 3 to illustrate the differences between theory and practice, as this is the system you will use for your experiments.

      You don’t have to record and label your own data this year. Instead, you should describe the dataset you are using and how it was collected. You still need to describe what data are required, how they need to be labelled and how features were extracted for section 2 and 3.1. 

      You are advised to structure each experimental section (4.1, 4.2, etc.) by introducing and motivating a hypothesis, describing your experimental design, reporting the results, and drawing a conclusion (see lab report: structure and tips). You should aim to write up 3-4 main experiments (these main experiments may be broken down into sub-experiments). You may be able to fit more in, but don’t forget you need to motivate each experiment and analyse the results. It’s fine to design your experiments based on the suggestions given in the instructions (e.g. gender, microphone type, amount of training data), but you may find there are more interesting things to explore. Experimental designs that allow you to draw conclusions across different experiments are not required, but they are definitely appreciated.

      Figure, graphs and tables

      You should ensure that figures and graphs are large enough to read easily and are of high-quality (with a very strong preference for vector graphics, and failing that high-resolution images). You are strongly advised to draw your own figures which will generally attract a higher mark than a figure quoted from another source.

      There is no page limit, so there is no reason to have very small figures.

      Your work may be marked electronically or we may print hardcopies on A4 paper, so it must be legible in both formats. In particular, do not assume the markers can “zoom in” to make the figures larger.

      References

      We generally prefer APA style references (i.e. Author, Year). Other citation styles are also ok as long as they are implemented correctly and consistently.

      References don’t count towards the 3000 word limit.

      Declaration of use of AI

      After your main report, please briefly describe whether you used any external Artificial Intelligence (AI) related tools in doing this assignment. This includes grammar checking and the use of generative-AI chat apps to investigate the topic. If you used tools based on large language models (e.g. ChatGPT), describe the prompts that you used in doing this work. If you did not use these tools, you can simply write “none”.

      Text in this section does not count towards the 3000 word limit.

      Remember, you should not use generative AI to generate text for your submitted report.  Please see the course AI policy here:  Speech Processing AI policy.  Please be honest!  You will not be penalised for use of these tools as long as you stick to this policy.

      Marking scheme

      You are strongly advised to read the marking scheme because it will help you focus your effort and decide how much to write in each section of your report.

      More details about what to include in the lab report can be found on the lab report: structure and tips page.

      Here is the structured marking scheme for this assignment.

       

    You should also check out the writing tips provided for the Festival exercise – they apply to this one too