Introduction

An overview of this assignment

Speech Processing Assignment 2

Due date (2023-24 session): Thursday 7 December 2023, 12 noon. Submit your report on Learn.

What you will do

The goal of this exercise is to learn how an HMM-based Automatic Speech Recognition (ASR) system works, by building one. You will go through the entire process, starting with data wrangling, moving on to model training, and finishing with calculating the Word Error Rate (WER) of the system(s) you have built.
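To make the idea of WER concrete, here is a minimal sketch in Python. It assumes the usual definition, WER = (substitutions + deletions + insertions) / number of reference words, and counts errors with a plain minimum edit-distance alignment. The function names and the example transcriptions are made up for illustration; this is not the scoring tool you will use in the assignment, and its alignment may differ in detail from that tool's.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for one utterance,
    taken from a minimum edit-distance alignment of two word lists."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (total_errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)  # everything in the reference was deleted
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)  # everything in the hypothesis was inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = t  # correct word, no extra cost
            else:
                diag = (t[0] + 1, t[1] + 1, t[2], t[3])  # substitution
            t = table[i - 1][j]
            deletion = (t[0] + 1, t[1], t[2] + 1, t[3])
            t = table[i][j - 1]
            insertion = (t[0] + 1, t[1], t[2], t[3] + 1)
            table[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    return table[n][m][1:]


def word_error_rate(pairs):
    """WER over a list of (reference_words, hypothesis_words) pairs."""
    errors = 0
    reference_words = 0
    for reference, hypothesis in pairs:
        errors += sum(edit_counts(reference, hypothesis))
        reference_words += len(reference)
    return errors / reference_words


# Hypothetical transcriptions: one deletion ("two") and one substitution.
pairs = [
    (["one", "two", "three"], ["one", "three"]),
    (["four"], ["five"]),
]
print(word_error_rate(pairs))  # 0.5, i.e. 2 errors over 4 reference words
```

For isolated digit recognition each utterance contains a single word, so under this definition the WER reduces to the proportion of test utterances that were recognised incorrectly.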

Initially, you’ll build the simplest possible speaker-dependent system using data from just one person. In the first instance, this will be from a speaker in our existing database, and later you can (optionally) try building one based on your own voice. In doing this, you will get a basic understanding of each step in the process.

Then you’ll build speaker-independent systems that would actually have real-world applications. In this part of the exercise, you will design and execute a number of your own experiments, to deepen your understanding of ASR generally, and more specifically of what factors affect the WER.

While this assignment will primarily focus on isolated digit recognition, you can further your understanding by extending your system to recognise digit sequences by writing a new language model.

What you will submit

Your assignment submission will be a lab report that introduces your experiments, establishes background on HMM-based ASR, sets out hypotheses, and tests these hypotheses with experiments. You should also discuss the overall findings and implications of your results and draw appropriate conclusions. See write up instructions for more details.

The word limit for the report is 3000 words.

Practical Tasks

Get started with the assignment materials and access to HTK on the PPLS AT lab computers (watch the overview video!)
Build a speaker-dependent digit recognition system using a series of bash shell scripts
- See lab materials from the “Intermission” (week 7) for pointers on shell scripting
Build a speaker-independent digit recognition system, extending the scripts you used for the speaker-dependent system
Develop hypotheses about what training and testing configurations may improve or harm ASR performance for speaker-independent ASR
- These could be due to issues with the data or the model setup, or both.
- There are some experiments suggested in the assignment webpages, but you don’t have to stick to those!
Design experiments to test these hypotheses using an existing ‘messy’ data set of previous students’ recordings
Build and test a digit sequence recogniser (optional/bonus marks)
Write up a lab report of your experiments

We will go over the theory behind HMM-based ASR in modules 7-10 of the course. So, you may find yourself running scripts for components you don’t fully understand yet when you start the assignment. This is ok! In the first labs, the main focus is on using scripting to build an ASR pipeline. We’ll go over more details through the next weeks.

How to submit your assignment

You should submit a single written document in PDF format for this assignment – see the pages on writing up for detailed instructions on what to include and how your report should be formatted.

Please be sure to read submission guidance on the PPLS hub.

You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).

You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version! If in doubt, please ask the teaching office and again, check the guidance on the PPLS hub.

Extensions and Late submissions

Extensions and special circumstances are handled by a centralised university system. Please see more details here: Extensions and Special Circumstance Guidance. This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this please get in touch with the appropriate teaching office (UG, PG) and/or your student adviser.

How will this assignment be marked?

For this assignment, we are looking to assess your understanding of how HMM-based ASR works. The report should not be a discursive essay, nor merely documentation of the HTK functions that were used.

You will get marks for:

Completing all parts of the practical, and demonstrating this by completing all requested sections of the report
Writing a clear and concise background section (using figures as appropriate) that shows that you understand the differences between HMM-based ASR in theory and the actual implementation used in practice for this assignment.
Establishing clear and testable hypotheses, explaining why (or why not) you think these hypotheses will hold with specific system configurations and datasets.
- You should use citations to support your claims from the course materials (and potentially your own literature review).
- You should draw on what you’ve learned about phonetics and signal processing to understand why different data subsets may produce different results.
Designing and implementing experiments that test those hypotheses, making the best use of the (limited, messy) data available to you.
Clearly presenting the results of the experiments (using tables, figures and text descriptions) and linking those results back to your hypotheses.
Discussing the implications and limitations of your experiments.
Drawing conclusions with an appropriate level of certainty. Can you be sure that your conclusions are correct? Do you need to do more experiments?
You will also be marked based on the overall coherence and the strength of your argumentation – are your claims well supported by the results and citations where appropriate?

If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

You can get higher marks by adding some more depth to your analysis in each section, but particularly the experimental sections. Can you do additional experiments that shed further light on your results and provide further evidence for (or against) your hypotheses?

As usual, we have a positive marking scheme. You will get marks for doing things correctly. If you write something incorrect, it won’t count towards your mark total but you won’t be marked down (in effect, it will be ignored). You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!

Tips for success

Read all the instructions right the way through before you start. You are given all the scripts for the basic part of the practical, so your work is mainly extending existing scripts, preparing data and running experiments. For the speaker-independent and digit sequence recognisers, relatively small changes will be required, starting from copies of the basic scripts. Don’t underestimate how long data cleaning/wrangling can take for your actual experiments.

We suggest that you use a log book to take notes: record each step you perform, so that you have a record for writing your report. Record the experimental conditions and the results carefully. If you modify scripts, record what you did (preferably, make a new script with a meaningful name each time). Be careful and methodical – in the later stages, pay particular attention to which speakers you are training and testing on. Make sure that every result is easily reproducible, simply by running the appropriate script(s).
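One possible way to keep such a log book is sketched below in Python, as an alternative or complement to handwritten notes. It appends one row per experiment to a CSV file and records any speakers that appear in both the training and test sets, since that overlap is easy to introduce by accident. Everything here is an assumption made for illustration: the function names, the column names, the speaker IDs such as `s01`, and the WER figure, which is a placeholder rather than a real result.

```python
import csv
import tempfile
from pathlib import Path


def check_speaker_split(train_speakers, test_speakers):
    """Return the sorted list of speakers that appear in both sets."""
    return sorted(set(train_speakers) & set(test_speakers))


def log_experiment(log_path, name, train_speakers, test_speakers,
                   wer_percent, notes=""):
    """Append one row describing an experiment to a CSV log book."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    overlap = check_speaker_split(train_speakers, test_speakers)
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header only once, when the log is first created.
            writer.writerow(["experiment", "train_speakers", "test_speakers",
                             "overlap", "wer_percent", "notes"])
        writer.writerow([name,
                         " ".join(sorted(train_speakers)),
                         " ".join(sorted(test_speakers)),
                         " ".join(overlap),
                         wer_percent,
                         notes])
    return overlap


# Usage with hypothetical speaker IDs and a placeholder WER value.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "logbook.csv"
    shared = log_experiment(log_file, "example_run",
                            train_speakers=["s01", "s02"],
                            test_speakers=["s02", "s03"],
                            wer_percent=0.0,
                            notes="placeholder entry, not a real result")
    print("Speakers in both train and test:", shared)
    print(log_file.read_text())
```

An overlap is not always a mistake (a speaker-dependent system trains and tests on the same person), which is why the sketch records the overlap rather than refusing to log the run.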

Working with other students (and machines)

We encourage you to work with each other in the lab, to solve problems you encounter during the exercise – this is a positive thing for both the person doing the helping and the person being helped. Do not cross the line between positive, helpful interaction and cheating.

Note: It’s ok for you to work in pairs and help each other with coding but you must write up your reports independently.

If you do work with someone else, either in scripting or in designing experiments, you must note that (including their exam number) in the introduction of your report.

It’s good to recognize that collaboration is an essential part of speech and language technology nowadays. Nobody does everything by themselves. However, if you let someone else do a lot of the technical work now, you may find it puts you back a step later, especially if you are intending to do more study or work in speech and language technology.

You can only really learn to do this sort of practical work by doing it.

If you do work with others, spread the load between you in terms of coding so you all get a chance to practice!

These things are definitely ok and, in fact, are encouraged:

teaching someone how to write shell scripts, in general terms
helping to spot a minor error in someone else’s script
explaining what an HTK program does
discussion of the theory

There’s an old saying that you never really understand something until you teach it. Try that with your fellow students.

These are some of the things you should not be doing:

writing a complete shell script for someone else when they have no idea how it works
copying parts of someone else’s script without understanding what it does
presenting results in your report that you did not help to obtain yourself

Again, you must write up your reports independently. So, the following are not ok:

helping someone with the content of the written report
reading anyone else’s written report before submission
showing your written report to anyone else before submission
working with someone else, but not mentioning that you did in your report

Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you). If you want to try to use ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area.

How to get help

The first port of call for help is the course labs. The labs for the rest of the semester (modules 7-10) will be devoted to this assignment. It’s generally easier to troubleshoot issues with HTK and shell scripting in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!). Teaching staff will give some overview/demonstrations at the beginning of the lab, so try to come on time.

You can also ask questions on the speech zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class. You will need to be logged into speech zone in order to see the assignment specific forums.

You can also come to office hours or book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.

HTK essentials
HTK (the Hidden Markov Model Toolkit) is a widely used toolkit in automatic speech recognition research.