Text-to-Speech with Festival

Speech Processing Assignment 1

Overview

An overview of what you will do for this assignment

Due date (2023-24 session): Monday 6 November 2023, 12 noon. Submit your report on Learn.

What you will do

In this assignment, you will investigate the generation of synthetic speech from text using the Festival Text-To-Speech (TTS) toolkit. You will examine errors in the generated speech and analyse them in terms of where exactly they arise in the TTS pipeline, what sorts of problems they cause, and what the potential ramifications and solutions might be.

The lab materials will show you how to load a specific voice, generate speech samples, and inspect what steps were used to generate speech from a specific text input.

You will be focusing on errors generated by a specific voice configuration: cstr_edi_awb_arctic_multisyn. Please note that this voice is different from the publicly available awb setup that you might find online (though it uses the same database of recordings). The voice setup that you need to use for the assignment is only available to people in this class, and it has its own specific issues (you’ll soon see it generates many, many errors!).
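To give you a flavour of what working with Festival looks like (the lab pages have the definitive instructions), a session will look roughly like the sketch below. Note that the exact voice-loading command here is an assumption based on Festival's usual (voice_NAME) convention:

    $ festival   # start the Festival interpreter from a lab machine shell
    festival> (voice_cstr_edi_awb_arctic_multisyn)  ; load the assignment voice (assumed command name)
    festival> (SayText "Hello world.")              ; synthesise and play a sentence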

You are not expected to look into the actual Festival implementation to complete this assignment. You are certainly not expected to learn Lisp/Scheme (the programming language Festival is built on)!

What you will submit

Your assignment submission will be a structured report discussing specific types of errors, as well as discussion/reflection on the TTS pipeline as a whole. The word limit for the report is 1500 words.

Why you are doing this

Festival is a widely used concatenative TTS toolkit that implements the type of TTS pipeline discussed in the class videos, lectures and readings. From the class materials, you should see that each step in a specific TTS voice setup depends on several design, data, and algorithmic choices. Our goal in setting this assignment is to give you concrete practical experience of how concatenative synthesis works, and of how specific choices in a TTS voice setup can cause issues in the speech generated using this method.

Although concatenative synthesis is no longer the state of the art in TTS, you will still hear many such systems deployed in the real world. Moreover, the issues you will explore in trying to map from text to speech are still relevant to understanding newer TTS approaches – why do some TTS voices sound better than others, even when they are generated using much more sophisticated methods?

Understanding the types of errors that arise in different parts of a TTS pipeline will also help you to consolidate what you have learned in modules 1-4: that is, what the properties of human speech are (in terms of articulation and acoustics), and how we might specify and generate speech on a computer.

Practical Tasks

You should work through the practical tasks below, which are expanded on in separate web pages on Speech Zone. We recommend you use the version of Festival installed on the AT PPLS computer lab machines. You can access this in the lab or via the remote desktop service.

  1. Get started with Festival:
    • Use the Festival synthesis system to generate speech from text.
    • Follow the instructions to generate speech using the custom unit selection voice cstr_edi_awb_arctic_multisyn, which we’ll refer to as awb from now on.
    • Generate and save .wav files of the utterances you synthesise (see the example session after this list).
  2. Step through the TTS pipeline
    • View all the steps used in generating speech for this specific voice, awb.
    • Examine how the linguistic specification derived from the input text is stored in terms of relations (see the inspection commands after this list).
    • Compare this to the theoretical TTS pipeline discussed in the course materials (i.e., videos, lectures, readings).
    • awb is missing some steps from the theoretical pipeline. What are they? What is the result of the way this voice was set up in practice?
    • You may find it useful to compare the linguistic specification (i.e., relations) derived and used in the awb voice to that of the diphone synthesis voice voice_kal_diphone.
  3. Find mistakes
    • Find examples of errors that the awb voice makes in each of the following categories.
      1. Text normalisation
      2. POS tagging/homographs
      3. Pronunciation
      4. Phrase break detection
      5. Waveform generation
    • You should focus on (at least) one error example per category for your write-up.
    • Explain what exactly the error is and how it arises.
    • Tip: it will generally be more efficient to think about which errors are likely to occur at the TTS pipeline steps associated with each category above and to design specific test examples, rather than feeding in randomly chosen text (although synthesising existing texts is a reasonable starting point if you’re just trying to get a feel for the system). The sketch after this list shows some commands you might use when probing for errors.
  4. Reflect on the implications of your error analysis
    • While you are exploring different error types, consider what the implications of different categories of errors are. Are some more severe with respect to intelligibility? Or with respect to naturalness?
    • Similarly, consider what potential solutions to the errors could be. Could they be fixed within the current setup or would the solution require new, external resources? How generalisable would a specific solution be?
  5. Write up your findings
    • You should write up one error for each category in your report, as well as a discussion of the implications of the errors (see the write-up instructions below).
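For task 1, SayText returns the utterance object it built, so you can keep hold of it and write the waveform to disk. A minimal sketch using standard Festival commands (the example sentence and filename are just illustrations):

    festival> (set! utt (SayText "The quick brown fox jumps over the lazy dog."))
    festival> (utt.save.wave utt "fox.wav" 'riff)   ; save the synthesised waveform as a RIFF .wav file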
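For task 2, you can build an utterance, run the voice's pipeline on it, and then inspect the relations it created. A sketch using Festival's standard utterance-inspection commands:

    festival> (set! utt (Utterance Text "Hello world."))  ; create an utterance from raw text
    festival> (utt.synth utt)                             ; run the voice's TTS pipeline on it
    festival> (utt.relationnames utt)                     ; list the relations this voice created
    festival> (utt.relation.print utt 'Word)              ; print the contents of one relation, e.g. Word
    festival> (utt.relation.print utt 'Segment)           ; ...or Segment, and so on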
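For task 3, one way to probe for errors is to design inputs that stress a particular pipeline step, and to query the lexicon directly when you suspect a pronunciation problem. The inputs below are purely illustrative examples, not required test cases:

    festival> (SayText "Dr. Smith lives at 101 Smith Dr.")   ; probe text normalisation (abbreviations, digits)
    festival> (SayText "She will read what he read.")        ; probe POS tagging/homographs
    festival> (lex.lookup "edinburgh" nil)                   ; check the lexicon entry used for a word's pronunciation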

Please read carefully through the pages linked above for further guidance on what exactly you should do (note: there’s no link for point 4 – just think through the questions outlined above).

How to submit your assignment

You should submit a single written document in PDF format for this assignment – see the pages on Writing Up for detailed instructions on what to include and how your report should be formatted.

You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).

You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version!  If in doubt, please ask the teaching office.

Extensions and Late submissions

Extensions and special circumstances are handled by a centralised university system. For more details, please see: Extensions and Special Circumstance Guidance

This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this, please get in touch with the appropriate teaching office (UG, PG) and/or your student adviser.

How will this assignment be marked?

For this assignment, we are looking to assess your understanding of how (concatenative) TTS works. The report should not be a discursive essay, nor merely documentation of what you typed into Festival and what output you got. We want to see what you’ve learned from working with this specific voice setup in Festival and how you can relate that to the theory from lectures.

You will get marks for:

  • Completing all parts of the practical, and demonstrating this in the report
  • Providing interesting errors made by Festival, with correct analysis of the type of error and of which module made the error.
  • Providing an analysis of the severity and implications of different types of errors, and how they can affect the usefulness of a TTS voice more generally.

If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

You can get higher marks by adding more depth to your analysis in each section (see the write-up guidance).

You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!

Working with other people (and machines)

It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.

Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).

If you want to try using ChatGPT to help you learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also, the voice we are using is different to the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment.

How to get help

The first port of call for help is the course labs. The labs for modules 5 and 6 (Weeks 6 and 7) will be devoted to this assignment. It’s generally easier to troubleshoot issues with Festival in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!).

You can also ask questions on the Speech Zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class. You will need to be logged in to Speech Zone in order to see the assignment-specific forums.

However, as ever, you can always come to office hours, book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.