Text-to-Speech with Festival
Speech Processing Assignment 1
Due by 12:00 noon on Monday, 4 November 2024. Submit your report on Learn.
What you will do
In this assignment, you will investigate the generation of synthetic speech from text using the Festival Text-To-Speech (TTS) toolkit. You will identify errors in the generated speech and analyse them: where exactly do they arise in the TTS pipeline, what sorts of problems do they cause, and what are the ramifications and potential solutions?
The lab materials will show you how to load a specific voice, generate speech samples, and inspect what steps were used to generate speech from a specific text input.
You will be focusing on errors generated by a specific voice configuration: cstr_edi_awb_arctic_multisyn. Please note that this voice is different from the publicly available awb setup that you might find online (though it uses the same database of recordings). The voice setup that you need to use for the assignment is only available to people in this class, and it has its own specific issues (you'll soon see it generates many, many errors!).
You are not expected to look into the implementation of Festival to complete this assignment. You are certainly not expected to learn Scheme (the Lisp-like programming language used by the Festival interface)! Don’t focus on Festival as a piece of software, but rather the process of generating speech from text.
What you will submit
Your assignment submission will be a structured report discussing specific types of errors, as well as discussion/reflection on the TTS pipeline as a whole. The word limit for the report is 1500 words.
Why you are doing this
Festival is a widely used concatenative TTS toolkit that implements the type of TTS pipeline that is discussed in the class videos, lectures and readings. From the class materials you should see that each of the steps in a specific TTS voice setup depends on several design, data, and algorithmic choices. Our goal in setting this assignment is to give you more concrete practical experience in how concatenative synthesis works, as well as how specific choices in TTS voice setup can cause issues in speech generated using this method.
Although concatenative synthesis is no longer the state of the art in TTS, you will still hear such systems deployed in applications. Moreover, the issues you will explore when generating speech from text are still highly relevant to understanding even the latest TTS approaches.
Understanding the types of errors that arise in different parts of a TTS pipeline will help you to consolidate what you have learned in modules 1-4. That is: what are the properties of human speech (in terms of articulation and acoustics), and how can we specify and generate speech using a computer?
Practical Tasks
You should work through the practical tasks below, which are expanded on in separate web pages on Speech Zone. You must use the version of Festival installed on the AT PPLS computer lab machines. You can access this in the lab or via the remote desktop service.
- Get started with Festival:
  - Use the Festival Synthesis system to generate speech from text
  - Follow the instructions to generate speech using a custom unit selection voice, cstr_edi_awb_arctic_multisyn, which we'll refer to as awb from now on.
  - Generate and save .wav files of the utterances you synthesize
- Step through the TTS pipeline
  - View all the steps used in generating speech for this specific voice, awb.
  - Examine how the linguistic specification derived from the input text is stored in terms of relations.
  - Compare this to the theoretical TTS pipeline discussed in the course materials (i.e., videos, lectures, readings). awb is missing some steps from the theoretical pipeline. What are they? What is the result of the way this voice was set up in practice?
  - You may find it useful to compare the linguistic specification (i.e., relations) derived and used in the awb voice to the diphone synthesis voice voice_kal_diphone.
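To get an intuition for what "stored in terms of relations" means: an utterance holds several named relations (Word, Syllable, Segment, and so on), each an ordered structure over items, and annotations added by one pipeline step are then visible to later steps. The following is a rough Python sketch of that idea only, not Festival's actual Scheme API or its real relation structure:

```python
# Hypothetical sketch of an utterance structure, NOT Festival's real API:
# each named relation is an ordered collection of items, and annotations
# added by one pipeline step remain visible to later steps.

class Item:
    def __init__(self, name, **features):
        self.name = name
        self.features = dict(features)   # e.g. POS tags, durations, f0 targets

class Utterance:
    def __init__(self, text):
        self.text = text
        self.relations = {}              # relation name -> ordered list of Items

    def create_relation(self, name, items):
        self.relations[name] = list(items)

# Build a toy linguistic specification for the input "Dr Smith".
utt = Utterance("Dr Smith")
w1 = Item("doctor", pos="nnp")           # token "Dr" normalised to a word
w2 = Item("smith", pos="nnp")
utt.create_relation("Word", [w1, w2])

# The Segment relation holds phones; a real voice derives these from a lexicon.
segs = [Item(p) for p in ["d", "aa", "k", "t", "er", "s", "m", "ih", "th"]]
utt.create_relation("Segment", segs)

print([i.name for i in utt.relations["Word"]])   # ['doctor', 'smith']
print(len(utt.relations["Segment"]))             # 9
```

When you print an utterance's relations in Festival itself, look for how items are shared between relations (e.g. a syllable belongs both to a word and to a sequence of segments), which is richer than this flat sketch.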
- Find mistakes
  - Find examples of errors that the awb voice makes in each of the following categories:
    - Text normalisation
    - POS tagging/homographs
    - Pronunciation
    - Phrase break detection
    - Waveform generation
  - You should focus on (at least) one error example per category for your write-up.
  - Explain what exactly the error is and how it arises.
  - Tip: It will generally be more efficient to think about what errors are likely to occur at the TTS pipeline steps associated with each of the categories above and design specific examples, rather than randomly trying different texts (although synthesising existing texts is a reasonable starting point if you're just trying to get a feel for the system).
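As an example of designing inputs to provoke a category of error, consider text normalisation: the same digit string needs different spoken expansions depending on context, and a normaliser that always picks one reading will be wrong in the others. The sketch below is purely illustrative Python, not Festival's actual normalisation rules:

```python
# Hypothetical illustration (NOT Festival's actual rules) of why text
# normalisation is error-prone: the same token "1984" can be a cardinal
# number or a year, and the two are spoken differently.

UNITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]

def two_digits(n):
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n - 10]
    return (TENS[n // 10] + " " + UNITS[n % 10]).strip()

def as_cardinal(n):
    # Handles 0-9999, which is enough for the illustration.
    if n < 100:
        return two_digits(n) if n else "zero"
    if n < 1000:
        rest = two_digits(n % 100)
        return UNITS[n // 100] + " hundred" + (" " + rest if rest else "")
    rest = as_cardinal(n % 1000) if n % 1000 else ""
    return as_cardinal(n // 1000) + " thousand" + (" " + rest if rest else "")

def as_year(n):
    # Years are usually read as two pairs: 1984 -> "nineteen eighty four".
    return two_digits(n // 100) + " " + two_digits(n % 100)

# A normaliser that always picks the cardinal reading gets years wrong:
print(as_cardinal(1984))   # one thousand nine hundred eighty four
print(as_year(1984))       # nineteen eighty four
```

The same strategy works for the other categories: ask what decision the relevant pipeline step has to make, then construct inputs where the "easy" decision is the wrong one (homograph pairs for POS tagging, out-of-vocabulary words for pronunciation, long clauses without punctuation for phrase breaks).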
- Reflect on the implications of your error analysis
  - While you are exploring different error types, consider what the implications of different categories of errors are. Are some more severe with respect to intelligibility? Or with respect to naturalness?
  - Similarly, consider what potential solutions to the errors could be. Could they be fixed within the current setup, or would the solution require new, external resources? How generalisable would a specific solution be?
- Write up your findings
  - You should write up one error for each category in your report, as well as a discussion of the implications of the errors (see the write-up instructions below).
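For the waveform generation category, it helps to recall what unit selection is doing: for each target position, the voice has several candidate recorded units, and it searches for the sequence minimising the sum of target costs (how well a unit matches the specification) and join costs (how smoothly adjacent units concatenate), typically with dynamic programming. The toy sketch below illustrates that search only; the candidates, pitch values, and cost functions are made up and bear no relation to the awb voice's actual features:

```python
# Toy unit-selection search (Viterbi over candidate units), purely
# illustrative: real voices use many acoustic and linguistic cost features.

def select_units(candidates, target_cost, join_cost):
    """candidates: one list of candidate unit labels per target position."""
    n = len(candidates)
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k] + target_cost(i, u), k))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Two candidate recordings for most diphones; the join cost penalises a
# pitch mismatch at the concatenation point (values are invented).
cands = [["a1", "a2"], ["b1", "b2"], ["c1"]]
pitch = {"a1": 100, "a2": 120, "b1": 118, "b2": 95, "c1": 117}
tcost = lambda i, u: 0.0
jcost = lambda p, u: abs(pitch[p] - pitch[u])

print(select_units(cands, tcost, jcost))   # ['a2', 'b1', 'c1']
```

Seen this way, many waveform generation errors are cases where the cheapest path under the voice's cost functions is still perceptually bad, for example because no good candidate exists in the database for a particular join.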
Please read carefully through the pages linked above for further guidance on what exactly you should do (note: there’s no link for point 4 – just think through the questions outlined above).
How to submit your assignment
You should submit a single written document in pdf format for this assignment – see the pages on Writing Up for detailed instructions on what to include and how your report should be formatted.
You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).
You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version! If in doubt, please ask the teaching office.
Extensions and Late submissions
Extensions and exceptional circumstances cases are handled by a centralised university system. Please see more details here: Exceptional Circumstances Guidance
This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this please get in touch with the PPLS teaching office (UG, PG) and/or your student adviser.
Please note that the Informatics Teaching Office are not able to help with this course as it is administered in PPLS. The PPLS teaching office is able to help though!
How will this assignment be marked?
For this assignment, we are looking to assess your understanding of how (concatenative) TTS works. The report should not be a discursive essay, nor merely documentation of what you typed into Festival and what output you got. We want to see what you've learned from working with this specific voice setup in Festival and how you can relate that to the theory from lectures.
You will get marks for:
- Completing all parts of the practical, and demonstrating this in the report
- Providing interesting errors made by Festival, with correct analysis of the type of error and of which module made the error.
- Providing an analysis of the severity and implications of different types of errors and how they can affect the usefulness of a TTS voice more generally.
If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).
You can get higher marks by adding some more depth to your analysis in each section (See write-up guidance).
You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!
Working with other people (and machines)
It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.
Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).
If you want to try to use ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also, the voice we are using is different from the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment. You have an actual expert on this topic in the lab, so you are better off asking them rather than ChatGPT!
In any case, if you use generative AI based tools (such as ChatGPT) to help you in this assignment you MUST declare it. See the write-up instructions for more discussion of this. Your declaration of AI use will not contribute to the word count.
How to get help
The first port of call for help is the course labs. The labs for modules 5 and 6 (Weeks 6 and 7) will be devoted to this assignment. It’s generally easier to troubleshoot issues with Festival in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!).
You can also ask questions on the speech zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class. You will need to be logged into speech zone in order to see the assignment specific forums.
However, as ever, you can always also come to office hours, book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.