The Festival text-to-speech system

Festival is a widely used research toolkit for Text-To-Speech. It is not perfect, and your goal is to discover various types of errors it makes, then understand why they occur.

This assignment has been updated for 2024-25 – enjoy!

These milestones will help you stay on track with this assignment. Try to stay ahead of the milestones.

  1. Overview
    An overview of what you will do for this assignment

    Text-to-Speech with Festival

    Speech Processing Assignment 1

    Due by 12:00 noon on Monday, 4 November 2024. Submit your report on Learn.

    What you will do

    In this assignment, you will investigate the generation of synthetic speech from text using the Festival Text-To-Speech (TTS) toolkit. You will identify errors in the generated speech and analyse them: where exactly do they arise in the TTS pipeline, what sorts of problems do they cause, and what are the ramifications and potential solutions?

    The lab materials will show you how to load a specific voice, generate speech samples, and inspect what steps were used to generate speech from a specific text input.

    You will be focusing on errors generated by a specific voice configuration: cstr_edi_awb_arctic_multisyn. Please note that this voice is different from the publicly available awb setup that you might find online (though it uses the same database of recordings). The voice setup that you need to use for the assignment is only available to people in this class, and it has its own specific issues (you’ll soon see it generates many, many errors!).

    You are not expected to look into the implementation of Festival to complete this assignment. You are certainly not expected to learn Scheme (the Lisp-like programming language used by the Festival interface)! Don’t focus on Festival as a piece of software, but rather the process of generating speech from text.

    What you will submit

    Your assignment submission will be a structured report discussing specific types of errors, as well as discussion/reflection on the TTS pipeline as a whole. The word limit for the report is 1500 words.

    Why you are doing this

    Festival is a widely used concatenative TTS toolkit that implements the type of TTS pipeline that is discussed in the class videos, lectures and readings. From the class materials you should see that each of the steps in a specific TTS voice setup depends on several design, data, and algorithmic choices. Our goal in setting this assignment is to give you more concrete practical experience in how concatenative synthesis works, as well as how specific choices in TTS voice setup can cause issues in speech generated using this method.

    Although concatenative synthesis is no longer the state of the art in TTS, you will still hear such systems deployed in applications. Moreover, the issues you will explore when generating speech from text are still highly relevant to understanding even the latest TTS approaches.

    Understanding the types of errors that arise in different parts of a TTS pipeline will help you to consolidate what you have learned in modules 1-4. That is: what are the properties of human speech (in terms of articulation and acoustics), and how can we specify and generate speech using a computer?

    Practical Tasks

    You should work through the practical tasks below, which are expanded on in separate web pages on Speech Zone. You must use the version of Festival installed on the AT PPLS computer lab machines. You can access this in the lab or via the remote desktop service.

    1. Get started with Festival:
      • Use the Festival Synthesis system to generate speech from text
      • Follow the instructions to generate speech using a custom unit selection voice  cstr_edi_awb_arctic_multisyn, which we’ll refer to as awb from now on.
      • Generate and save .wav files of the utterances you synthesize
    2. Step through the TTS pipeline
      • View all the steps used in generating speech for this specific voice awb.
      • Examine how the linguistic specification derived from the input text is stored in terms of relations.
      • Compare this to the theoretical TTS pipeline discussed in the course materials (i.e., videos, lectures, readings)
      • awb is missing some steps from the theoretical pipeline. Which ones? What are the practical consequences of the way this voice was set up?
      • You may find it useful to compare the linguistic specification (i.e., relations) derived and used in the awb voice to the diphone synthesis voice voice_kal_diphone
    3. Find mistakes
      • Find examples of errors that the awb voice makes in each of the following categories.
        1. Text normalisation
        2. POS tagging/homographs
        3. Pronunciation
        4. Phrase break detection
        5. Waveform Generation
      • You should focus on (at least) one error example per category for your write up.
      • Explain what exactly the error is and how it arises.
      • Tip: It will generally be more efficient to think about what errors are likely to occur at the TTS pipeline steps associated with each of the categories above and design specific examples, rather than randomly trying different texts (although generating existing texts is a reasonable starting point if you’re just trying to get a feel for the system).
    4. Reflect on the implications of your error analysis
      • While you are exploring different error types, consider what the implications of different categories of errors are. Are some more severe with respect to intelligibility? Or with respect to naturalness?
      • Similarly, consider what potential solutions to the errors could be. Could they be fixed within the current setup or would the solution require new, external resources? How generalisable would a specific solution be?
    5. Write-up your findings
      • You should write up one error for each category in your report, as well as a discussion of the implications of the errors (see write up instructions below).

    Please read carefully through the pages linked above for further guidance on what exactly you should do (note: there’s no link for point 4 – just think through the questions outlined above).

    How to submit your assignment

    You should submit a single written document in pdf format for this assignment – see the pages on Writing Up for detailed instructions on what to include and how your report should be formatted.

    You must submit your assignment via Turnitin, using the appropriate submission box on the course Learn site. There are separate submission links for the UG and PG versions of the course (but they are both linked on the same Learn site).

    You can resubmit work up until the original due date. However, if you are submitting after the due date (e.g., you have an extension) you will only be able to submit once – so make sure you are submitting the final version!  If in doubt, please ask the teaching office.

    Extensions and Late submissions

    Extensions and exceptional circumstances cases are handled by a centralised university system. Please see more details here: Exceptional Circumstances Guidance

    This means that the teaching staff for a specific course have no say over whether you can get an extension or not. If you have worries about this please get in touch with the PPLS teaching office (UG, PG) and/or your student adviser.

    Please note that the Informatics Teaching Office are not able to help with this course as it is administered in PPLS.  The PPLS teaching office is able to help though! 

    How will this assignment be marked?

    For this assignment, we are looking to assess your understanding of how (concatenative) TTS works. The report should not be a discursive essay, nor merely documentation of what you typed into Festival and what output you got. We want to see what you’ve learned from working with this specific voice setup in Festival and how you can relate that to the theory from lectures.

    You will get marks for:

    • Completing all parts of the practical, and demonstrating this in the report
    • Providing interesting errors made by Festival, with correct analysis of the type of error and of which module made the error.
    • Providing an analysis of the severity and implications of different types of errors and how they can affect the usefulness of a TTS voice more generally.

    If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

    You can get higher marks by adding some more depth to your analysis in each section (See write-up guidance).

    You will NOT get marked down for typos or grammatical errors, but do try to proofread your work – help your markers to give you the credit you deserve!

    Working with other people (and machines)

    It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.

    Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).

    If you want to use ChatGPT to help you learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also, the voice we are using is different to the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment. You have an actual expert on this topic in the lab, so you are better off asking them rather than ChatGPT!

    In any case, if you use generative AI based tools (such as ChatGPT) to help you in this assignment you MUST declare it. See the write-up instructions for more discussion of this.  Your declaration of AI use will not contribute to the word count.

    How to get help

    The first port of call for help is the course labs. The labs for modules 5 and 6 (Weeks 6 and 7) will be devoted to this assignment. It’s generally easier to troubleshoot issues with Festival in person, so do make use of the labs! It’s fine to come to either the 9am or 4pm Wednesday labs (or both as long as there’s room!).

    You can also ask questions on the speech zone forum. The forum is the preferred method for getting asynchronous help, as it means that any clarifications or hints are shared with the whole class.  You will need to be logged into speech zone in order to see the assignment specific forums.

    However, as ever, you can always also come to office hours, book a 1-1 appointment with the lecturers, or send us an email – see contact details in Module 0.

  2. Getting started
    A first look at Festival and how we use it in interactive mode on the command line.

    Accessing Festival

    The instructions assume you are using the installation of Festival on the computers in the PPLS Appleton Tower (AT) labs. You can work on the computers in the actual physical lab or you can use the remote desktop service (see Module 0 for instructions).

    Note: The PPLS AT Lab computers/remote desktop we are using for this course are completely separate from Informatics remote desktop/DICE!   Importantly, you won’t have access to the voice setup we will be using on DICE. 

    It is possible to install Festival directly onto your computer but this is not necessary for students taking Speech Processing. Installation requires a Unix-like environment (Linux or macOS, or a Linux-style terminal running on Windows) and compiling code from source (see guidance here: Install Festival). If you’ve never compiled code before, and don’t have much experience with the Unix command line, your best bet is to use the PPLS AT Lab computers.

    Accessing Festival Remotely

    You can use the installation of Festival on the Appleton Tower lab servers using the remote desktop service. To connect using the remote desktop, follow the instructions here: Module 0 – computing requirements

    Once you’ve started the remote desktop and logged in (with your UUN and EASE password), you can open the Terminal app by going to the menu bar at the top and clicking Applications > System Tools > Terminal. You may want to drag the Terminal icon to the desktop to make it easier for you to find it.

    If you accidentally close VNC before you log out, you can reconnect by double clicking on the machine you previously logged onto in VNC viewer.

    When you are finished, remember to log out: go to the menu at the top of the screen > System > Log out.

    Assignment Data

    If you are using the remote desktop to access the AT lab computers (or are physically in the lab), all the relevant data is already there for you on the linux machines.

    If you have installed Festival on your own computer, you will need to get the voice database and dictionaries used to run the voice (voice:cstr_edi_awb_arctic_multisyn, dictionary:unilex). You can find instructions by following this link.

    Start Festival

    Festival has a command-line interface which runs in the terminal (i.e. the unix bash shell). To run it in the PPLS AT lab, you’ll need to:

    1. Make sure the computer is booted into Linux (if it is in Windows, restart the machine and select the penguin (the Linux mascot!) when presented with the choice);
    2. open a terminal via Applications > System Tools > Terminal from the menu bar at the top left of the screen.   You can drag the Terminal icon from the menu to the desktop if you want to make a shortcut.

    Now open a Terminal and run Festival by typing in festival at the prompt $:

    $ festival
    

    You should see some text about the version of Festival we are using (Festival 2.5.0):

    Festival Speech Synthesis System 2.5.0:release December 2017
    Copyright (C) University of Edinburgh, 1996-2010. All rights reserved.
    
    ..etc
    

    and the prompt will also change to show the following:

    festival>
    

    This new prompt means that Festival is running; any commands that you type must now be in the Scheme language and will be interpreted by Festival rather than by the shell.

    You will be pleased to know that Festival’s command-line interface uses the same keyboard shortcuts as the bash shell (e.g., TAB completion, ctrl-a, ctrl-e, ctrl-p, ctrl-n, up/down/left/right cursor keys, etc.). Here’s a nice cheat sheet for common bash commands.  For a comprehensive list of these shortcuts, see the Wikipedia entry for GNU Readline.

    If you get into trouble at any point and need to exit the current command, use ctrl-c. This applies to both Festival and the bash shell.

    It’s really worth learning these keyboard shortcuts because they also apply to the bash shell and will save you a lot of time.

    Make Festival speak

    Synthesise some sentences to become familiar with the Festival command line.

    Festival contains a number of different synthesis engines and for each of these, several voices are available: the quality of synthesis is highly dependent on the particular engine and voice that is being used.

    Using the SayText command

    By default, Festival will start with a rather old diphone voice, which does not sound great, but is fast and intelligible:

    festival> (set! myutt (SayText "Welcome to Festival"))

    This command combines a bunch of different things: it converts the input text “Welcome to Festival” to a linguistic specification and uses that specification to generate speech by selecting appropriate diphones. The set! myutt part of the command tells Festival to store all the information relating to how the utterance was synthesized in the variable called myutt.

    You can set the voice to the one we will use in the assignment by typing the following after starting festival:

    festival> (voice_cstr_edi_awb_arctic_multisyn)

    You’ll see some “EST warning” messages printed to the screen, but you can ignore those.

    Now try generating a sentence as you did before with the SayText command. Can you hear a difference between the two voices?

    Generating speech without playing it

    To generate an utterance without playing it, use the following steps instead of SayText:

    festival> (set! myutt (Utterance Text "Hello"))
    festival> (utt.synth myutt)

    Then you can save the utterance myutt to a wave file called “myutt.wav” with the following command:

    festival> (utt.save.wave myutt "myutt.wav" 'riff)

    This will save a file called myutt.wav in whatever directory you started festival in. If you just opened a terminal and started festival without changing directories, you will be in your ‘home’ directory. You can check the folder on the desktop called [your username]’s home and see if a new wav file has appeared there. Otherwise you can exit festival by pressing ctrl-d and typing the command pwd: this will tell you where you currently are – your “present working directory”.
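
    For example, after exiting Festival with ctrl-d, type pwd at the shell prompt. The output will look something like this (the path shown is just an illustration – yours will contain your own username):

    $ pwd
    /home/s1234567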

    Note: When you issue a command to Festival you must put it in round brackets (...) – if you do not, it will generate an error. You are using a language called Scheme.

    Scheme, and lots of brackets

    Scheme is a LISP-like language used as the interface to Festival. When you run Festival in interactive mode, you talk to Festival in Scheme. Fortunately, we’re not going to have to learn too much Scheme. All you need to know for now is that the basic syntax is (function_name argument1 argument2 ...).
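
    You can try this syntax out directly at the prompt – Scheme will evaluate any expression you type, even simple arithmetic:

    festival> (+ 1 2)
    3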

    In Scheme, all functions return a value, which by default is printed after the function completes. The SayText function returns an Utterance structure, so that is why something like #<Utterance 0x...> is printed after the completion of the function. A variable (myutt in this case) can be set to capture this return value, which will allow us to examine the utterance after processing. This is done using the set! command (note the two sets of brackets):

    festival> (set! myutt (SayText "Welcome to Festival"))
    #<Utterance 0x...>
    

    The TTS process

    We can now examine the contents of the myutt variable. The SayText function is a high level function which calls a number of other functions in a chain. Each of these functions performs a specific job, such as deciding the pronunciation of a word, or how long a phone should be. We’ll be running these step-by-step later on.

    The TTS process in Festival is a pipeline of sub-processes, which build up an Utterance structure in stages. This building process takes the original text as input and adds more and more information, which is stored in the utterance structure. In Festival, a unified mechanism for representing all types of data needed by the system has been developed: this is called the Heterogeneous Relation Graph system, or HRG for short.

    Each Relation in an HRG is a structure that links items of a particular linguistic type. For example, we have a Word relation which is a list linking all the words, and a Segment relation which links all the phones etc. Relations can take different forms: the most common types are linear lists and trees.

    Each module in Festival takes a number of relations as input and either creates new relations as output, or modifies the input ones. The vast majority of modules only write new information, leaving all information in the input untouched (there are a few exceptions, such as post-lexical processing). Because of this, examining the contents of the relations in an utterance after processing gives an insight into the history of the TTS process.

    Different configurations of Festival can vary with respect to their use of HRGs, and which modules they call.

    Examining a saved object

    Once you have synthesised an utterance you can do lots of things with it. Here are a few examples.

    festival> (utt.play myutt)
    festival> (utt.relationnames myutt)
    festival> (utt.relation.print myutt 'Word)
    festival> (utt.relation.print myutt 'Segment)
    

    You can get a list of the relations that are present in a synthesised utterance by using the utt.relationnames command.

    Relations that are lists can easily be printed to the screen with the utt.relation.print command. Try this with all of the relations in an utterance. Some of them won’t reveal useful information, others will.

    The output from (utt.relation.print myutt 'Word) may look like this:

    ()
    id _3 ; name hello ; pos_index 16 ; pos_index_score 0 ; pos uh ;
            phr_pos uh ; phrase_score -13.43 ; pbreak_index 1 ;
            pbreak_index_score 0 ; pbreak NB ;
    id _4 ; name world ; pos_index 8 ; pos_index_score 0 ; pos nn ;
            phr_pos n ; pbreak_index 0 ; pbreak_index_score 0 ;
            pbreak B ; blevel 3 ;
    nil
    

    Each data line starts with an id number like id _3, followed by a series of features separated by semicolons. Each feature has a name and a value, e.g., feature name: pos, feature value: uh.
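
    If you want to pull out a single feature value rather than reading the whole printout, Festival’s Scheme interface also provides item-level accessors (a short sketch using accessor functions documented in the Festival manual; the feature names must match those in the printout above):

    festival> (set! w (utt.relation.first myutt 'Word))  ; first item in the Word relation
    festival> (item.feat w 'name)                        ; e.g. "hello"
    festival> (item.feat w 'pos)                         ; e.g. "uh"
    festival> (set! w (item.next w))                     ; move on to the next word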

    Examining the processing steps

    Tokens – First the text is split into Tokens. Look at the Token relation, where an item is created for each component of the text you input. The Token relation will still have digits and abbreviations in it.

    Words – The Tokens are then converted to Words, abbreviations and digits are processed and expanded. Look for this in the Word relation.

    Part of Speech Tagging – Each word is tagged with its part of speech, which is added as a feature to the Word relation.

    Pronunciation – The pronunciation of each word is determined and the Syllable and Segment relations created. Examine these: the syllable relation is not very interesting as there is very little information here, just a count of the syllables.

    You can look up the pronunciation of a word yourself with the function lex.lookup

    festival> (lex.lookup "caterpillar")
    ("caterpillar" nil (((k ae t) 1) ((ax p) 0) ((ih l) 1) ((er) 0)))
    

    The actual pronunciation returned depends on which lexicon a particular voice uses, and whether the word is in the lexicon or if Festival has to predict the pronunciation using letter-to-sound rules.

    Try looking up the pronunciation of some real words, and some made up ones.
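
    For example (the second word here is invented, so whatever pronunciation comes back must have been produced by the letter-to-sound rules rather than retrieved from the dictionary):

    festival> (lex.lookup "edinburgh")
    festival> (lex.lookup "frabjorious")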

    Accent Prediction – An intonation module assigns pitch accents (and other intonational events) to syllables. A number of different modules exist within Festival, operating with a number of intonation models including ToBI and Tilt. The assignment voice (awb) doesn’t actually do accent prediction, but you can see what this would look like by trying the older diphone synthesis voice, kal, which does:

    To switch to the kal voice, enter the following in festival:

    festival> (voice_kal_diphone)
    

    Now, look at the IntEvent relation to see which pitch events have been assigned. From the pitch events and the predicted durations, a pitch contour is generated. This contour is a list of numbers which specify the pitch at each point for the resulting waveform. There is no easy way to view the pitch contour.
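
    For example, with the kal voice loaded:

    festival> (set! kutt (SayText "This is a short test sentence."))
    festival> (utt.relation.print kutt 'IntEvent)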

    You can use the following command to change back to the assignment voice:

    festival> (voice_cstr_edi_awb_arctic_multisyn)
    

    Waveform generation – The Unit relation is created by making a list of diphones from the segments, and the information about the speech needed for synthesis is copied in. The Unit relation contains features with values in square brackets [...]. These are references to the actual speech used to synthesise these units.

    Quit Festival

    festival> (quit)
    

    or use ctrl-D, just like in the bash shell. Festival remembers your command history between sessions (again, just like bash). Next time you start Festival you can use the up cursor key to find previous commands, and then hit ‘Enter’ to execute them again. Of course, Festival does not remember the values of variables (e.g., myutt in the above example) between sessions.

    Transferring data from the AT lab servers

    To get your data (e.g. generated wav files) from the AT lab servers (e.g. remote desktop) you can either use a terminal based command like rsync or an SFTP client like FileZilla or WinSCP (graphical interface). For example, the following copies the file myutt.wav that’s in ~/Documents/sp/assignment1 to the directory where you’re running the rsync command from on your own computer:

    rsync -avz s1234567@scp1.ppls.ed.ac.uk:Documents/sp/assignment1/myutt.wav ./
    
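    To fetch everything in your assignment directory in one go, you can point rsync at the directory rather than a single file (a sketch using the same server address as above; replace s1234567 with your own UUN):

    rsync -avz s1234567@scp1.ppls.ed.ac.uk:Documents/sp/assignment1/ ./assignment1/
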

    Note [24/10/23]: The file transfer server scp1.ppls.ed.ac.uk doesn’t appear to be working right now.  We are investigating, but for now you can still transfer files by replacing scp1.ppls.ed.ac.uk with one of the remote desktop addresses, e.g. ppls-atl-1079.ppls.ed.ac.uk (and similarly for references to scp1.ppls.ed.ac.uk below).  You’ll need to have the university VPN on to do this. You can see the list of PPLS AT lab remote desktop addresses here: https://resource.ppls.ed.ac.uk/whoson/atlab.php

    Note: The previous command will only work if you’ve already made the directories Documents/sp/assignment1 in your home directory on the AT lab servers. If you haven’t done this, you can skip this for now and try it after you’ve created some files.

    You can share your files with yourself by copying them to OneDrive (or Google Drive) via a browser.

    You can also use a file transfer app like FileZilla. In this case, you need to set the remote host to scp1.ppls.ed.ac.uk. For FileZilla, go to File > Site Manager, then set the protocol to SFTP, the host as scp1.ppls.ed.ac.uk, and use your UUN as username and EASE password as the password. After connecting you should see your home directory on the AT lab servers as the remote site. You can then drag files from remote site side to the appropriate place in the local site side.

     

    What you should now be able to do

    • start Festival and make it speak using SayText
    • capture the Utterance structure returned by SayText
    • look inside the Utterance structure at the Relations
    • have an initial understanding of what Relations are, but not yet the full picture
    • use some of the keyboard shortcuts that are common to Festival and the bash shell
    • save a synthesized utterance as a .wav file and transfer it to your own computer.
  3. Step-by-step
    It's possible to run each step in the text-to-speech pipeline manually, and inspect what Festival does at each point.

    We are going to examine the speech synthesis process in detail. You will examine the sequence of processes Festival uses to perform text-to-speech (TTS) and relate these processes to what you have learnt in the Speech Processing course. Make notes in a lab book as you work! Remember that Festival is just one example of a TTS system – there are generally many other ways of implementing each step.

    Festival stores information like words, phonemes, F0 targets, syntax, etc. in things called Relations. These approximately correspond to the levels of linguistic representation mentioned in lectures. As each stage in the process is performed, more information is added to the utterance structure, and you can examine this inside Festival. So, each step in the pipeline will either add a new relation, or modify an existing one. Note, however, not all TTS voice configurations generate and use all types of relations (e.g., the awb voice is definitely missing some that the kal voice has).

    Your task in this part of the exercise is to explore the synthesis process and discover what Festival does in each step of processing, when converting text to speech. If you notice errors in the prediction of phrases, pronunciation, the processing of numbers or anything else, make a note of it, as this will be useful for the next part of the exercise.

    Hints

    Three hints for this practical exercise:

    1. Read the instructions through completely before you start.
    2. Use Festival’s tab-completion and command-line history (which is kept even if you quit and restart Festival) to save typing and avoid mistakes.
    3. If things go wrong (either with Festival, or with you), quit Festival and restart it.

    Festival help

    Festival can make your life easier in a number of ways.

    Command history

    You can access commands you have previously typed using the arrow keys. Press the up arrow a number of times to see the previous commands you entered, then use the left and right arrow keys to edit them. Press ENTER to run the edited command.

    TAB completion

    If you start to type a command name into Festival, then press TAB, it will either complete the command for you or give you a list of possible completions. For example, to get a list of all of the commands that work on an utterance, type

    festival> (utt.
    

    and then press TAB once or twice.

    Getting help

    Most commands in Festival have built-in help. Type the name of the command (including the initial open bracket) and then press ⟨ALT⟩-h (hold the ALT key down and press h), or alternatively, press (and release) ESC and then press h. Festival will print some help for that command, including a list of arguments that it expects.

    Make an assignment working directory

    Before you start, make a folder to work in, so you can keep your data organised.  On the PPLS AT lab computers/remote desktop, you can make a directory in ~/Documents/sp/assignment1. We will assume this is your working directory in the rest of these instructions.  You can do this on the command line (i.e., in a terminal) with the following commands.

    cd ~
    mkdir -p ~/Documents/sp/assignment1
    

    Change into this directory, e.g.:

    cd ~/Documents/sp/assignment1

    You can use the pwd command to see which directory you are currently in and the ls command to see which files are in the current directory.

    Starting Festival with a specific voice configuration

    Previously, you started Festival in the default mode.  But for this assignment we want to use a specific voice in a specific configuration.

    We are going to use a unit selection voice, called cstr_edi_awb_arctic_multisyn in this exercise.

    We’re going to use a simple configuration file to tell Festival to load the correct voice, and to add a few extra commands (file location on the AT lab servers):

    cp /Volumes/Network/courses/sp/assignment1/config.scm .
    chmod ugo+r config.scm
    

    If you followed this gist to install Festival on your own computer, you probably already downloaded the config.scm file (by default into the “assignment1” directory next to the “tools” directory where you installed Festival – see lines 82-85 of the gist). In that case, you can start Festival from there or move the config.scm file to the directory you want to work in.

    Remember that if you come back later, you only need to cd to your working directory (e.g. ~/Documents/sp/assignment1 following the remote desktop instructions). You don’t need to copy the file again as long as you start festival from a directory that contains config.scm. Now, every time you start Festival during the rest of this exercise, do it like this:

    festival config.scm
    

    Compared to the earlier exercises using a diphone voice, Festival will take longer to start when loading this unit selection voice. Why?

    In the following, festival> at the beginning of a line just represents the Festival command-line prompt – you only need to type the part in parentheses.

    Once Festival is running, check that it speaks:

    festival> (SayText "hello world")
    

    You should hear a reasonably good quality Scottish male voice. If not, you probably forgot to start festival using the config.scm file.

    Synthesising an utterance step-by-step

    Read this section carefully before trying any of the commands.

    So far we have only synthesised utterances from start to finish in one go, using SayText. Now we are going to do it step-by-step.
    First you need to create a new utterance object. The following command creates one with the text supplied and sets the variable myutt to point to it.

    festival> (set! myutt (Utterance Text "Put your own text here."))
    

    Now you can manually run each step of the text-to-speech pipeline – don’t skip any steps (what would happen if you did?). Use a single short utterance of your own when performing this part – make it interesting (e.g., containing some text that needs normalising, a word that is unlikely to be in the dictionary, and so on).

    festival> (Initialize myutt)
    festival> (Text myutt)
    festival> (Token_POS myutt)
    festival> (Token myutt)
    festival> (POS myutt)
    festival> (Phrasify myutt)
    festival> (Word myutt)
    festival> (Pauses myutt)
    festival> (PostLex myutt)
    festival> (Wave_Synth myutt)
    festival> (utt.play myutt)
    

    If you get an error, you will have to start again by creating a new utterance with the set! command. If you get confused, quit Festival and start from the beginning again.

    Note that running the synthesis pipeline step-by-step is just to help you understand what is happening. You might need it to diagnose some mistakes later on, but most of the time, you will just use SayText.

    Commands for examining utterances

    You should pause to examine the contents of myutt between each step.

    To determine which relations are now present:

    festival> (utt.relationnames myutt)
    

    and to examine a particular Relation (if it exists):

    festival> (utt.relation.print myutt 'Phrase)
    festival> (utt.relation.print myutt 'Word)
    festival> (utt.relation.print myutt 'Segment)
    

    and so on for any other Relations that exist.

    You can also use the following command to see the overall structure of the utterance:

    festival> (print_sylstructure myutt)
    

    This will show you how the different relations tie together. It will show you Words, Syllables as lists of segments, and the presence of stress.

    Concentrate on discovering which commands create or modify Relations in the utterance structure, and what information is stored in those Relations. Note: the Initialize command will not reveal anything interesting, and it may be difficult to see what the Pauses and PostLex commands do.
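
    A useful pattern is to print the relation names before and after each step, so you can see exactly which relations that step added (a sketch – substitute your own test sentence):

    festival> (set! myutt (Utterance Text "Dr. Smith arrived on 3/4/21."))
    festival> (Initialize myutt)
    festival> (utt.relationnames myutt)
    festival> (Text myutt)
    festival> (utt.relationnames myutt)
    festival> (Token_POS myutt)
    festival> (utt.relationnames myutt)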

    What you should now be able to do

    • start Festival and load a configuration file (which is just a sequence of Scheme commands to run after startup)
    • Make full use of keyboard shortcuts including: TAB completion, ctrl-A, ctrl-E, ctrl-P, ctrl-N, ctrl-R, up/down cursor keys to navigate the command history, left/right cursor keys to edit a command.
    • run the pipeline step-by-step
    • describe which Relations are added, or modified, by each step
    • understand that Relations are composed of Items
    • understand that Items are data structures containing an unordered set of key-value pairs
    • have an initial understanding of what some (but not all) of the keys and values mean (e.g., POS tags in the Word relation)
  4. Finding mistakes
    Festival makes mistakes, of course. Your task is to find interesting ones, and explain why each occurs.

    Starting Festival

    Just a reminder that every time you start Festival during this exercise, make sure you use the right voice config file so that you’re finding errors in the right voice (awb).  Remember to change to the directory where you placed the config.scm file:

    $ festival config.scm
    

    Saving waveforms from Festival

    You’ll want to save at least some of your synthesised utterances for further analysis, e.g. in Praat.  Since you have a fully synthesised utterance object in Festival, it is possible to extract the waveform to a file as follows:

    festival> (utt.save.wave myutt "myutt.wav" 'riff)
    

    Here myutt is the name of the utterance object and myutt.wav is the filename, which you can choose; if you save more than one waveform, give them different names. You can then view and analyse the waveform in Praat or Wavesurfer.
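
    For example, if you are collecting one utterance per error category, you might bind each to its own variable and save it under a descriptive name (the variable and file names here are just illustrations):

    festival> (set! err_norm (SayText "The sale starts on 2/3/24."))
    festival> (utt.save.wave err_norm "error_textnorm.wav" 'riff)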

    Explaining mistakes made by Festival

    Using what you have learned about Festival, you can now find some examples of it making errors for English text-to-speech. Find examples in each of the following categories:

    • Text normalisation
    • POS tagging/homographs
    • Phrase break prediction
    • Pronunciation (dictionary or letter-to-sound)
    • Waveform generation

    At this point, you might be thinking that the mistakes are simply because Festival is rather old, and that more recent TTS systems will not make such mistakes. So here are two examples from October 2024:

    1. Google voice search reading the text “To permanently disable Live Photos on an iPhone, you can do the following: Go to Settings; Toggle the switch next to Live Photos.” (this is a WaveNet model running on Google’s servers)
    2. Apple iOS reading the text “Your next appt at Christopher Sale Dentistry Ltd is on 28/04/25…” (this is presumed to be running on-device, which is an old iPhone 8 in this case)

     

    In your explorations with Festival, aim for a variety of different types of errors, with different underlying causes: one error for each of the front-end categories (see the assignment overview). Don’t report lots of errors of the same type.

    Be sure that you understand the differences between these various types of error. For example, when Festival says a word incorrectly, it might not be a problem with the pronunciation components (dictionary + letter-to-sound) – it could be a problem earlier or later in the pipeline. You need to play detective and be precise about the underlying cause of every error you report.

    You might discover errors in other categories too. That’s fine: you may wish to incorporate some discussion of this in the last section of your report (discussing the TTS pipeline as a whole).

    Use the SayText command to synthesise text that you think will produce errors given what you’ve already found out about the awb voice and the TTS pipeline it uses.   If you’re stuck you might try generating sentences from an external text for inspiration (e.g., from a news website, or a novel).
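
    For instance, you might design one probe per category along these lines (these are illustrative probes, not guaranteed errors – part of the task is finding inputs that actually trip this voice up; waveform generation problems are usually found by listening rather than by designing text):

    festival> (SayText "Flight BA1472 leaves at 6:05pm on 3/11/24.")  ; text normalisation
    festival> (SayText "They refuse to collect the refuse.")          ; POS tagging/homographs
    festival> (SayText "He wore a chartreuse quokka costume.")        ; pronunciation
    festival> (SayText "The old man the boats.")                      ; phrase breaks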

    For each utterance you analyse, store the results in a variable. You will need to examine the contents of this utterance structure to decide what type each error is.

    You will often be able to identify the source of the error from the relations generated by SayText directly (i.e. without running the pipeline step-by-step). However, in some cases you may also need to run Festival step-by-step (as in the previous part of the exercise) – though not for every utterance, as that would be too slow. The crucial thing for the write-up is that you provide evidence of where and why the error occurred.
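
    For example, to work out whether a mispronounced word went wrong at normalisation, tagging, or pronunciation, you can inspect the relevant relations in turn (a sketch):

    festival> (set! myutt (SayText "I read the book yesterday."))
    festival> (utt.relation.print myutt 'Token)    ; how the text was split up
    festival> (utt.relation.print myutt 'Word)     ; words, POS tags, phrase breaks
    festival> (utt.relation.print myutt 'Segment)  ; the phones actually synthesised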

    Skills to develop in this assignment

    • use SayText to synthesise lots of different sentences
    • precisely pinpoint audible errors in the output (e.g., which word, syllable or phone)
    • understand that errors made by later steps in the pipeline might be caused by erroneous input; in other words, the mistake actually happened earlier in the pipeline
    • understand that mistakes can happen in both the front end and the waveform generator
    • make a hypothesis about the cause of the mistake
    • trace back through the pipeline to find the earliest step at which something went wrong
    • obtain evidence to test your hypothesis, by inspecting the Utterance and/or the waveform and/or the spectrogram
  5. Writing up
    Your submission for this assignment will be a written report, analyzing errors made by a specific Festival TTS configuration and discussing the pipeline as a whole.
    1. Report Write-up
      Write up your findings from the exercise in a lab report, to show that you can make connections between theory and practice.

      Report Write-up

      The assignment will be assessed based on your written report. The lab report should have a clear structure and be divided into 6 sections.

      1. Text Normalization
      2. POS tagging/Homographs
      3. Pronunciation
      4. Phrase break detection
      5. Waveform generation
      6. TTS pipeline discussion

      Each section will be worth 10 marks (60 marks total).  This assignment is worth 30% of your final course mark.

      The first 5 sections of your report should focus on identifying and discussing specific types of errors for the unit selection voice (awb) that you investigate in the labs. In the last section (6), we ask you to reflect on what you’ve discovered about the TTS pipeline for the awb voice more generally.

      Please note: In 2024-25, you do not have to include a separate background section in this assignment report (unlike in some previous years). You may still find it helpful to draft out some text and/or figures describing how the unit selection TTS pipeline works in practice and theory, even though it won’t be submitted as part of your final report. You can use it to check your understanding with the lecturers and tutors.

      Of course, you should incorporate some background material about the Festival pipeline in your error analyses and discussion text.

      More details on what you should include for each section are given below.

      Sections 1-5: Error Analyses

      For each of the error analysis sections (1-5), you should at least:

      • Clearly describe exactly the error you have found. That is, tell us:
        • What the text input to Festival was
        • What the expected (i.e., correct) output should have been
        • What the actual output was
      • Clearly explain what TTS pipeline component is responsible for the error and why the error happened in that component. Provide evidence to support your explanation.

      Try to be precise in pointing out how the error appears in the linguistic specification and/or the generated waveform (as appropriate). You may find it useful (in some cases necessary) to use tables or figures to illustrate the errors or explanations of where they came from.

      Beyond this, you can get more marks for showing more depth of analysis, for example:

      • Further analysis/more evidence of mistakes’ origins, e.g. if it’s due to interactions of modules
      • Analysis of severity of mistakes, e.g., Would it affect output frequently or only rarely?
      • Further analysis of the error from a phonetics perspective, e.g., why does it sound bad? Why might the error affect speech intelligibility or naturalness? Is the error something a human speaker might also make?
      • Discussion of potential solutions
      • Use of citations to further support your claims/argument
      • Other relevant insights into the errors, especially those that highlight differences between theory and practice.
      • Use of figures, tables, other visualizations to make your answer more informative. It will be much easier to report some types of errors with tables, but some will require figures (e.g., waveform errors).

      You can describe more than one error per category, but note that you’ll likely get more marks for showing more depth of understanding about the category as a whole. So, several very similar examples will probably not get you as many marks as an in-depth analysis of a specific example that really shows that you understand what that particular part of the TTS pipeline does (and potentially how it interacts with other parts of the TTS process).

      Section 6: TTS pipeline discussion

      In the TTS pipeline discussion section, you should discuss the overall implications of your error analysis. When formulating your answer, consider the questions raised in the task description.

      When writing up your discussion, imagine that you are a consultant and have been asked to provide recommendations to the company that produced this voice. They have a limited budget and limited time to spend on improving the voice. What should they focus their attention on to improve the voice? What are the most important problems to fix? What are the potential tradeoffs between the usefulness or generalisability of potential solutions, and the effort that would be needed to implement those solutions?

      Assume here that the company is committed to sticking with a unit selection voice, so focus on recommendations regarding this voice (i.e., don’t just recommend switching to a neural network based approach!).

      You can consider errors that you found that you didn’t write up in the first part of the report here. However, if you do introduce new error examples in this discussion, you will need to explain them clearly.

      Bibliography

      You should provide references for any claims you make outside of what has been covered in the essential course materials. It is not necessary to cite course lectures or videos, though you may wish to reference the readings and other related works.

      You can gain extra marks for bringing in insights from related work to strengthen your argumentation (i.e., academic papers that are not on the essential reading list). However, you should be able to complete this assignment to a good standard working just with the core course materials.

      Writing style

      The writing style of the report should be similar to that of a journal paper. Don’t list every command you typed! Say what you were testing and why, what your input to Festival was, what Festival did and what the output was. Use diagrams (e.g., to explain parts of the pipeline, or to illustrate linguistic structures) and annotated waveform and spectrogram plots to illustrate your report. It may not be appropriate to use a waveform or spectrogram to illustrate a front-end mistake. Avoid using verbatim output copied from the Terminal, unless this is essential to the point you are making.

      Additional tips

      Give the exact text you asked Festival to synthesise, so that the reader/marker can reproduce the mistakes you find in Festival (this includes punctuation!). Always explain why each of the mistakes you find belongs in a particular category. For example, differentiate carefully between

      • part of speech prediction errors that cause the wrong entry in the dictionary to be retrieved
      • errors in letter-to-sound rules
      • waveform generation problems (e.g. an audible join)

      Since the voice you are using in Festival is Scottish English, it is only fair to find errors for that variety of English, so take care with your spelling of specific input texts! You may find it helpful to listen to some actual speech from the source: Prof Alan Black. Quite conveniently for us, you can in fact listen to Alan Black talk about TTS to study his voice and the subject matter at the same time!

      Working with other people (and machines)

      It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.

      Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).

      If want to try to use ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also the voice we are using is different to the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment.

    2. Marking Guidance
      Some guidance on how this assignment will be assessed

      How will this assignment be marked?

      For this assignment, we are looking to assess your understanding of how (concatenative) TTS works. The report should not be a discursive essay, nor merely documentation of what you typed into Festival and what output you got. We want to see what you’ve learned in the lab and how you can relate that to the theory from lectures.

      You will get marks for:

      • Completing all parts of the practical, and demonstrating this in the report
      • Providing interesting errors made by Festival, with correct analysis of the type of error and of which module made the error.
      • Providing an analysis of the severity and implications of different types of errors and how they can affect the usefulness of a TTS voice more generally.

      If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

      As mentioned above, you can get higher marks by adding some more depth to your analysis in each section (See report write-up guidance).

      We use a positive marking approach in Speech Processing.  That means that we look for things to give you marks for rather than taking away points from a maximum score.  So, if you write something incorrect, it will be ignored in terms of assigning marks.  However, markers may provide comments explaining why something you wrote isn’t quite right.

      You will not get marked down for typos or grammatical errors, but do proofread your work. Clear, easy-to-follow writing will help your markers give you the credit you deserve!

       
    3. Formatting instructions
      Specification of report formatting: word limits and other rules that you must follow

      Report Format

      You must submit a single document in PDF format. When submitting to Learn, the name of the file you upload and the electronic submission title must be in the format “{exam number}_{lab report word count}” and nothing else (e.g., “B012345_1459.pdf”)

      At the beginning of the document, you must include:

      • The title of your report
      • Your exam number (i.e., the one that usually starts with “B”, not your student number that starts with “s”)
      • The word count for your report

      Write your exam number at the top of every page (as a header).

      Marking is strictly anonymous. Do not include your name or your student number – only provide your exam id!

      Sections

      For the rest of the report, please use the section headings as follows (and as discussed in the report write-up guidance):

      1. Text Normalization
      2. POS tagging/Homographs
      3. Pronunciation
      4. Phrase break detection
      5. Waveform generation
      6. TTS pipeline discussion

      Please also include two additional sections which do not count towards the word limit:

      • Bibliography
      • Declaration of AI use

      Fonts and spacing

      Please use a line spacing of 1.5 and a minimum font size of 11pt (this applies to all text, including within figures, tables, and captions).

      Word limit

      The word limit for this assignment is 1500 words. This includes:

      • Main body text
      • Footnotes
      • Figure and table captions

      This excludes:

      • Section headings
      • Text within figures and tables, such as examples, numerical data, and phonetic transcriptions
      • The bibliography, and in-text citations

      In 2024-25, the 1500 word limit is strict – there is no +10% allowance for word count. You can, of course, submit less than 1500 words. However, markers reserve the right to apply a penalty for going over the word limit, or simply not to read anything over the word limit. If applied, the mark penalty will be 1 point per 100 words.

      Each section of the report is worth the same amount, so you should aim to write around the same amount for each section (i.e., around 250 words). However, there is no specific word limit per section (just the overall 1500 word limit for the whole report).

      Note: we are aware that Turnitin generally produces inflated word counts. Thus, it is important that you report your accurate word count at the beginning of your report. If you use Overleaf for creating your report in LaTeX, you should be fine using the word count provided there.   Markers will generally accept the word count you give as accurate.  However, they may choose to do their own word count if they feel that the given word count is not accurate.

      While there is a word limit, there is no page limit. Large margins are fine, but avoid blank pages.

      Figures, graphs and tables

      You should ensure that figures and graphs are large enough to read easily and are of high quality (with a very strong preference for vector graphics and, failing that, high-resolution images). There is no page limit, so there is no reason to have very small figures!

      Your work may be marked electronically or we may print hardcopies on A4 paper, so it must be legible in both formats. In particular, do not assume the markers can “zoom in” to make the figures larger.

      You are strongly advised to draw your own figures which will generally attract a higher mark than a figure quoted from another source. Figures which are clearly just screenshots or copies of other papers/textbooks won’t get you any extra marks.

      Tables must have column or row headers, as appropriate.

      There is no limit on the number of figures, graphs, and tables you can include. However, any figures/graphs/tables you do include should be properly referred to and discussed in the main text. All of these should have proper captions.

      References

      We generally prefer APA style references (i.e. Author, Year). Other citation styles are also ok as long as they are implemented correctly and consistently.

      Declaration of use of AI

      Please briefly describe whether you used any external Artificial Intelligence (AI) related tools in doing this assignment. This includes grammar checking and use of generative AI based chat apps to investigate the topic. If you used tools based on large language models (e.g. ChatGPT), describe the prompts that you used in doing this work. If you did not use these tools in your work, you can simply write “none”.

      Text in this section does not count towards the 1500 word limit.

      Remember, you should not use generative AI to generate text for your submitted report.  Please see the course AI policy here:  Speech Processing AI policy.  Please be honest!  You will not be penalised for use of these tools as long as they adhere to this policy.

    4. Tips on writing
      These apply to the lab report for this exercise, but will help you improve your scientific writing more generally. There are tips about both content and writing style.

      1. Content of the lab report

      These tips are provided as a set of slides. They concentrate on content. There is no video for this topic.

      Warning: The actual format of the report has changed since these slides were made.  However, they still give some good general advice on writing.

      2. Scientific Writing – Style and presentation

      There are slides for this video.

      3. Scientific Writing – Saying more, in less space, and with fewer words

      There are slides for this video.

      4. Scientific Writing – figures, graphs, tables, diagrams, …

      There are slides for this video.

      5. Avoiding ambiguity

      These tips are provided as a set of notes. They concentrate on the word “this”. There is no video for this topic.

