Accessing Festival
These instructions assume you are using the installation of Festival on the computers in the PPLS Appleton Tower (AT) labs. You can work on the computers in the physical lab or you can use the remote desktop service (see Module 0 for instructions).
Note: The PPLS AT Lab computers/remote desktop we are using for this course are completely separate from the Informatics remote desktop/DICE! Importantly, the voice setup we will be using is not available on DICE.
It is possible to install Festival directly onto your own computer, but this is not necessary for students taking Speech Processing. Installation requires a unix-like environment (Linux or macOS, or a Linux-style terminal running on Windows) and compiling code from source (see guidance here: Install Festival). If you’ve never compiled code before, and don’t have much experience with the unix command line, your best bet is to use the PPLS AT Lab computers.
Accessing Festival Remotely
You can use the installation of Festival on the Appleton Tower lab servers using the remote desktop service. To connect using the remote desktop, follow the instructions here: Module 0 – computing requirements
Once you’ve started the remote desktop and logged in (with your UUN and EASE password), you can open the Terminal app by going to the menu bar at the top and clicking Applications > System Tools > Terminal. You may want to drag the Terminal icon to the desktop to make it easier for you to find it.
If you accidentally close VNC before you log out, you can reconnect by double clicking on the machine you previously logged onto in VNC viewer.
When you are finished, remember to log out using the menu at the top of the screen: System > Log out
Assignment Data
If you are using the remote desktop to access the AT lab computers (or are physically in the lab), all the relevant data is already there for you on the Linux machines.
If you have installed Festival on your own computer, you will need to get the voice database and dictionaries used to run the voice (voice:cstr_edi_awb_arctic_multisyn, dictionary:unilex). You can find instructions by following this link.
Start Festival
Festival has a command line interface which runs in the terminal (i.e. the unix bash shell). To start it in the PPLS AT lab, you’ll need to:
- make sure the computer is booted into Linux (if it is in Windows, restart the machine and select the penguin (the Linux mascot!) when presented with the choice);
- open a terminal via Applications > System Tools > Terminal from the menu bar at the top left of the screen. You can drag the Terminal icon from the menu to the desktop if you want to make a shortcut.
Now run Festival by typing festival at the $ prompt:
$ festival
You should see some text about the version of Festival we are using (Festival 2.5.0):
Festival Speech Synthesis System 2.5.0:release December 2017
Copyright (C) University of Edinburgh, 1996-2010. All rights reserved.
..etc
and the prompt will also change to show the following:
festival>
This new prompt means that Festival is running; any commands that you type must now be in the Scheme language and will be interpreted by Festival rather than by the shell.
You will be pleased to know that Festival’s command-line interface uses the same keyboard shortcuts as the bash shell (e.g., TAB completion, ctrl-a, ctrl-e, ctrl-p, ctrl-n, up/down/left/right cursor keys, etc.). Here’s a nice cheat sheet for common bash commands. For a comprehensive list of these shortcuts, see the Wikipedia entry for GNU Readline.
If you get into trouble at any point and need to exit the current command, use ctrl-c. This applies to both Festival and the bash shell.
It’s really worth learning these keyboard shortcuts because they also apply to the bash shell and will save you a lot of time.
Make Festival speak
Synthesise some sentences to become familiar with the Festival command line.
Festival contains a number of different synthesis engines and for each of these, several voices are available: the quality of synthesis is highly dependent on the particular engine and voice that is being used.
Using the SayText command
By default, Festival will start with a rather old diphone voice, which does not sound great, but is fast and intelligible:
festival> (set! myutt (SayText "Welcome to Festival"))
This command combines several steps: it converts the input text “Welcome to Festival” to a linguistic specification and uses that specification to generate speech by selecting appropriate diphones. The set! myutt part of the command tells Festival to store all the information relating to how the utterance was synthesized in the variable called myutt.
You can set the voice to the one we will use in the assignment by typing the following after starting festival:
festival> (voice_cstr_edi_awb_arctic_multisyn)
You’ll see some “EST warning” messages printed to the screen, but you can ignore those.
Now try generating a sentence as you did before with the SayText command. Can you hear a difference between the two voices?
Generating speech without playing it
To generate an utterance without playing it, use the following steps instead of SayText:
festival> (set! myutt (Utterance Text "Hello"))
festival> (utt.synth myutt)
Then you can save the utterance myutt as a wave file called “myutt.wav” with the following command:
festival> (utt.save.wave myutt "myutt.wav" 'riff)
This will save a file called myutt.wav in whatever directory you started Festival in. If you just opened a terminal and started Festival without changing directories, you will be in your ‘home’ directory. You can check the folder on the desktop called [your username]’s home and see if a new wav file has appeared there. Otherwise, you can exit Festival by pressing ctrl-d and typing the command pwd: this will tell you where you currently are, your “present working directory”.
Note: When you issue a command to Festival you must put it in round brackets (...); if you do not, it will generate an error. You are using a language called Scheme.
Scheme, and lots of brackets
Scheme is a LISP-like language used as the interface to Festival. When you run Festival in interactive mode, you talk to Festival in Scheme. Fortunately, we’re not going to have to learn too much Scheme. All you need to know for now is that the basic syntax is (function_name argument1 argument2 ...).
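To see how the bracket syntax nests, here is a minimal Python sketch of a reader for such expressions. It is not how Festival itself parses Scheme; the function names tokenize and parse are made up for illustration, and the sketch ignores escaped quotes and comments.

```python
def tokenize(text):
    """Split a Scheme expression into brackets, quoted strings, and atoms."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        elif ch == '"':
            end = text.index('"', i + 1)  # assumption: no escaped quotes
            tokens.append(text[i:end + 1])
            i = end + 1
        else:
            start = i
            while i < len(text) and not text[i].isspace() and text[i] not in '()"':
                i += 1
            tokens.append(text[start:i])
    return tokens


def parse(tokens):
    """Turn a token list into nested Python lists, one list per bracket pair."""
    token = tokens.pop(0)
    if token != "(":
        return token  # an atom or a quoted string
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing bracket
    return items


expr = parse(tokenize('(set! myutt (SayText "Welcome to Festival"))'))
print(expr)
```

The outer list holds the function name set! and its two arguments; the second argument is itself a list, which is the inner (SayText ...) call.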
In Scheme, all functions return a value, which by default is printed after the function completes. The SayText function returns an Utterance structure, which is why a line beginning #<Utterance (followed by a memory address that varies from run to run) is printed after the function completes. A variable (myutt in this case) can be set to capture this return value, which will allow us to examine the utterance after processing. This is done using the set! command (note the two sets of brackets):
festival> (set! myutt (SayText "Welcome to Festival"))
#<Utterance 0x...>
The TTS process
We can now examine the contents of the myutt variable. The SayText function is a high-level function which calls a number of other functions in a chain. Each of these functions performs a specific job, such as deciding the pronunciation of a word, or how long a phone should be. We’ll be running these step-by-step later on.
The TTS process in Festival is a pipeline of sub-processes, which build up an Utterance structure in stages. This building process takes the original text as input and adds more and more information, which is stored in the utterance structure. In Festival, a unified mechanism for representing all types of data needed by the system has been developed: this is called the Heterogeneous Relation Graph system, or HRG for short.
Each Relation in an HRG is a structure that links items of a particular linguistic type. For example, we have a Word relation which is a list linking all the words, and a Segment relation which links all the phones etc. Relations can take different forms: the most common types are linear lists and trees.
Each module in Festival takes a number of relations as input and either creates new relations as output, or modifies the input ones. The vast majority of modules only write new information, leaving all information in the input untouched (there are a few exceptions, such as post-lexical processing). Because of this, examining the contents of the relations in an utterance after processing gives an insight into the history of the TTS process.
Different configurations of Festival can vary with respect to their use of HRGs, and which modules they call.
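The idea of modules adding relations to an utterance can be sketched in a few lines of Python. This is a toy model, not Festival's implementation: the class names, the three module functions, and the TOY_POS table are all hypothetical, and every relation is simplified to a plain list (a real HRG also supports trees and lets relations share items).

```python
class Item:
    """One node in a relation, carrying named features."""
    def __init__(self, **features):
        self.features = dict(features)


class Utterance:
    """Holds the input text and a growing set of named relations."""
    def __init__(self, text):
        self.text = text
        self.relations = {}  # relation name -> list of Items


def token_module(utt):
    # Creates a new relation from the raw text.
    utt.relations["Token"] = [Item(name=tok) for tok in utt.text.split()]


def word_module(utt):
    # Reads Token, writes a new Word relation; Token is left untouched.
    utt.relations["Word"] = [
        Item(name=tok.features["name"].lower()) for tok in utt.relations["Token"]
    ]


TOY_POS = {"hello": "uh", "world": "nn"}  # hypothetical lookup table


def pos_module(utt):
    # Adds a feature to existing Word items rather than creating a relation.
    for word in utt.relations["Word"]:
        word.features["pos"] = TOY_POS.get(word.features["name"], "unknown")


utt = Utterance("Hello world")
for module in (token_module, word_module, pos_module):
    module(utt)

print(list(utt.relations))
for word in utt.relations["Word"]:
    print(word.features)
```

Because the Token relation is never overwritten, it still holds the original capitalised text after the later modules have run, which is what makes the finished utterance a record of the processing history.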
Examining a saved object
Once you have synthesised an utterance you can do lots of things with it. Here are a few examples.
festival> (utt.play myutt)
festival> (utt.relationnames myutt)
festival> (utt.relation.print myutt 'Word)
festival> (utt.relation.print myutt 'Segment)
You can get a list of the relations that are present in a synthesised utterance by using the utt.relationnames command.
Relations that are lists can easily be printed to the screen with the utt.relation.print command. Try this with all of the relations in an utterance. Some of them won’t reveal useful information, others will.
The output from (utt.relation.print myutt 'Word) may look like this:
()
id _3 ; name hello ; pos_index 16 ; pos_index_score 0 ; pos uh ; phr_pos uh ; phrase_score -13.43 ; pbreak_index 1 ; pbreak_index_score 0 ; pbreak NB ;
id _4 ; name world ; pos_index 8 ; pos_index_score 0 ; pos nn ; phr_pos n ; pbreak_index 0 ; pbreak_index_score 0 ; pbreak B ; blevel 3 ;
nil
Each data line starts with an id number like id _3, followed by a series of features separated by semicolons. Each feature has a name and a value, e.g., feature name: pos, feature value: uh.
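If you copy a data line out of the terminal, the name/value structure is easy to pick apart. The following Python sketch assumes each feature is written as a name, a space, then a value, and that values never contain a semicolon; parse_item is a made-up helper, not part of Festival.

```python
def parse_item(line):
    """Parse one line of utt.relation.print output into a feature dictionary."""
    features = {}
    for field in line.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the final semicolon
        name, value = field.split(None, 1)
        features[name] = value  # values are kept as strings
    return features


line = ("id _4 ; name world ; pos_index 8 ; pos_index_score 0 ; pos nn ; "
        "phr_pos n ; pbreak_index 0 ; pbreak_index_score 0 ; pbreak B ; blevel 3 ; ")
item = parse_item(line)
print(item["name"], item["pos"], item["pbreak"])
```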
Examining the processing steps
Tokens – First the text is split into Tokens. Look at the Token relation, where an item is created for each component of the text you input. The Token relation will still have digits and abbreviations in it.
Words – The Tokens are then converted to Words: abbreviations and digits are processed and expanded. Look for this in the Word relation.
Part of Speech Tagging – Each word is tagged with its part of speech, which is added as a feature to the Word relation.
Pronunciation – The pronunciation of each word is determined and the Syllable and Segment relations are created. Examine these: the Syllable relation is not very interesting, as it contains very little information, just a count of the syllables.
You can look up the pronunciation of a word yourself with the function lex.lookup:
festival> (lex.lookup "caterpillar")
("caterpillar" nil (((k ae t) 1) ((ax p) 0) ((ih l) 1) ((er) 0)))
The actual pronunciation returned depends on which lexicon a particular voice uses, and whether the word is in the lexicon or if Festival has to predict the pronunciation using letter-to-sound rules.
Try looking up the pronunciation of some real words, and some made up ones.
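A lexicon entry is a nested list: the word, a part-of-speech field (nil here), and a list of syllables, each pairing its phones with a stress value. As a rough illustration, the Python sketch below writes the caterpillar entry out by hand as nested Python data (None stands in for nil) and pulls out the phone sequence and stress pattern. The variable names are just for this example.

```python
# The entry returned by (lex.lookup "caterpillar"), transcribed by hand.
entry = (
    "caterpillar",
    None,  # part-of-speech field, nil in the Festival output
    [(["k", "ae", "t"], 1), (["ax", "p"], 0), (["ih", "l"], 1), (["er"], 0)],
)

word, pos, syllables = entry

# Flatten the syllables into one phone sequence.
phones = [phone for syl_phones, _ in syllables for phone in syl_phones]

# Keep just the stress value of each syllable.
stress = [stress_value for _, stress_value in syllables]

print(word, len(syllables), "syllables")
print(" ".join(phones))
print(stress)
```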
Accent Prediction – An intonation module assigns pitch accents (and other intonational events) to syllables. A number of different modules exist within Festival, operating with a number of intonation models including ToBI and Tilt. The assignment voice (awb) doesn’t actually do accent prediction, but you can see what this would look like by trying the older diphone synthesis voice, kal, which does.
To switch to the kal voice, enter the following in Festival:
festival> (voice_kal_diphone)
Now, look at the IntEvent relation to see which pitch events have been assigned. From the pitch events and the predicted durations, a pitch contour is generated. This contour is a list of numbers which specify the pitch at each point for the resulting waveform. There is no easy way to view the pitch contour.
You can use the following command to change back to the assignment voice:
festival> (voice_cstr_edi_awb_arctic_multisyn)
Waveform generation – The Unit relation is created by making a list of diphones from the segments, and the information about the speech needed for synthesis is copied in. The Unit relation contains features with values in square brackets [...]. These are references to the actual speech used to synthesise these units.
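Making a list of diphones from the segments amounts to pairing each phone with its right-hand neighbour. Here is a minimal Python sketch, assuming the phone sequence is padded with a silence symbol at both ends; the symbol pau, the underscore joiner, and the function name diphones are assumptions for illustration, and the names Festival actually prints in the Unit relation may differ.

```python
def diphones(phones, silence="pau", joiner="_"):
    """Pair each phone with its right-hand neighbour, padding with silence."""
    padded = [silence] + list(phones) + [silence]
    return [left + joiner + right for left, right in zip(padded, padded[1:])]


# The first syllable of "caterpillar" from the lexicon lookup above.
units = diphones(["k", "ae", "t"])
print(units)
```

A sequence of N phones gives N + 1 diphones, because each unit spans the join between two neighbouring phones (or between a phone and silence).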
Quit Festival
festival> (quit)
or use ctrl-d, just like in the bash shell. Festival remembers your command history between sessions (again, just like bash). Next time you start Festival you can use the up cursor key to find previous commands, and then hit ‘Enter’ to execute them again. Of course, Festival does not remember the values of variables (e.g., myutt in the above example) between sessions.
Transferring data from the AT lab servers
To get your data (e.g. generated wav files) from the AT lab servers (e.g. remote desktop) you can either use a terminal-based command like rsync or an SFTP client like FileZilla or WinSCP (graphical interface). For example, the following copies the file myutt.wav that’s in ~/Documents/sp/assignment1 to the directory on your own computer where you run the rsync command:
rsync -avz s1234567@scp1.ppls.ed.ac.uk:Documents/sp/assignment1/myutt.wav ./
Note [24/10/23]: The file transfer server scp1.ppls.ed.ac.uk doesn’t appear to be working right now. We are investigating, but for now you can still transfer files by replacing scp1.ppls.ed.ac.uk with one of the remote desktop addresses, e.g. ppls-atl-1079.ppls.ed.ac.uk (and similarly for references to scp1.ppls.ed.ac.uk below). You’ll need to have the university VPN on to do this. You can see the list of PPLS AT lab remote desktop addresses here: https://resource.ppls.ed.ac.uk/whoson/atlab.php
Note: The previous command will only work if you’ve already made the directories Documents/sp/assignment1 in your home directory on the AT lab servers. If you haven’t done this, you can skip this for now and try it after you’ve created some files.
You can also get files to your own computer by uploading them to OneDrive (or Google Drive) via a browser on the lab machine.
You can also use a file transfer app like FileZilla. In this case, you need to set the remote host to scp1.ppls.ed.ac.uk. For FileZilla, go to File > Site Manager, then set the protocol to SFTP, the host to scp1.ppls.ed.ac.uk, and use your UUN as the username and your EASE password as the password. After connecting you should see your home directory on the AT lab servers as the remote site. You can then drag files from the remote site side to the appropriate place on the local site side.
What you should now be able to do
- start Festival and make it speak using SayText
- capture the Utterance structure returned by SayText
- look inside the Utterance structure at the Relations
- have an initial understanding of what Relations are, but not yet the full picture
- use some of the keyboard shortcuts that are common to Festival and the bash shell
- save a synthesized utterance as a .wav file and transfer it to your own computer.