Report Write-up and Marking Guidance

Write up your findings from the exercise in a lab report, to show that you can make connections between theory and practice. This page gives guidance for what to include in each section and how it will be marked.

How will this assignment be marked?

For this assignment, we are looking to assess your understanding of how (concatenative) TTS works. The report should not be a discursive essay, nor merely documentation of what you typed into Festival and what output you got. We want to see what you’ve learned in the lab and how you can relate that to the theory from lectures.

You will get marks for:

  • Completing all parts of the practical, and demonstrating this in the report
  • Providing interesting errors made by Festival, with correct analysis of the type of error and of which module made the error.
  • Providing a comparison with a more recent TTS system for an example of one error that the more recent system makes.
  • Providing an analysis of the severity and implications of different types of errors and how they can affect the usefulness of a TTS voice more generally.

If you do that (i.e., complete all the requested sections of the report), you are likely to get a mark in the good to very good range (as defined by the University of Edinburgh Common Marking Scheme).

You can get higher marks by adding some more depth to your analysis in each section (see below).

We use a positive marking approach in Speech Processing.  That means that we look for things to give you marks for rather than taking away points from a maximum score.  So, if you write something incorrect, it will be ignored in terms of assigning marks.  However, markers may provide comments explaining why something you wrote isn’t quite right.

You will not get marked down for typos or grammatical errors, but do proofread your work. Clear, easy-to-follow writing will help your markers give you the credit you deserve!

Write-up & Marking Guidance for Report Sections

The assignment will be assessed based on your written report. The lab report should have a clear structure and be divided into 6 sections.

  1. Text Normalization/Word Sense Disambiguation
  2. Pronunciation
  3. Phrase break detection
  4. Waveform generation
  5. Comparison to a recent TTS system
  6. TTS pipeline discussion

You can get up to 10 marks for each of these 6 sections (60 marks total).  This assignment is worth 35% of your final course mark.

  • Sections 1-4 of your report should focus on identifying and discussing specific types of errors for the unit selection voice (awb) that you investigate in the labs.
  • Section 5 should provide your comparison with a newer TTS system (i.e., does Festival make the same error? Why/why not?).
  • In Section 6, we ask you to reflect on what you’ve discovered about the TTS pipeline for the awb voice more generally.

More details of what to include in each section are outlined below.

You will need to incorporate some background material about the Festival pipeline in your error analyses and discussion text.  However, you do not have to include a separate introduction or background section in this assignment report (unlike in some previous years).  You may still find it helpful to draft some text and/or figures describing how the unit selection TTS pipeline works in theory and practice, even though this won’t be submitted as part of your final report.  You can use it to check your understanding with the lecturers and tutors.

Sections 1-4: Festival Error Analyses [4 x 10 marks]

For each of the error analysis sections (1-4), you should at least:

  • [3 marks] Clearly describe exactly the error you have found. That is, tell us:
    • What the text input to Festival was
    • What the expected (i.e., correct) output should have been
    • What the actual output was
  • [3 marks] Clearly explain what TTS pipeline component is responsible for the error and why the error happened in that component.
    • Provide evidence to support your explanation.
  • [4 marks] You can get more marks for showing more depth of analysis, for example:
    • Further analysis/more evidence of a mistake’s origins, e.g., whether it is due to interactions between modules
    • Analysis of the severity of mistakes, e.g., would it affect output frequently or only rarely?
    • Further analysis of the error from a phonetics perspective, e.g., why does it sound bad? Why might the error affect speech intelligibility or naturalness? Is the error something a human speaker might also make?
    • Discussion of potential solutions
    • Use of citations to further support your claims/argument
    • Other relevant insights into the errors, especially those that highlight differences between theory and practice.
    • Use of figures, tables, other visualisations to make your answer more informative. It will be much easier to report some types of errors with tables, but some will require figures (e.g., waveform errors).

Try to be precise in pointing out how the error appears in the linguistic specification and/or the generated waveform (as appropriate). You may find it useful (in some cases necessary) to use tables or figures to illustrate the errors or explanations of where they came from.

You can describe more than one error per category, but note that you’ll likely get more marks for showing more depth of understanding about the category as a whole. So, several very similar examples will probably not earn you as many marks as an in-depth analysis of a specific example that really shows you understand what that particular part of the TTS pipeline does (and potentially how it interacts with other parts of the TTS process).

Section 5: Comparison with a recent TTS system [10 marks]

You can focus on one example from a more recent (non-Festival) TTS system in this section.

  • [3 marks] Clearly describe exactly the error you have found. That is, tell us:
    • What the text input to the non-Festival TTS system was
    • What the expected (i.e., correct) output should have been
    • What the actual output was
  • [2 marks] Clearly show what Festival generates for the same input and explain whether it makes the same (or a different) error
  • [3 marks] Explain and compare the two outputs:
    • If Festival creates the same error, explain which TTS pipeline component is responsible for the error and why the error happened in that component. Provide evidence to support your explanation.
    • If Festival doesn’t create the same error, explain why this is the case.  For example, did one of the pipeline components correct for the error?
  • [2 marks] As for the previous sections, you can get more marks for showing more depth of analysis.
    • You are not expected to fully understand how the more recent TTS systems work at this point.  However, you may get bonus marks for a well-reasoned guess as to why the error occurred, and for highlighting how the more recent systems differ from the Festival setup you have for this assignment.

Caution: Remember that this section as a whole is only worth 10 marks, so reading a lot of papers about newer TTS models for this section will probably not gain you many extra marks!  Make sure you finish the rest of the assignment before going for extra bonus marks.

Section 6: TTS pipeline discussion [10 marks]

In the TTS pipeline discussion section, you should discuss the overall implications of your error analysis. When formulating your answer, consider the questions raised in the task description.

When writing up your discussion, imagine that you are a consultant and have been asked to provide recommendations to the company that produced this voice. They have a limited budget and limited time to spend on improving the voice. What should they focus their attention on to improve the voice? What are the most important problems to fix? What are the potential tradeoffs between the usefulness or generalisability of potential solutions, and the effort that would be needed to implement those solutions?

Assume here that the company is committed to sticking with a unit selection voice, so focus your recommendations on this voice (i.e., don’t just say “switch to a neural network-based approach”!).

Here you can also consider errors that you found but didn’t write up in the first part of the report. However, if you do introduce new error examples in this discussion, you will need to explain them clearly.

As this is a discussion, this section will be marked more holistically (out of 10 marks), but here is a rough guide for mark bands:

  • 1-3 (poor): Something has been submitted for this section, but it’s only marginally relevant or contains many incorrect statements (which hence don’t count under the positive marking scheme)
  • 4 (fair): You’ve had a go at outlining the overall problems in the pipeline, but it’s more of a straight summary of what was found than a discussion of the overall implications. The analysis lacks additional insight into what should be prioritized to improve the voice. Recommendations are perhaps vague or very generic.
  • 5 (good): You’ve identified some overall issues with the pipeline and have attempted to make some recommendations that show understanding of the pipeline as a whole. Recommendations are reasonable and appropriate but might lack further justification or evidence, or might lack a discussion of the generalisability of, or effort required by, the recommendations.
  • 6 (very good): Clear and reasonable recommendations with a good amount of justification and evidence. The discussion shows you understand the unit selection pipeline in both theory and practice. This is a good discussion that also goes further with, e.g., a discussion of generalisability or an analysis of the costs/benefits of the proposed approach. Claims are reasonable but may be missing citations or clear examples that would make the argument stronger.
  • 7-8 (excellent): Recommendations show excellent, extra insight into issues with the TTS pipeline and demonstrate an excellent understanding of the gaps between theory and practice. Recommendations are supported by strong argumentation and evidence, perhaps including further error analysis. Claims are supported by evidence from the report or good citations. Errors not discussed in the previous sections are clearly explained and evidenced. There is some discussion of the generalisability of your recommendations as well as potential drawbacks.
  • 9-10 (excellent++): Everything in the excellent category, plus something extra that goes beyond the scope of the course. The argumentation is flawless and the writing always clear and concise. For a 10, the markers should feel that there is no way the report could be improved within the word limit.

Bibliography

You should provide references for any claims you make outside of what has been covered in the essential course materials. It is not necessary to cite course lectures or videos, though you may wish to reference the readings and other related works.  Similarly, if you are drawing on a paper referenced in the lectures or the course readings, you should reference the actual paper (having had a look at it yourself!).

You can gain extra marks for bringing in insights from related work to strengthen your argumentation (i.e., academic papers that are not on the essential reading list). However, you should be able to complete this assignment to a good standard working just with the core course materials.

Writing style

The writing style of the report should be similar to that of a journal paper. Don’t list every command you typed! Say what you were testing and why, what your input to Festival was, what Festival did, and what the output was. Use diagrams (e.g., to explain parts of the pipeline, or to illustrate linguistic structures) and annotated waveform and spectrogram plots to illustrate your report. It may not be appropriate to use a waveform or spectrogram to illustrate a front-end mistake. Avoid using verbatim output copied from the Terminal, unless this is essential to the point you are making.

Additional tips

Give the exact text you asked Festival to synthesise, so that the reader/marker can reproduce the mistakes you find in Festival (this includes punctuation!). Always explain why each of the mistakes you find belongs in a particular category. For example, differentiate carefully between

  • part of speech prediction errors that cause the wrong entry in the dictionary to be retrieved
  • errors in letter-to-sound rules
  • waveform generation problems (e.g. an audible join)
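
To categorise an error confidently, it helps to inspect the linguistic specification Festival actually produced, rather than relying on what you heard. As a rough sketch (the input sentence below is just a placeholder, and the exact commands available may vary with your lab setup), a Festival session for gathering this kind of evidence could look like:

```scheme
festival> (set! utt1 (SayText "The band played a live set."))
festival> (utt.relation.print utt1 'Word)     ; inspect words (e.g., POS-sensitive items like "live")
festival> (utt.relation.print utt1 'Segment)  ; inspect the predicted phone sequence
festival> (utt.save.wave utt1 "example.wav")  ; save the audio for annotated waveform plots
```

Comparing the Word and Segment relations against the expected pronunciation lets you show exactly where the specification went wrong (e.g., a POS error versus a letter-to-sound error), which is the kind of evidence the marking criteria ask for.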

Since the voice you are using in Festival is Scottish English, it is only fair to find errors for that variety of English, so take care with your spelling of specific input texts! You may find it helpful to listen to some actual speech from the source: Prof Alan Black. Quite conveniently for us, you can in fact listen to Alan Black talk about TTS to study his voice and the subject matter at the same time!

Use of Generative AI: Working with other people (and machines)

It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.

Use of spelling and grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).

If you want to try using ChatGPT to help learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also, the voice we are using is different from the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment.

Examples of GenAI getting it wrong

Here are some examples of ChatGPT giving wrong information about Festival.  You are really better off asking for help in the labs!


Prompt: How does Festival perform word sense disambiguation?

ChatGPT: “Festival is a widely used speech synthesis system, but it doesn’t directly perform word sense disambiguation (WSD) in the same way that dedicated NLP systems or machine learning models do. Festival is primarily focused on generating speech from text, and while it can process various aspects of language to produce natural-sounding speech, word sense disambiguation is typically outside its primary function.”

WRONG – Festival performs explicit WSD (even if only for limited cases)

Prompt: Give me examples of pronunciation errors made by unit selection speech synthesis.

ChatGPT:

“3rd” (third)
Error: The system could mispronounce “3rd” as “three d” (/θriː diː/) instead of “third” (/θɜːrd/).
Cause: Unit selection synthesis systems typically break down the text into individual components, so an abbreviation like “3rd” might be parsed into its individual parts (“three” and “d”) instead of recognizing it as the word “third”.

WRONG – this is a text normalisation error.

WRONG AGAIN – “three d” is very unlikely, should be “three r d”

ChatGPT:

“colonel”

Error: The system may pronounce “colonel” as /ˈkɒlɒnɛl/, rather than the correct /ˈkɜːrnəl/.
Cause: “Colonel” is an example of a word with irregular spelling-to-sound correspondence. The unit selection system may have a recorded unit based on the spelling or a more common English mispronunciation.”

WRONG – recorded units are never “based on the spelling” 

“Proper names, acronyms, and specialized terms might not be in the system’s lexicon. As a result, these words might either be pronounced incorrectly or too literally (like “BMW” being read as “Bee Em Double-U”).”

WRONG – that’s the correct way to say this acronym (i.e., as letters)