Report Write-up

Write up your findings from the exercise in a lab report, to show that you can make connections between theory and practice.

The assignment will be assessed based on your written report. The lab report should have a clear structure and be divided into 6 sections.

  1. Text Normalization
  2. POS tagging/Homographs
  3. Pronunciation
  4. Phrase break detection
  5. Waveform generation
  6. TTS pipeline discussion

Each section will be worth 10 marks (60 marks total).  This assignment is worth 30% of your final course mark.

The first 5 sections of your report should focus on identifying and discussing specific types of errors for the unit selection voice (awb) that you investigate in the labs. In the last section (6), we ask you to reflect on what you’ve discovered about the TTS pipeline for the awb voice more generally.

Please note: In 2023-24, you do not have to complete a separate background section in this assignment report (unlike previous years). You may still find it helpful to draft some text and/or figures describing how the unit selection TTS pipeline works in practice and theory, even though it won’t be submitted as part of your final report. You can always use it to check your understanding with the lecturers and tutors.

Of course, you should incorporate some background material about the Festival pipeline in your error analyses and discussion text.

More details on what you should include for each section are given below.

Sections 1-5: Error Analyses

For each of the error analysis sections (1-5), you should at least:

  • Clearly describe exactly the error you have found. That is, tell us:
    • What the text input to Festival was
    • What the expected (i.e., correct) output should have been
    • What the actual output was
  • Clearly explain what TTS pipeline component is responsible for the error and why the error happened in that component. Provide evidence to support your explanation.

Try to be precise in pointing out how the error appears in the linguistic specification and/or the generated waveform (as appropriate). You may find it useful (in some cases necessary) to use tables or figures to illustrate the errors or explanations of where they came from.

Beyond this, you can get more marks for showing more depth of analysis, for example:

  • Further analysis or more evidence of a mistake’s origin, e.g., whether it is due to interactions between modules
  • Analysis of severity of mistakes, e.g., Would it affect output frequently or only rarely?
  • Further analysis of the error from a phonetics perspective, e.g., why does it sound bad? Why might the error affect speech intelligibility or naturalness? Is the error something a human speaker might also make?
  • Discussion of potential solutions
  • Use of citations to further support your claims/argument
  • Other relevant insights into the errors, especially those that highlight differences between theory and practice.
  • Use of figures, tables, other visualizations to make your answer more informative. It will be much easier to report some types of errors with tables, but some will require figures (e.g., waveform errors).

You can describe more than one error per category, but note that you’ll likely get more marks for showing more depth of understanding about the category as a whole. So, several very similar examples will probably not get you as much as an in-depth analysis of a specific example that really shows that you understand what that particular part of the TTS pipeline does (and potentially how it interacts with other parts of the TTS process).

Section 6: TTS pipeline discussion

In the TTS pipeline discussion section, you should discuss the overall implications of your error analysis. When formulating your answer, consider the questions raised in the task description.

When writing up your discussion, imagine that you are a consultant and have been asked to provide recommendations to the company that produced this voice. They have a limited budget and limited time to spend on improving the voice. What should they focus their attention on to improve the voice? What are the most important problems to fix? What are the potential tradeoffs between the usefulness or generalisability of potential solutions, and the effort that would be needed to implement those solutions?

Assume here that the company is committed to sticking with a unit selection voice, so focus on recommendations regarding this voice (i.e., don’t just recommend switching to a neural network based approach!).

You can consider errors that you found that you didn’t write up in the first part of the report here. However, if you do introduce new error examples in this discussion, you will need to explain them clearly.

Bibliography

You should provide references for any claims you make outside of what has been covered in the essential course materials. It is not necessary to cite course lectures or videos, though you may wish to reference the readings and other related works.

You can gain extra marks for bringing in insights from related work to strengthen your argumentation (i.e., academic papers that are not on the essential reading list). However, you should be able to complete this assignment to a good standard working just with the core course materials.

Writing style

The writing style of the report should be similar to that of a journal paper. Don’t list every command you typed! Say what you were testing and why, what your input to Festival was, what Festival did and what the output was. Use diagrams (e.g., to explain parts of the pipeline, or to illustrate a linguistic structure) and annotated waveform and spectrogram plots to illustrate your report. It may not be appropriate to use a waveform or spectrogram to illustrate a front-end mistake. Avoid using verbatim output copied from the Terminal, unless this is essential to the point you are making.

Additional tips

Give the exact text you asked Festival to synthesise, so that the reader/marker can reproduce the mistakes you find in Festival (this includes punctuation!). Always explain why each of the mistakes you find belongs in a particular category. For example, differentiate carefully between

  • part of speech prediction errors that cause the wrong entry in the dictionary to be retrieved
  • errors in letter-to-sound rules
  • waveform generation problems (e.g. an audible join)

Since the voice you are using in Festival is Scottish English, it is only fair to find errors for that variety of English, so take care with your spelling of specific input texts! You may find it helpful to listen to some actual speech from the source: Prof Alan Black. Quite conveniently for us, you can in fact listen to Alan Black talk about TTS to study his voice and the subject matter at the same time!

Working with other people (and machines)

It is fine to discuss how to use Festival and the components of the TTS pipeline with other people in the class. Similarly, it’s ok to discuss error categories with other people. However, you should come up with your own analyses for the report and write up your work individually.

Use of grammar checkers is allowed under university regulations, just as it is fine to ask someone else to proofread your work. However, you should NOT use automated tools such as ChatGPT to generate text for your report (just as it is NOT ok to get someone else to write your report for you).

If you want to try using ChatGPT to help you learn concepts via dialogue, just be aware that it still often gets very basic things wrong in this area. Also, the voice we are using is different from the standard Festival voices, so it will likely give you incorrect details of the TTS pipeline/voice setup we are actually using in this assignment.