Finish

We have now completed all of the front-end text processing that we need for TTS and we’re ready to generate a waveform. The next module describes one way to do that, involving the concatenation of pre-recorded units of speech. We’ll choose units that match our predicted pronunciation, and then use signal processing to impose our predicted prosody.

After this module you should be able to:

  • Describe what the goal of the TTS front-end is
  • Explain what a linguistic specification is in theory
  • Explain why text normalization is necessary for TTS and give examples of types of normalization, in terms of tokenization, non-standard words, and word sense disambiguation (e.g. using POS tags)
  • Explain what a phone set is and why it might differ for different dialects of the same language
  • Describe what you’d expect to find in a pronunciation dictionary
  • Explain why we need both pronunciation dictionaries and letter-to-sound rules in the TTS front-end
  • Explain why we need to analyze the input text in terms of phone-level pronunciations and prosodic features
  • Explain the difference between a phoneme and an allophone, and how this might relate to the construction of pronunciation dictionaries and letter-to-sound rules
  • Explain how rules are structured and applied using a decision tree
  • Describe a method for deciding how to order the questions in a decision tree
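For example, hand-written rules for a front-end task like non-standard word classification can be organised as a decision tree: a sequence of yes/no questions that routes each token to a leaf. A minimal sketch in Python (the questions and categories below are invented for illustration, not from any real system):

```python
# A tiny hand-written decision tree for classifying non-standard words
# (NSWs) during text normalization. The questions and categories are
# illustrative only.

def classify_token(token: str) -> str:
    """Walk a small decision tree: each question splits the tokens
    until a leaf assigns a category."""
    if token.isdigit():                      # Q1: all digits?
        if len(token) == 4:                  # Q2: four digits -> treat as a year
            return "YEAR"
        return "NUMBER"
    if token.isupper() and len(token) <= 4:  # Q3: short all-caps -> letter sequence
        return "LETTER_SEQUENCE"
    if token.endswith("."):                  # Q4: trailing period -> abbreviation
        return "ABBREVIATION"
    return "WORD"                            # leaf: ordinary word

print(classify_token("1984"))  # YEAR
print(classify_token("BBC"))   # LETTER_SEQUENCE
print(classify_token("Dr."))   # ABBREVIATION
```

Note how the order of the questions matters: asking "is it all digits?" first means the later questions never see numeric tokens.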

What you should know

  • What’s the overall purpose of the TTS front-end? What’s a linguistic specification?
  • Tokenization and normalization: Why do we need to do this? What are Non-Standard Words? What ambiguities do we need to resolve?
  • Handwritten rules, Finite State transducer:
    • We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree.
    • We won’t ask you to come up with a Finite State Transducer but we may ask you to interpret what a given one does to a specific input (e.g. for text normalization)
  • Phonemes and allophones:
    • You should know what the difference between a phoneme and an allophone is and how these potentially relate to deriving pronunciations.
    • There won’t be any phonology “data” problems, e.g. deriving that something is an allophone
  • Pronunciations:
    • Explain what phone sets are and why different ones may be appropriate for different TTS voices (e.g. CMUDict vs Unilex)
    • Explain what should be included in the TTS pronunciation dictionary
    • Explain why we also need Letter-to-Sound (grapheme to phoneme) rules
    • We won’t ask you about the letter-to-phone alignment method described in J&M 8.2.3 (though we may ask about learning pronunciations via decision trees as discussed in the videos/lecture)
  • Prosody: It’s fine to focus on what was covered in this module’s videos (also see the module 6 prosodic structure video for more info).
    • Explain why we want to predict prosodic features for TTS, e.g. rhythm, intonation, phrasing, emphasis
    • Explain what aspects of the text you might want to consider in predicting prosodic features (e.g. think about assignment 1)
    • We won’t ask you about specific prosodic transcription methods, e.g. ToBI pitch accents or boundary tones, but we may ask why you might want to predict something like ToBI labels in a TTS front end system.
    • We won’t ask about tf-idf or accent ratio in J&M 8.3.2, or the content in 8.3.4 (“Other Intonation models”) through 8.3.6
  • Decision tree, Learning decision trees:
    • What is a decision tree? How do we interpret it?
    • For what sort of TTS front-end tasks might you want to use a decision tree?
    • What is entropy (from an information theory/probability point of view) and how do we use it to learn decision trees from data? A high-level understanding will suffice, e.g. look at two distributions (e.g. counts over categories) and say which has higher entropy (as in the videos/lecture).
    • We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree (but we won’t ask you to derive one from data in the exam).
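The entropy comparison above can be sketched concretely. A minimal Python version of Shannon entropy over category counts (the example distributions are made up): a pure node has zero entropy, a 50/50 split over two classes has the maximum of 1 bit, and when learning a tree we prefer questions whose child nodes have lower entropy.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([10, 0]))           # pure node: 0.0 bits
print(entropy([5, 5]))            # even 50/50 split: 1.0 bit (the maximum for 2 classes)
print(round(entropy([8, 2]), 3))  # skewed split: 0.722 bits, in between
```

So given two distributions of counts over categories, the more even one has the higher entropy; a good question is one that splits the data into lower-entropy (more predictable) subsets.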

Key Terms

  • front-end, back-end
  • linguistic specification
  • tokenisation
  • normalisation
  • Non-Standard Word (NSW)
  • homograph
  • finite state transducer
  • phoneme, allophone
  • pronunciation dictionary, lexicon
  • phone set
  • Letter-to-Sound
  • Grapheme-to-phoneme
  • Prosody
  • intonation
  • pitch
  • loudness
  • duration
  • phrase break
  • prominence
  • decision tree
  • entropy
  • classification