Finish

We have now completed all of the front-end text processing that we need for TTS and we’re ready to generate a waveform. The next module describes one way to do that, involving the concatenation of pre-recorded units of speech. We’ll choose units that match our predicted pronunciation, and then use signal processing to impose our predicted prosody.

After this module you should be able to:

Describe what the goal of the TTS front-end is
Explain what a linguistic specification is in theory
Explain why text normalization is necessary for TTS and give examples of types of normalization in terms of tokenization, non-standard words and word sense disambiguation (e.g. POS tags)
Explain what a phoneset is and why it might differ for different dialects of the same language
Describe what you’d expect to find in a pronunciation dictionary
Explain why we need both pronunciation dictionaries and letter-to-sound rules in the TTS front-end
Explain why we need to analyze the data in terms of phone level pronunciations and prosodic features
Explain the difference between a phoneme and an allophone, and how this might relate to the construction of pronunciation dictionaries and letter-to-sound rules
Explain how rules are structured and applied using a decision tree
Describe a method for deciding how to order the questions in a decision tree

What you should know

What’s the overall purpose of the TTS front-end? What’s a linguistic specification?
Tokenization and normalization: Why do we need to do this? What are Non-Standard Words? What ambiguities do we need to resolve?
Handwritten rules, Finite State Transducers:
- We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree.
- We won’t ask you to come up with a Finite State Transducer but we may ask you to interpret what a given one does to a specific input (e.g. for text normalization)
Phonemes and allophones:
- You should know what the difference between a phoneme and an allophone is and how these potentially relate to deriving pronunciations.
- There won’t be any phonology “data” problems, e.g. deriving that something is an allophone
Pronunciations:
- Explain what phone sets are and why different ones may be appropriate for different TTS voices (e.g. CMUDict vs Unilex)
- Explain what should be included in the TTS pronunciation dictionary
- Explain why we also need Letter-to-Sound (grapheme to phoneme) rules
- We won’t ask you about the letter-to-phone alignment method described in J&M 8.2.3 (though we may ask about learning pronunciations via decision trees as discussed in the videos/lecture)
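As a rough illustration of why both a dictionary and Letter-to-Sound rules are needed, here is a minimal Python sketch: look the word up in the lexicon first, and fall back to rules only for out-of-vocabulary words. The lexicon entries, the phone symbols (loosely ARPAbet-style) and the one-letter-at-a-time rules are all made up for this example; they are not taken from CMUDict, Unilex or any real front-end, and real Letter-to-Sound rules use letter context rather than a fixed mapping.

```python
# Hypothetical toy lexicon: word -> list of phones (loosely ARPAbet-style).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "speech": ["s", "p", "iy", "ch"],
}

# Hypothetical, deliberately naive letter-to-sound rules: each letter maps
# to a fixed phone sequence, ignoring its context.
LETTER_TO_PHONES = {
    "a": ["ae"], "b": ["b"], "c": ["k"], "d": ["d"], "e": ["eh"],
    "f": ["f"], "g": ["g"], "h": ["hh"], "i": ["ih"], "j": ["jh"],
    "k": ["k"], "l": ["l"], "m": ["m"], "n": ["n"], "o": ["aa"],
    "p": ["p"], "q": ["k"], "r": ["r"], "s": ["s"], "t": ["t"],
    "u": ["ah"], "v": ["v"], "w": ["w"], "x": ["k", "s"], "y": ["y"],
    "z": ["z"],
}


def pronounce(word):
    """Return (phones, source), where source is 'lexicon' or 'lts'."""
    word = word.lower()
    if word in LEXICON:
        # Known word: trust the hand-checked dictionary entry.
        return LEXICON[word], "lexicon"
    # Unknown word: guess a pronunciation letter by letter.
    phones = []
    for letter in word:
        phones.extend(LETTER_TO_PHONES.get(letter, []))
    return phones, "lts"


if __name__ == "__main__":
    for w in ["speech", "blat"]:
        print(w, pronounce(w))
```

The dictionary handles words whose spelling is a poor guide to their pronunciation (the naive rules above would get "speech" wrong), while the rules guarantee some pronunciation for words that no dictionary will ever list, such as new names.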
Prosody: It’s fine to focus on what was covered in the videos in this module (also see the module 6 prosodic structure video for more info).
- Explain why we want to predict prosodic features for TTS, e.g. rhythm, intonation, phrasing, emphasis
- Explain what aspects of the text you might want to consider when predicting prosodic features (e.g. think about assignment 1)
- We won’t ask you about specific prosodic transcription methods, e.g. ToBI pitch accents or boundary tones, but we may ask why you might want to predict something like ToBI labels in a TTS front end system.
- We won’t ask about tf.idf or accent ratio in J&M 8.3.2, or the content in 8.3.4 (“Other Intonation models”) to 8.3.6
Decision tree, Learning decision trees:
- What is a decision tree? How do we interpret it?
- For what sort of TTS front-end tasks might you want to use a decision tree?
- What is entropy (from an information theory/probability point of view) and how do we use this to learn decision trees from data? A high-level understanding will suffice, e.g. look at two distributions (e.g. counts over categories) and say which has higher entropy (e.g. in videos/lecture).
- We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree (but we won’t ask you to derive one from data in the exam).
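
To make the entropy point concrete, here is a minimal Python sketch. It computes entropy from counts over categories, then compares two yes/no questions by the average entropy left after each split; the question that leaves the lowest entropy is the one to ask first. The task (predicting whether the letter "c" is pronounced /k/ or /s/), the eight training examples and both questions are invented for illustration, and this is a sketch of the general idea rather than the exact procedure from the videos.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_after_split(examples, question):
    """Weighted average label entropy of the two partitions made by a question."""
    yes = [label for features, label in examples if question(features)]
    no = [label for features, label in examples if not question(features)]
    result = 0.0
    for part in (yes, no):
        if part:  # skip an empty partition
            weight = len(part) / len(examples)
            result += weight * entropy(list(Counter(part).values()))
    return result


# Hypothetical training data: context of a letter "c" -> phone it maps to.
# "#" marks a word boundary.
EXAMPLES = [
    ({"prev": "#", "next": "a"}, "k"),  # cat
    ({"prev": "#", "next": "e"}, "s"),  # cell
    ({"prev": "a", "next": "e"}, "s"),  # ace
    ({"prev": "i", "next": "o"}, "k"),  # icon
    ({"prev": "#", "next": "i"}, "s"),  # city
    ({"prev": "#", "next": "o"}, "k"),  # cot
    ({"prev": "i", "next": "e"}, "s"),  # ice
    ({"prev": "a", "next": "o"}, "k"),  # bacon
]

# Two hypothetical candidate questions about the context.
QUESTIONS = {
    "next letter is e, i or y?": lambda f: f["next"] in "eiy",
    "previous letter is a vowel?": lambda f: f["prev"] in "aeiou",
}

if __name__ == "__main__":
    labels = [label for _, label in EXAMPLES]
    print("before any split:", entropy(list(Counter(labels).values())))
    for name, question in QUESTIONS.items():
        print(name, entropy_after_split(EXAMPLES, question))
```

In this toy data the labels start out evenly split between /k/ and /s/, which is the highest entropy two categories can have. Asking about the next letter separates them perfectly, whereas asking about the previous letter leaves each partition just as mixed as before, so a tree learned from this data would ask about the next letter first.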

Key Terms

front-end, back-end
linguistic specification
tokenisation
normalisation
Non-Standard Word (NSW)
homograph
finite state transducer
phoneme, allophone
pronunciation dictionary, lexicon
phone set
Letter-to-Sound
Grapheme-to-phoneme
Prosody
intonation
pitch
loudness
duration
phrase break
prominence
decision tree
entropy
classification