We have now completed all of the front-end text processing that we need for TTS and we’re ready to generate a waveform. The next module describes one way to do that, involving the concatenation of pre-recorded units of speech. We’ll choose units that match our predicted pronunciation, and then use signal processing to impose our predicted prosody.
After this module you should be able to:
- Describe what the goal of the TTS front-end is
- Explain what a linguistic specification is in theory
- Explain why text normalization is necessary for TTS and give examples of types of normalization in terms of tokenization, non-standard words and word sense disambiguation (e.g. POS tags)
- Explain what a phoneset is and why it might differ for different dialects of the same language
- Describe what you’d expect to find in a pronunciation dictionary
- Explain why we need both pronunciation dictionaries and letter-to-sound rules in the TTS front-end
- Explain why we need to analyze the data in terms of phone-level pronunciations and prosodic features
- Explain the difference between a phoneme and an allophone, and how this might relate to the construction of pronunciation dictionaries and letter-to-sound rules
- Explain how rules are structured and applied using a decision tree
- Describe a method for deciding how to order the questions in a decision tree
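To make the idea of rules structured as a decision tree concrete, here is a minimal sketch of a handwritten tree for one front-end task: deciding how a token containing digits should be read aloud. The function name `classify_number_token`, the output classes and the small list of context words are all assumptions made up for illustration; they are not taken from the course materials or from any particular TTS system.

```python
import re

# Hypothetical set of left-context words that hint at a year reading.
YEAR_CONTEXT_WORDS = {"in", "since", "until", "by"}


def classify_number_token(token: str, prev_word: str = "") -> str:
    """Walk a small handwritten decision tree and return a reading class.

    Each `if` is one yes/no question in the tree; each `return` is a leaf.
    """
    # Question 1: does the token start with a currency symbol?
    if token.startswith(("$", "£")):
        return "money"
    # Question 2: is the token exactly four digits?
    if re.fullmatch(r"\d{4}", token):
        # Question 3: does the previous word suggest a date?
        if prev_word.lower() in YEAR_CONTEXT_WORDS:
            return "year"
        return "cardinal"
    # Question 4: is it digits followed by an ordinal suffix?
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "ordinal"
    # Question 5: is it made up only of digits?
    if token.isdigit():
        return "cardinal"
    # Leaf: not something this tree knows how to handle.
    return "other"


print(classify_number_token("1984", prev_word="in"))
print(classify_number_token("1984", prev_word="about"))
print(classify_number_token("3rd"))
```

Note that the order of the questions matters: a token is only tested against later questions if it answered "no" to the earlier ones. Here the order was chosen by hand; when a tree is learned from data, the order is chosen automatically.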
What you should know
- What’s the overall purpose of the TTS front-end? What’s a linguistic specification?
- Tokenization and normalization: Why do we need to do this? What are Non-Standard Words? What ambiguities do we need to resolve?
- Handwritten rules, Finite State Transducers:
- We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree.
- We won’t ask you to come up with a Finite State Transducer but we may ask you to interpret what a given one does to a specific input (e.g. for text normalization)
- Phonemes and allophones:
- You should know what the difference between a phoneme and an allophone is and how these potentially relate to deriving pronunciations.
- There won’t be any phon “data” problems, e.g. deriving that something is an allophone
- Pronunciations:
- Explain what phone sets are and why different ones may be appropriate for different TTS voices (e.g. CMUDict vs Unilex)
- Explain what should be included in the TTS pronunciation dictionary
- Explain why we also need Letter-to-Sound (grapheme to phoneme) rules
- We won’t ask you about the letter-to-phone alignment method described in J&M 8.2.3 (though we may ask about learning pronunciations via decision trees as discussed in the videos/lecture)
- Prosody: It’s fine to focus on what was covered in this module’s videos (also see the module 6 prosodic structure video for more info).
- Explain why we want to predict prosodic features for TTS, e.g. rhythm, intonation, phrasing, emphasis
- Explain which aspects of the text you might want to consider when predicting prosodic features, e.g. think about assignment 1
- We won’t ask you about specific prosodic transcription methods, e.g. ToBI pitch accents or boundary tones, but we may ask why you might want to predict something like ToBI labels in a TTS front-end system.
- We won’t ask about tf.idf or accent ratio in J&M 8.3.2, or the content in 8.3.4 (“Other Intonation models”) to 8.3.6
- Decision tree, Learning decision trees:
- What is a decision tree? How do we interpret it?
- For what sort of TTS front-end tasks might you want to use a decision tree?
- What is entropy (from an information theory/probability point of view) and how do we use it to learn decision trees from data? A high-level understanding will suffice, e.g. look at two distributions (e.g. counts over categories) and say which has higher entropy (e.g. as in the videos/lecture).
- We may ask you to come up with some rules to solve a specific TTS front-end task in the form of a decision tree (but we won’t ask you to derive one from data in the exam).
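As a revision aid for the entropy point above, here is a small sketch that computes entropy from counts over categories and uses it to compare two candidate questions for a decision tree. The helper names `entropy` and `information_gain` and all the counts are invented for illustration; the sketch assumes each list of counts is non-empty and sums to more than zero.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of the distribution given by category counts."""
    total = sum(counts)
    # Categories with zero count contribute nothing, so we skip them.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts_list):
    """Reduction in entropy from splitting the parent into the children."""
    total = sum(parent_counts)
    # Weight each child's entropy by the fraction of data that reaches it.
    weighted = sum(
        (sum(child) / total) * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted


# Two distributions over two categories: evenly split vs. mostly one class.
even = [5, 5]
skewed = [9, 1]
print(entropy(even))    # 1.0 bit: maximally uncertain for two categories
print(entropy(skewed))  # lower: one category dominates

# Two made-up candidate questions applied to the same parent data [5, 5].
gain_a = information_gain([5, 5], [[5, 0], [0, 5]])  # separates the classes
gain_b = information_gain([5, 5], [[3, 2], [2, 3]])  # barely helps
print(gain_a, gain_b)
```

The distribution that is closer to uniform has the higher entropy. When learning a tree, one common approach is to ask first the question whose split reduces entropy the most, which here would be question A.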
Key Terms
- front-end, back-end
- linguistic specification
- tokenisation
- normalisation
- Non-Standard Word (NSW)
- homograph
- finite state transducer
- phoneme, allophone
- pronunciation dictionary, lexicon
- phone set
- Letter-to-Sound
- Grapheme-to-phoneme
- Prosody
- intonation
- pitch
- loudness
- duration
- phrase break
- prominence
- decision tree
- entropy
- classification