Tokenisation

This topic has 3 replies, 3 voices, and was last updated 7 years, 7 months ago by Simon.

- October 12, 2017 at 20:16 #7927
  blue fish
  Student
  From the lecture in module 3, I am not very clear on how words are tokenised.
  
  In the lecture we wrote down features such as whether the text is an abbreviation or a number. If we identify something with one of those features, where do we look to interpret it? E.g., the text has “202”: where do I find the place that turns it into “two hundred and two”?
  
  I tried looking at the lecture notes, but the next part after tokenisation is POS tagging, so am I missing something?
- October 13, 2017 at 10:08 #7929
  Simon
  Professor
  We shouldn’t talk about “words being tokenised” because tokenisation happens before we know anything about words. The input to TTS is a string of characters. Tokenisation splits this long string into small pieces, ready for further processing. The method might be as simple as some rules using whitespace and punctuation. Each small piece might already be a normal word, or it might not: a Non Standard Word (NSW).
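
  To make "rules using whitespace and punctuation" concrete, here is a minimal sketch in Python. It is not how Festival does it; the function name `tokenise` and the two punctuation sets are assumptions chosen for illustration. It splits on whitespace, then peels punctuation off the edges of each piece.

```python
# Hypothetical punctuation sets, chosen only for this illustration.
OPENING = "(\"'"
CLOSING = ".,;:!?)\"'"

def tokenise(text):
    """Split a character string into tokens using whitespace, then
    separate punctuation attached to the start or end of each piece."""
    tokens = []
    for chunk in text.split():
        core = chunk.lstrip(OPENING)
        lead = chunk[:len(chunk) - len(core)]
        stripped = core.rstrip(CLOSING)
        trail = core[len(stripped):]
        tokens.extend(lead)          # each leading punctuation mark is its own token
        if stripped:
            tokens.append(stripped)  # may be a standard word or an NSW; we do not know yet
        tokens.extend(trail)         # each trailing punctuation mark is its own token
    return tokens

print(tokenise("The NATO bill was $3.21, not 202."))
# ['The', 'NATO', 'bill', 'was', '$3.21', ',', 'not', '202', '.']
```

  Note that "$3.21" and "202" come out as single tokens: nothing at this stage knows what they mean. Rules this simple also get some cases wrong (e.g., the final period of "U.S.A." is split off), which is one reason real front ends use more careful rules.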
  
  The exercise in the lecture was not about tokenisation. It was about normalisation, which is usually done in two stages: 1) classify each token as either a standard word, or an NSW of one of a set of types (e.g., abbreviation, money, percentage,…); 2) expand each NSW into normal words, using a specific technique for each type.
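
  The two stages can be sketched as follows, using the "202" example from the question. Everything here is an assumption for illustration: the type label `NUM`, the function names, and above all the `classify` function, which is a single hand-written rule standing in for a real classifier. The number expander only handles 0 to 999.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def expand_number(token):
    """Stage 2 for the hypothetical NUM type: spell out an integer 0..999."""
    n = int(token)
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return [ONES[n]]
    if n < 100:
        words = [TENS[n // 10]]
        if n % 10:
            words.append(ONES[n % 10])
        return words
    words = [ONES[n // 100], "hundred"]
    if n % 100:
        words.append("and")
        words.extend(expand_number(str(n % 100)))
    return words

def classify(token):
    """Stage 1 stand-in: one hand-written rule, not a trained classifier."""
    if re.fullmatch(r"[0-9]{1,3}", token):
        return "NUM"
    return "WORD"

# One expansion technique per NSW type.
EXPANDERS = {"NUM": expand_number}

def normalise(tokens):
    words = []
    for token in tokens:
        nsw_type = classify(token)
        if nsw_type == "WORD":
            words.append(token)  # standard word: pass through unchanged
        else:
            words.extend(EXPANDERS[nsw_type](token))
    return words

print(normalise(["The", "text", "has", "202"]))
# ['The', 'text', 'has', 'two', 'hundred', 'and', 'two']
```

  The point of the dictionary of expanders is that adding a new NSW type means adding one label to the classifier and one function to the table.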
  
  The features needed for the classification step cannot be things like “is it an abbreviation” because that is what the classifier is predicting. We can only use features that can be obtained directly from the character string, such as “Is it all upper case?” or “Does it contain 3 or more consecutive digits?”
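
  As a rough sketch of what such features look like in code, the function below computes a few of them from the token string alone. The feature names and the particular set chosen are assumptions for illustration; only the first two correspond to the examples given above.

```python
import re

def surface_features(token):
    """Features computable directly from the character string.
    None of them asks 'is it an abbreviation?': that is the classifier's job."""
    return {
        "all_upper": token.isupper(),
        "has_3_consecutive_digits": re.search(r"\d{3,}", token) is not None,
        "contains_digit": any(ch.isdigit() for ch in token),
        "starts_with_currency": token.startswith(("$", "£", "€")),
        "ends_with_percent": token.endswith("%"),
    }

for tok in ["NATO", "202", "$3.21"]:
    print(tok, surface_features(tok))
```

  A classifier (hand-written rules, a decision tree, or anything else) would then map these feature values to an NSW type.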
  
  The expansion step involves a specific technique for each type of NSW. For example:
  - ASWD (“as word”) would be downcased and passed to the Letter-to-Sound (LTS) module to be treated like any other Out-of-Vocabulary (OOV) word
  - LSEQ (“letter sequence”) would be split into individual letters, each of which becomes a word; the dictionary will contain pronunciations for all individual letters in the language
  We didn’t cover expansion in any great detail in class. Details can be found in the readings: Jurafsky & Martin 8.1.
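
  The two expansion techniques listed above are simple enough to sketch directly. The function names are assumptions, and the output is only a list of words: pronunciation (LTS or dictionary lookup) happens later and is not shown.

```python
def expand_aswd(token):
    """ASWD: downcase and treat as a single word.
    The result would then go to LTS like any other OOV word."""
    return [token.lower()]

def expand_lseq(token):
    """LSEQ: one word per letter, ignoring any non-letter characters
    such as the periods in 'U.S.A.'."""
    return [ch.lower() for ch in token if ch.isalpha()]

print(expand_aswd("NATO"))    # ['nato']
print(expand_lseq("BBC"))     # ['b', 'b', 'c']
print(expand_lseq("U.S.A."))  # ['u', 's', 'a']
```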
- October 17, 2017 at 16:32 #7952
  Chloe L
  Student
  At what step does expansion happen in Festival?
- October 17, 2017 at 17:58 #7953
  Simon
  Professor
  You can work this out for yourself by running Festival in “step-by-step mode”. Use a sentence that includes a token needing expansion (e.g., “$3.21”) and see at which step it becomes a sequence of words.
  
  Remember that the individual steps (modules) in Festival may each perform multiple processes, so it’s possible that classification and expansion might happen in the same module, or in separate modules. Again, this is something you can work out for yourself in the lab.