Sentence tokenization in Festival

This topic has 1 reply, 2 voices, and was last updated 5 years, 4 months ago by Simon.

Author

Posts
- October 26, 2019 at 12:04 #10062
  Clem A
  Student
  I’m a bit unclear on how sentence tokenization works in Festival…
  
  In the Jurafsky reading it says sentence tokenization algorithms are “trained on machine learning methods rather than being hand built” (and this explanation is used in one of the examples in the feedback).
  
  But the Festival manual (section 9.1) says of the utterance chunking decision tree: “these are heuristics and written by hand not trained from data”.
  
  Have I got confused about sentence tokenization vs. utterance chunking?
- October 26, 2019 at 12:35 #10064
  Simon
  Professor
  Yes, J&M (2nd edition, Section 8.1.1) are discussing segmenting a longer text into sentences, rather than dividing a sentence into tokens for further processing. The former is the harder problem, for the reasons they explain.
  
  They don’t explicitly discuss the latter, but imply that hand-written rules using whitespace and punctuation would be enough, given that this is what happens to the entire text before a classifier is used to find End-Of-Sentence boundaries.
  
  Festival will process multi-sentence text, although its internal data structure assumes a single utterance and there is no representation of “sentences” within an utterance.
  
  So, for the purposes of this assignment you should restrict yourself to isolated single sentences.
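
To make the two steps in the reply above concrete, here is a minimal Python sketch. It is not Festival's implementation and not the method from J&M: the function names and the abbreviation list are hypothetical, and the sentence-boundary step uses a simple hand-written rule as a stand-in for the trained End-Of-Sentence classifier that J&M describe.

```python
import re

# Hypothetical abbreviation list, for illustration only; a real system
# would need a far larger list or a trained classifier.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}


def tokenize(text):
    """Step 1: hand-written rules using whitespace and punctuation."""
    tokens = []
    for chunk in text.split():
        # Keep known abbreviations whole so their period is not split off.
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Separate any trailing punctuation from the word itself.
        match = re.match(r"^(.*?)([.!?,;:]*)$", chunk)
        word, punct = match.group(1), match.group(2)
        if word:
            tokens.append(word)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


def split_sentences(tokens):
    """Step 2: decide which tokens end a sentence.

    Assumption: every standalone '.', '!' or '?' token ends a sentence.
    This is the decision a trained classifier would make in J&M's account.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenize("Dr. Smith arrived. He was late!")
sentences = split_sentences(tokens)
```

The rule in `split_sentences` breaks on cases such as an unlisted abbreviation followed by a capitalised word, which is the kind of ambiguity that makes sentence segmentation the harder of the two problems.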