CART: hand-labelling the training data

This topic has 1 reply, 2 voices, and was last updated 9 years, 7 months ago by Simon.

Author

Posts
- October 13, 2015 at 18:40 #294
  Cecilia C
  Student
  The first part of the video on classification and regression trees (CART) is about hand-labelling data. Could we have an example of how a text is actually hand-labelled? Is every word of a given text examined by a human and labelled? And how many questions per word does this process generally need?
- October 14, 2015 at 11:23 #310
  Simon
  Professor
  Here are some examples of data that must be hand-labelled before we can apply machine learning (e.g., training a classification tree):
  
  1. letter-to-sound
  
  The hand-labelled data consists of words and their pronunciations, such as this (extracted from cmulex):
```
...
editing   eh1 d ax t ih0 ng
edition   ax d ih1 sh ax n
editions  ih0 d ih1 sh ax n z
editor    eh1 d ax t er0
editorial eh1 d ax t ao1 r iy0 ax l
...
```
  which is in fact just the pronunciation dictionary that we will already have created by hand. The lexicon may also provide a syllabification of the phoneme string. It does not specify the alignment between letters and phonemes.
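  
  As a rough illustration of what the learning algorithm receives, here is a minimal Python sketch that reads the entries above into (word, phoneme list) pairs. The name `parse_lexicon` and the whitespace-separated layout are assumptions made for this example, not the actual lexicon tooling. Printing the letter and phoneme counts shows why the alignment is not given for free: the two sequences usually differ in length.

```python
# Hypothetical helper for illustration only; not part of any real lexicon toolkit.
# Assumes one entry per line: the word, then its phonemes, separated by whitespace.
LEXICON_TEXT = """\
editing   eh1 d ax t ih0 ng
edition   ax d ih1 sh ax n
editions  ih0 d ih1 sh ax n z
editor    eh1 d ax t er0
editorial eh1 d ax t ao1 r iy0 ax l
"""


def parse_lexicon(text):
    """Return a dict mapping each word to its list of phoneme symbols."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        entries[word] = phones
    return entries


lexicon = parse_lexicon(LEXICON_TEXT)
for word, phones in lexicon.items():
    # Letter and phoneme counts rarely match, so the letter-to-phoneme
    # alignment has to be worked out separately; the lexicon does not give it.
    print(f"{word}: {len(word)} letters, {len(phones)} phonemes")
```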
  
  2. phrase-break prediction
  
  We will hand-label the phrase breaks in a set of hundreds or thousands of recorded utterances. Where possible, we will use existing data that some kind person has already labelled, such as the Boston University Radio News corpus.
  
  When you ask how many questions are needed per word, I think you are referring to how we choose the predictors for training a CART. We choose them using expert knowledge, and it’s OK to have a large set of predictors because the CART training procedure will only select the useful ones.
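  
  To see why a large predictor set is harmless, here is a minimal sketch of how one question might be chosen at the root of a tree for phrase-break prediction. Everything in it is an assumption made for illustration: the eight data points are invented (not taken from any corpus), the predictor names `punc`, `next_pos` and `parity` are hypothetical, and Gini impurity is just one common splitting criterion. A real CART trainer repeats this choice recursively on each subset of the data.

```python
from collections import Counter

# Invented toy data for illustration only (not from any real corpus).
# Each item: (predictor values for one word, hand label).
# "B" = phrase break after the word, "NB" = no break.
DATA = [
    ({"punc": "comma", "next_pos": "func",    "parity": "even"}, "B"),
    ({"punc": "comma", "next_pos": "content", "parity": "odd"},  "B"),
    ({"punc": "none",  "next_pos": "func",    "parity": "even"}, "B"),
    ({"punc": "none",  "next_pos": "content", "parity": "odd"},  "NB"),
    ({"punc": "none",  "next_pos": "content", "parity": "even"}, "NB"),
    ({"punc": "none",  "next_pos": "func",    "parity": "odd"},  "NB"),
    ({"punc": "comma", "next_pos": "func",    "parity": "odd"},  "B"),
    ({"punc": "none",  "next_pos": "content", "parity": "even"}, "NB"),
]


def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels agree)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def impurity_reduction(data, predictor):
    """How much splitting on this predictor reduces the weighted impurity."""
    parent = gini([label for _, label in data])
    groups = {}
    for features, label in data:
        groups.setdefault(features[predictor], []).append(label)
    weighted = sum(len(g) / len(data) * gini(g) for g in groups.values())
    return parent - weighted


# Score every candidate predictor; training would ask about the best one first.
gains = {name: impurity_reduction(DATA, name) for name in DATA[0][0]}
best = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: impurity reduction {gain:.3f}")
print("question asked at the root:", best)
```

  In this toy set the `parity` predictor (whether the word has an even number of letters) tells us nothing about the labels, so its impurity reduction is zero and it would never be chosen. That is the sense in which useless predictors cost us little apart from training time.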