
Another example to work through: phrase break prediction

We will be doing this as an exercise in class, in the Module 4 main lecture. Feel free to try it on your own in advance, or you can wait and do it in class.

Now that you know how to train and then use a CART for letter-to-sound prediction, you can build your own CART for another task. Let’s try to do that for phrase break prediction: this is a classification task because we are predicting a discrete value (phrase break or no phrase break). If you can do this exercise, you should be able to see that a CART can be applied to any classification or regression task.

The raw training data

The method requires training data, manually annotated with phrase breaks. We must decide what features to extract from the text, to use as predictors for the CART. Then, we follow a training algorithm to learn the tree from the training data. After that, we can label the locations of phrase breaks on previously-unseen test data.

Download the corpus of training sentences and think about what features you can extract from them. Hint: you only need to consider features that the corpus is already annotated with. Choose up to three features that you will use as predictors in your CART. The predictee is the presence or absence of a phrase break.

The provided data is POS tagged with a very simple set of tags. Don’t worry if you don’t agree with the tag set or with some of the words’ tags; just use the data as provided.
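As a sketch of what feature extraction might look like in code, here is a small Python function. The input format and the feature names (pos, next_pos, is_last) are illustrative assumptions, not the actual format of the provided corpus:

```python
# Sketch of feature extraction for phrase break prediction.
# Assumption: each sentence is a list of (word, pos_tag) pairs; the real
# corpus file format may differ.

def extract_features(sentence):
    """For each word, extract simple predictors: the word's own POS tag,
    the next word's POS tag, and whether the word ends the sentence."""
    features = []
    for i, (word, pos) in enumerate(sentence):
        next_pos = sentence[i + 1][1] if i + 1 < len(sentence) else "NONE"
        features.append({
            "pos": pos,
            "next_pos": next_pos,
            "is_last": i == len(sentence) - 1,
        })
    return features

sentence = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")]
print(extract_features(sentence))
```

Whatever predictors you choose, the important point is that they must be extractable from the text alone, because that is all you will have at prediction time.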

Prepare the training data by extracting features

Use this sheet to prepare your training data. Label each data point (i.e., each word) with your chosen predictors. I’ve already filled in the predictee for you – breaks are associated with the word that they occur after. This means that end-of-sentence punctuation is not labelled with a phrase break.

Candidate questions for the CART

Look across your predictors and write down all the possible questions you might ask about them. We’ll keep it simple and only ask questions of the form “Does Predictor X have value A?”, rather than trying to form set-membership questions such as “Is the value of Predictor X in the set {A, B, C}?”.
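Enumerating the candidate questions can be sketched in Python as follows; the predictor names and values in the toy data are invented for illustration:

```python
# Sketch: enumerate every "Does predictor X have value A?" question that
# could be asked about the training data.

def candidate_questions(data):
    """data: a list of (predictors, predictee) pairs, where predictors is
    a dict. Returns every (predictor, value) combination seen in the data,
    each corresponding to one candidate question."""
    questions = set()
    for predictors, _ in data:
        for name, value in predictors.items():
            questions.add((name, value))
    return sorted(questions)

# Toy data: predictor names and values are invented for this example.
data = [
    ({"pos": "NOUN", "next_pos": "VERB"}, "NO BREAK"),
    ({"pos": "VERB", "next_pos": "NONE"}, "BREAK"),
]
print(candidate_questions(data))
```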

Here’s a sheet where you can write down the questions.

Build (train) your CART – start with the root node

Place all your data at the root node and compute the entropy. This is a measure of how predictable the value of the predictee is at this point. I’ll start you off:

The value “BREAK” occurs 12 times, so its probability is 12/34, which is about 0.35.

The value “NO BREAK” occurs 22 times, so its probability is 22/34, which is about 0.65.

Now compute the entropy as H = −Σ p log₂ p, summing over the two values of the predictee. Make sure you know how to do this for yourself first; then, to save time, you could use this entropy calculator.
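As a check on the arithmetic, here is a minimal entropy calculation in Python; the counts 12 and 22 are the BREAK and NO BREAK counts given above:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# At the root node: 12 x BREAK and 22 x NO BREAK.
H = entropy([12, 22])
print(round(H, 3))  # about 0.937 bits
```

A value close to 1 bit tells us the predictee is still quite unpredictable at the root: we have not asked any questions yet.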

For every possible question in your list, split the data into the two corresponding partitions and compute the entropy of each partition. Then compute the total entropy after the split: the sum of the two partition entropies, each weighted by the fraction of data points that fell into that partition.
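The weighted sum described above can be sketched as follows; the two example partitions are invented for illustration:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def total_entropy(partition_a, partition_b):
    """Weighted sum of the entropies of two partitions, where each
    partition is a list of predictee labels. Each partition's entropy is
    weighted by the fraction of data points it contains."""
    n = len(partition_a) + len(partition_b)
    result = 0.0
    for part in (partition_a, partition_b):
        counts = [part.count(v) for v in set(part)]
        result += (len(part) / n) * entropy(counts)
    return result

# Illustrative split: "yes" answers on the left, "no" answers on the right.
left = ["BREAK", "BREAK", "NO BREAK"]
right = ["NO BREAK", "NO BREAK", "NO BREAK"]
print(round(total_entropy(left, right), 3))
```

The best question is the one with the lowest total entropy, i.e. the one that makes the predictee most predictable in the resulting partitions.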

It would be tedious to try all the questions manually, so you can take a shortcut: use your intuition and try only the few questions that you think will work best.

Pick the best question (the one giving the lowest total entropy) and place it into the tree. Then permanently partition the training data using that question.

Recurse

Now simply recurse. In other words, repeat exactly what you did at the root node, for each of the two children in turn, using whatever subset of the training data you find at that child node. Then continue growing the tree. You’ll need to set yourself a stopping criterion: for this toy data set, I suggest that you stop when all data points at a leaf have the same value for the predictee (i.e., zero entropy).
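Putting the pieces together, the whole greedy training procedure can be sketched in Python under the same simplifications as this exercise: single-value questions only, and growth stops at zero entropy. The toy data and predictor names are invented:

```python
# Sketch of greedy CART training: pick the best "Does X have value A?"
# question at each node, split, and recurse until nodes are pure.
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of predictee labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def grow(data):
    """data: list of (predictors, predictee) pairs. Returns a nested dict
    tree, or a leaf label when the node has zero entropy."""
    labels = [y for _, y in data]
    if entropy(labels) == 0.0:  # stopping criterion: all labels identical
        return labels[0]
    # Try every candidate question and keep the best (lowest total entropy).
    questions = {(k, v) for x, _ in data for k, v in x.items()}
    best = None
    for name, value in questions:
        yes = [(x, y) for x, y in data if x[name] == value]
        no = [(x, y) for x, y in data if x[name] != value]
        if not yes or not no:  # skip questions that do not split the data
            continue
        h = (len(yes) * entropy([y for _, y in yes])
             + len(no) * entropy([y for _, y in no])) / len(data)
        if best is None or h < best[0]:
            best = (h, name, value, yes, no)
    _, name, value, yes, no = best
    return {"question": (name, value), "yes": grow(yes), "no": grow(no)}

# Toy training data, invented for illustration.
data = [
    ({"pos": "NOUN", "punct": True}, "BREAK"),
    ({"pos": "NOUN", "punct": False}, "NO BREAK"),
    ({"pos": "DET", "punct": False}, "NO BREAK"),
]
tree = grow(data)
print(tree)
```

Note that for real data a zero-entropy stopping criterion would overfit badly; stopping rules such as a minimum number of data points per leaf are used instead.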

Make predictions (testing phase)

Now use your CART to predict where phrase breaks occur in the test data. You’ll first need to extract the same features (predictors) as for the training data, of course.

How well does your CART work?

After you have predicted labels for the test data, compare your predictions to the correct labels and compute the accuracy. How often does your tree get it right?
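Scoring is just a comparison of two label sequences; the example labels below are invented:

```python
# Sketch of computing classification accuracy: the fraction of test data
# points for which the predicted label matches the reference label.

def accuracy(predicted, reference):
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

predicted = ["BREAK", "NO BREAK", "NO BREAK", "BREAK"]
reference = ["BREAK", "NO BREAK", "BREAK", "BREAK"]
print(accuracy(predicted, reference))  # 0.75
```

Since NO BREAK is much more common than BREAK, also ask yourself what accuracy a trivial classifier that always predicts NO BREAK would achieve; your tree should beat that baseline.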