The training data

Supervised machine learning starts with training data, labelled with the value of the predictee. You now need to decide what features (predictors) to extract.

Download the corpus of training sentences and think about what features you can extract from them. Hint: you only need to consider features that the corpus is already annotated with. Choose up to three features that you will use as predictors in your CART. The predictee is the presence or absence of a phrase break.

The provided data is POS tagged with a very simple set of tags – don’t worry if you don’t agree with the tag set or some of the words’ tags – just use the data as provided.

Use this sheet to prepare your training data. Label each data point (i.e., each word) with your chosen predictors. I’ve already filled in the predictee for you: breaks are associated with the word that they occur after. This means that end-of-sentence punctuation is not labelled with a phrase break.
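If it helps to see the labelling step concretely, here is a minimal sketch of how predictors might be derived for each word. It is only an illustration under stated assumptions: the toy sentence, the tag names (`CONJ`, `PRON`, `VERB`, `ADV`, `PUNC`), the `NONE` padding value, and the function name `label_predictors` are all hypothetical and are not taken from the provided corpus or sheet. The three predictors shown (the POS tag of the current, previous, and next token) are just one possible choice, not a recommended answer.

```python
# A toy POS-tagged sentence as (token, tag) pairs.
# The tokens and tag names are invented for illustration only;
# use the tags exactly as they appear in the provided data.
toy_sentence = [
    ("when", "CONJ"),
    ("it", "PRON"),
    ("rains", "VERB"),
    (",", "PUNC"),
    ("we", "PRON"),
    ("stay", "VERB"),
    ("inside", "ADV"),
    (".", "PUNC"),
]


def label_predictors(tagged_sentence, padding="NONE"):
    """Return one row per token with three example predictors.

    The predictors are the POS tag of the current token, the previous
    token, and the next token. `padding` (an assumed placeholder value)
    is used where there is no previous or next token.
    """
    rows = []
    last_index = len(tagged_sentence) - 1
    for i, (token, pos) in enumerate(tagged_sentence):
        prev_pos = tagged_sentence[i - 1][1] if i > 0 else padding
        next_pos = tagged_sentence[i + 1][1] if i < last_index else padding
        rows.append(
            {"token": token, "pos": pos, "prev_pos": prev_pos, "next_pos": next_pos}
        )
    return rows


rows = label_predictors(toy_sentence)
for row in rows:
    print(row)
```

Each printed row corresponds to one line of the sheet: the predictor values you would fill in next to the predictee that has already been provided for that word.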

Video to be added after the lecture…