The task

We want to predict the locations of phrase breaks from text alone. Our chosen method is a simple form of machine learning: a classification tree, learned using the CART (Classification And Regression Tree) method.

The method requires training data: text that has been manually annotated with phrase breaks. We must decide which features to extract from the text to use as predictors for the CART. We then run a training algorithm to learn the tree from the training data. After that, we can use the tree to label the locations of phrase breaks in previously unseen test data.
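
The whole pipeline (annotated data, feature extraction, training, labelling unseen text) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the course: the toy sentences, the part-of-speech tags, the choice of three predictors, and all function names are assumptions made up for the example. The tree learner is a deliberately simplified CART-style procedure that asks yes/no questions about feature values and picks the question giving the lowest Gini impurity. It has no stopping criterion or pruning, which a practical learner would need.

```python
from collections import Counter

# Invented toy training data. Each word is a tuple of
# (word, part-of-speech tag, punctuation following the word, break after word?).
TRAIN = [
    [("when", "ADV", "", False), ("it", "PRON", "", False),
     ("rains", "VERB", ",", True), ("we", "PRON", "", False),
     ("stay", "VERB", "", False), ("inside", "ADV", ".", True)],
    [("the", "DET", "", False), ("old", "ADJ", "", False),
     ("dog", "NOUN", "", False), ("slept", "VERB", ",", True),
     ("and", "CONJ", "", False), ("the", "DET", "", False),
     ("cat", "NOUN", "", False), ("watched", "VERB", ".", True)],
]

def features(sentence, i):
    """Predictors for the juncture after word i (an assumed feature set):
    this word's POS, the next word's POS, and any following punctuation."""
    next_pos = sentence[i + 1][1] if i + 1 < len(sentence) else "END"
    return (sentence[i][1], next_pos, sentence[i][2])

def gini(labels):
    """Gini impurity of a collection of labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the question 'is feature f equal to value v?' that gives the
    lowest weighted impurity. Return None if no question improves on it."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for v in {r[f] for r in rows}:
            yes = [l for r, l in zip(rows, labels) if r[f] == v]
            no = [l for r, l in zip(rows, labels) if r[f] != v]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (f, v), score
    return best

def grow(rows, labels):
    """Recursively grow a tree. A leaf is a label (True or False);
    an internal node is a tuple (feature, value, yes_subtree, no_subtree)."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, v = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
    no = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
    return (f, v, grow(*zip(*yes)), grow(*zip(*no)))

def classify(tree, row):
    """Answer the questions from the root down until a leaf is reached."""
    while isinstance(tree, tuple):
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree

# Training: extract predictors and labels at every word juncture.
rows = [features(s, i) for s in TRAIN for i in range(len(s))]
labels = [word[3] for s in TRAIN for word in s]
tree = grow(rows, labels)

# Testing: an unseen sentence with no break labels.
TEST = [("if", "CONJ", ""), ("you", "PRON", ""), ("call", "VERB", ","),
        ("I", "PRON", ""), ("will", "AUX", ""), ("answer", "VERB", ".")]
predicted = [classify(tree, features(TEST, i)) for i in range(len(TEST))]
print(predicted)
```

On this tiny data set the learned tree asks a single question about punctuation, because in these invented sentences punctuation alone separates breaks from non-breaks. Real annotated data contains breaks without punctuation (and punctuation without breaks), which is where the other predictors start to matter.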

Video to be added after the lecture…