Decision tree

Because a decision tree only asks simple 'yes or no' questions about predictors, it works for both categorical and continuous predictors, or a mixture of both.

We've already found a need to make predictions.
We might encounter a word that is not in our dictionary and therefore need to predict its pronunciation from its spelling.
Or we might need to predict the duration of a phoneme, given its linguistic context.
Or predict where to place intonation events, and from those predict values for F0, as part of prosody prediction.
All of these tasks have one thing in common.
We know the values of certain properties already and from those we want to predict the value of one further property.
Those properties can be anything.
They could be discrete categories or continuous values.
So we need a general-purpose method that can make predictions in this situation.
A Decision Tree is one such general-purpose method.
I've been trying to write some G2P rules for English.
I'm using my own knowledge of the language to do that.
Here are some rules that I've come up with for the letter 'c'.
The rules read like this.
That's the letter I'm trying to predict the pronunciation of.
That's the previous letter.
That's the next letter.
This underscore means there wasn't a previous letter; in other words it's word-initial.
This isn't a complete set of rules by any means, but it's good enough for this example.
I know the spelling and I'm predicting the pronunciation.
Let's introduce some terminology to talk about that.
Since I believe spelling can be used to predict the pronunciation, the letters are called 'predictors'.
The thing I'm predicting is the 'predictee'.
There are several predictors, so let's give them names: 'C' for the current letter; 'P' for the previous letter; 'N' for the next letter.
Rules are a valid formalism, but it can be difficult to cover all possible cases.
For example, my rules don't yet tell me what I need to do for the case: the letter 'c' preceded by 'i' and followed by 'o'.
They simply don't tell me.
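To make the coverage problem concrete, here's a minimal Python sketch of rules as a lookup table on the predictors (P, C, N). The specific rules here are illustrative stand-ins, since the actual rules on the slide aren't reproduced in this transcript; the underscore marks 'no previous letter'.

    # Illustrative G2P rules for the letter 'c', keyed on (previous, current, next).
    # These are hypothetical examples, not the exact rules from the slide.
    RULES = {
        ('_', 'c', 'h'): 'tʃ',   # word-initial 'ch', as in 'chair'
        ('a', 'c', 'h'): 'ʃ',    # 'ch' after 'a', as in 'machine'
        ('_', 'c', 'a'): 'k',    # 'c' before 'a', as in 'cat'
        ('_', 'c', 'y'): 's',    # 'c' before 'y', as in 'cyan'
    }

    def apply_rules(prev, cur, nxt):
        """Return a phoneme if some rule matches, else None (case not covered)."""
        return RULES.get((prev, cur, nxt))

    # The case from the video that the rules fail to cover:
    print(apply_rules('i', 'c', 'o'))  # -> None: the rules simply don't tell us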
So let's change formalism to one that always covers all cases: a Decision Tree.
To introduce the Decision Tree as a concept, I'm going to use these rules as the starting point, but that's just to help you see the relationship between a list of rules and a Decision Tree.
Later, we'll see that a Decision Tree is much better learned directly from data.
In my little set of rules, the letter 'c' is pronounced as one of 4 possible phonemes.
'Which one?' is the thing we're trying to predict: the predictee.
In this formalism of a Decision Tree, we're going to ask questions about things we know: the predictors.
We're going to gradually narrow down the possible values of the predictee.
We're going to limit those questions to be simple 'yes or no' questions (or 'true / false' if you prefer).
For example, let's query the next letter as the predictor.
If the next letter is 'h', then the possible phonemes are narrowed down to just 2 of the 4.
We're now more certain about how to pronounce the letter 'c'.
Otherwise, it'll be one of the other phonemes.
We're still not certain of the exact phoneme but we've narrowed the possibilities down.
We've reduced our uncertainty.
We can carry on.
Try asking another question.
Let's try distinguishing /ʃ/ from /tʃ/.
This question asks whether we're at the beginning of a word.
If we are, the 'c' is pronounced as /tʃ/.
Otherwise, it's pronounced as /ʃ/.
We could also try to distinguish /k/ from /s/ with another question.
How about asking whether there's an 'o' after the 'c'?
If there is, it'll be pronounced as /k/, otherwise /s/.
This tree is complete and it will make predictions for all possible cases, even ones that were not in the rules.
That's because the answer to every question is either 'yes' or 'no'!
Between them, 'yes' and 'no' cover all possible cases.
This is then a complete Decision Tree.
We can use it to make predictions for any case.
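Here is a minimal Python sketch of exactly this hand-written tree, assuming the predictors (the next letter, and whether we're at the start of the word) are supplied directly:

    # The complete hand-written tree from the video, as nested yes/no questions.
    def pronounce_c(nxt, word_initial):
        if nxt == 'h':          # Is there an 'h' after the 'c'?
            if word_initial:    # Are we at the beginning of the word?
                return 'tʃ'
            return 'ʃ'
        if nxt == 'o':          # Is there an 'o' after the 'c'?
            return 'k'
        return 's'

Every 'if' has an 'else' path, which is why the tree covers all possible cases.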
So let's try it.
How do you pronounce the letter 'c' in the word 'coffee'?
Well, just follow the Decision Tree.
Start at the root.
Is there an 'h' after the 'c'?
No, there isn't.
Is there an 'o' after the 'c'?
Yes, there is.
It's pronounced as /k/.
Let's try another word.
Start at the root of the tree.
Is there an 'h' as the next letter?
No.
Is the next letter 'o'?
No.
We reach a leaf and we predict that the pronunciation is /s/: 'cyan'.
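In code, those two traversals look like this, using the pronounce_c sketch from above:

    # Following the tree for the two words in the video:
    print(pronounce_c(nxt='o', word_initial=True))   # 'coffee' -> k
    print(pronounce_c(nxt='y', word_initial=True))   # 'cyan'   -> s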
Try it for yourself.
Find some words containing the letter 'c' and see how often this tree makes the correct prediction.
Pause the video.
I hope you found lots of words where the tree makes the correct prediction.
The tree's not perfect.
It gets it right some of the time, but not all of the time.
This is just a handwritten Decision Tree.
It covers all possible cases.
It covers cases that our rules did not.
So it's already much better than those rules.
But in fact, the tree is nothing more than rules, arranged now into a tree structure rather than in a list.
In other words, the answer to one question determines which question we ask next.
The next question might be different along the two branches.
Decision Trees have many attractive properties.
We can, if we want, create them by hand, and that's what I just did.
But that's actually not the normal thing to do.
We're going to see in the next topic that we can learn them automatically from data.
In other words, we can employ machine learning to obtain a tree from a set of data points.
Decision Trees are interpretable.
We can draw a picture of them, like this.
That's really nice, and it will still be true even when the tree has been learned automatically from data, because a tree is just a tree.
Interpretability is a nice thing to have, and it's not always there in other machine learning methods.
Decision Trees are typically compact.
The tree in this picture only has to ask two questions to arrive at a value for the predictee.
If implemented in software, that will be extremely fast.
That will still be true even when the tree becomes much larger (and that will probably happen when we learn one from data), because the number of computations needed for inference - in other words, to make a prediction - is proportional to the number of questions asked.
That depends on the depth of the tree, not the width of the tree.
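As a sketch of why that's true, here's a generic node structure in Python; the names are my own, not from the video. Prediction follows a single root-to-leaf path, asking one question per level, so the cost depends only on depth:

    # A generic decision-tree node. Inference walks one path from root to leaf.
    class Node:
        def __init__(self, question=None, yes=None, no=None, prediction=None):
            self.question = question      # a function: predictors -> True/False
            self.yes, self.no = yes, no   # child nodes
            self.prediction = prediction  # set only at leaves

        def predict(self, predictors):
            node = self
            while node.prediction is None:            # one question per level
                node = node.yes if node.question(predictors) else node.no
            return node.prediction                    # cost ~ depth, not width

    # The hand-written 'c' tree expressed with this structure:
    leaf = lambda p: Node(prediction=p)
    c_tree = Node(lambda x: x['next'] == 'h',
                  yes=Node(lambda x: x['word_initial'], yes=leaf('tʃ'), no=leaf('ʃ')),
                  no=Node(lambda x: x['next'] == 'o', yes=leaf('k'), no=leaf('s')))
    print(c_tree.predict({'next': 'o', 'word_initial': True}))  # -> k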
We always draw Decision Trees this way so that decisions flow from top of the page to bottom of the page.
Real trees are this way up.
We're just drawing them upside down.
The type of Decision Tree that we've made is called a 'CART' (Classification And Regression Tree).
Now the tree can either do classification or regression, but not both at the same time.
So 'And' should really be 'Or'.
In a Classification Tree, the predictee is a categorical variable.
It takes on one of a fixed number of possible values, or classes.
In a Regression Tree, the predictee is a continuous value.
Here, I've made up a tree that predicts the height of an F0 event based on the stress level of a syllable.
This tree is doing regression.
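Reusing the Node and leaf helpers sketched earlier, a regression tree has the same structure; only the leaves change, now holding continuous values. The stress levels and Hz values here are invented for illustration; the video's actual tree is on the slide.

    # A sketch of a regression tree: the predictee is a continuous value
    # (an F0 target in Hz). All numbers below are made up for illustration.
    f0_tree = Node(lambda x: x['stress'] == 'primary',
                   yes=leaf(140.0),     # higher F0 event on primary stress
                   no=Node(lambda x: x['stress'] == 'secondary',
                           yes=leaf(120.0),
                           no=leaf(100.0)))
    print(f0_tree.predict({'stress': 'primary'}))  # -> 140.0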
Decision Trees, whether they are Classification Trees, or Regression Trees, can be learned from data.
So that's what we'll look at next.
It's time for some machine learning!
But before we do that, let's be very clear that a Decision Tree is a Decision Tree, regardless of whether it was hand-crafted or learned from data.
We won't be able to tell by looking at a tree how it was made.
So we must make a careful distinction between the model (here, the Decision Tree) and the algorithm used to learn that model.
This topic video was about the model and the next topic is about the algorithm.
That's a very important distinction in machine learning, that you must get clear in your mind: a distinction between the model and the algorithm used to learn the model.
