Learning decision trees

Having defined the model, we now need an algorithm to estimate it from data. For a Decision Tree, this is a simple greedy algorithm.

Let's start with a reminder.
We need to be very clear that a Decision Tree is the model.
A Decision Tree is a Decision Tree, regardless of whether it was handcrafted or learned from data.
We must always make a careful distinction in machine learning between the model (here, it's the Decision Tree) and the algorithm used to learn that model.
The previous topic introduced the model and now we're going to introduce the algorithm.
As an example we'll use G2P.
This video is going to keep things fairly short and simple.
To really understand Decision Trees, you'll need to do some worked examples.
In my previous attempts to create a G2P model for English, I started from the knowledge in the mind of a native speaker.
In that case, it was myself.
That speaker expressed the knowledge as a set of rules.
We then saw how we could compactly write those rules as a Decision Tree.
The Decision Tree is better than the rules in the sense that it will generalise to any case that wasn't covered by the rules.
But why ask our Native Speaker to write down rules at all?
The rules might not cover all cases.
We might have difficulty in deciding the relative importance of the rules and getting them in the right order.
We're going to move away from rules to a Decision Tree, which is effectively a way to arrange rules (which we'll now call 'questions') whose answer is either 'yes' or 'no' ('true' or 'false') in the best order, with the added advantage of covering all cases.
But instead of going from the mind of a native speaker to rules, and from rules to a tree, let's cut out those rules and work straight from data.
So what would that data be for G2P?
Well, it will be a list of words: spellings and their correct pronunciations.
That's still knowledge distilled from the mind of a native speaker.
So we'll replace imperfect rules with reliable data points.
This data is, in fact, just a pronunciation dictionary.
So now the question is: how to learn a Decision Tree from these data?
To keep my example simple, I'm just going to use words with a single letter 'c' that is pronounced as 1 of 4 phonemes.
Here's some of my data.
The thing we're predicting (the predictee) is the phoneme that the letter 'c' will be pronounced as.
As predictors, I'm going to choose a window of letters centred around the letter 'c'.
This choice to use a window of letters is another place where knowledge is being incorporated into the model.
This time, it's knowledge from me, the 'Machine Learning Engineer'!
Machine learning isn't just about data and model; it's about all the choices that we have to make along the way.
Now we simply extract from this raw data the predictors and the predictee.
This process - going from raw data to the set of data points that we're going to do machine learning with - involves some decisions.
Here, the decision was to use a window of letters: +/- 2 letters around the letter 'c'.
All the predictors have to have a value all of the time.
That involved padding in cases where we're at the beginning or the end of a word.
Padding at the start and end of a sequence is often necessary when we use a sliding window, just like in signal processing.
On the right are our data points.
Each of these is an individual data point, of predictors and the correct value for the predictee.
Now we can discard the raw data.
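As a concrete sketch of that extraction step, here is how it might look in Python. The function name, the ASCII phoneme labels ('tS' standing in for /tʃ/) and the toy words are my own illustrative assumptions; the real data points were extracted from a full pronunciation dictionary.

```python
# Sketch of turning raw dictionary entries into (predictors, predictee) data points.
# Predictors: the letters at positions -2, -1, +1, +2 relative to the letter 'c',
# padded with '_' at word boundaries. Predictee: the phoneme that 'c' is pronounced as.

def extract_data_points(word, phoneme_for_c, window=2, pad="_"):
    padded = pad * window + word + pad * window
    offsets = list(range(-window, 0)) + list(range(1, window + 1))
    points = []
    for i, letter in enumerate(word):
        if letter == "c":
            j = i + window                                  # position of this 'c' in the padded spelling
            predictors = tuple(padded[j + o] for o in offsets)
            points.append((predictors, phoneme_for_c))
    return points

# Toy examples (the phoneme labels are assumptions for the demo; 'tS' stands for /tʃ/):
print(extract_data_points("cat", "k"))    # [(('_', '_', 'a', 't'), 'k')]
print(extract_data_points("ace", "s"))    # [(('_', 'a', 'e', '_'), 's')]
print(extract_data_points("chip", "tS"))  # [(('_', '_', 'h', 'i'), 'tS')]
```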
Here's some of my data set.
There's lots and lots more of that.
In fact, from the dictionary I was extracting this data from, I obtained a total of around 16,500 data points.
From those data, we are going to learn a Decision Tree.
I'll call this the 'training data'.
The words 'training' and 'learning' are both used to refer to the procedure for automatically estimating a model from data.
In the case of Decision Trees, the term 'learning' is the usual way to describe this procedure.
Machine learning means learning from data.
What are we trying to achieve?
Well, we're trying to generalise from all of these data points that we've seen before, to a new data point.
Given the letter 'c' in this particular context, we have to make a prediction for the phoneme to pronounce.
A typical machine learning technique, then, is to learn a model from the data in a learning or training phase, and to use that to make the predictions.
From the data, we learn a model.
The model is what we then use to do inference, which is the process of taking the predictors and predicting the predictee in a previously-unseen data point - a new data point.
Sometimes that phase is called the 'testing' phase.
Here, the model is going to be a Decision Tree.
We're talking about learning that Decision Tree from the data.
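To make that model/inference distinction concrete, here's a minimal sketch of a Decision Tree as a data structure, with inference being nothing more than walking the tree by answering questions. The Node class and predict function are illustrative, not the actual implementation behind this video.

```python
# A Decision Tree node: either a leaf (question is None) or an internal node that
# asks one yes/no question about the predictors and has a 'yes' and a 'no' subtree.

class Node:
    def __init__(self, prediction, question=None, yes=None, no=None):
        self.prediction = prediction  # most common predictee value among data reaching this node
        self.question = question      # (predictor_index, letter), or None at a leaf
        self.yes = yes
        self.no = no

def predict(node, predictors):
    """Inference: ask each node's question and follow the yes/no branch until a leaf."""
    while node.question is not None:
        index, letter = node.question
        node = node.yes if predictors[index] == letter else node.no
    return node.prediction

# A tiny hand-built tree, just to show inference working ('tS' stands for /tʃ/):
root = Node(prediction="k",
            question=(2, "h"),                   # index 2 = the next letter (+1)
            yes=Node(prediction="tS"),
            no=Node(prediction="k"))
print(predict(root, ("_", "_", "h", "i")))       # -> 'tS'
print(predict(root, ("_", "_", "a", "t")))       # -> 'k'
```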
Back to the training data.
I'm just showing you a sample.
In total, there are around 16,500 training data points.
Let's start by considering the probability distribution of the predictee.
It's a phoneme and it takes 1 of 4 possible values.
There's the distribution.
This histogram is the probability distribution over the phoneme value, which is on this axis.
We can think of these bars as the data points.
So think of this bar as made up of all of those /k/ data points: this one, this one, this one, this one, and so on.
They're all in that 'bucket' there.
Now, if you had to predict how to pronounce the letter 'c' and I didn't tell you anything more than that, I think it's pretty obvious that you'd predict /k/, because that's the most frequent value in the training data.
You'd be correct a lot of the time (most of the time, in fact).
So we can make a prediction already.
But of course, we know more than that.
We know the values of the predictors: the surrounding letters, +/- 2 letters.
We should make use of that information by asking questions of it: by querying it.
Let's imagine the ideal Decision Tree that we could learn from our data.
This distribution is of the entire training data.
We find a question about the predictors that beautifully splits this data like this.
Some question - to be found - that splits all the /k/ and /s/ into one partition, and all the /ʃ/ and /tʃ/ into the other partition.
If the answer to that question was 'yes', we'd continue to predict /k/, because that's still the most frequent value here.
But if the answer to this question was 'no', we'd predict /tʃ/.
Now we're going to be correct more of the time than just predicting /k/.
Then we continue.
We find other questions that perfectly split these two partitions of the data.
We find a question here that splits this beautifully into all the /k/ and all the /s/.
We find another question over here that beautifully splits that partition into two smaller partitions of /ʃ/ and /tʃ/.
Now we've got a perfect Decision Tree.
This Decision Tree will correctly predict the pronunciation of the letter 'c' as 1 of these 4 phonemes, and it will be 100% correct: it'll be right all of the time, at least for the data points in the training data.
So what has the tree done to the distribution of predictee values?
It's made that distribution more and more certain.
In this perfect tree, at the leaves, we become completely certain that, if we reach this leaf, it's always /tʃ/.
100% of the data points are /tʃ/ and none of the other values occur.
This distribution has zero entropy: there is perfect certainty.
As we partition the data going down the tree, we increased certainty at every level.
In other words, we reduced entropy.
Each question that we asked about the data partitioned it into two partitions.
That reduced the entropy, compared to before the split.
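Entropy here is the standard measure from information theory. A minimal sketch of computing it for a set of predictee values (this is the textbook formula, nothing specific to this data set):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of the empirical distribution over predictee values.
    Equivalent to -sum(p * log2(p)) over the observed values."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy(["k", "k", "k", "s"]))       # some uncertainty (about 0.81 bits)
print(entropy(["tS", "tS", "tS", "tS"]))   # 0.0 -> a 'pure' partition: perfect certainty
```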
This particular tree might not be possible.
There might simply not be any question that splits the data in that way.
The real tree won't be quite so perfect.
Let's find out!
Back to the training data.
We already said that, if we wanted to make a prediction right away without asking any questions about the predictors, we'll just pick the most common value of the predictee found in the training data.
We can compute that by looking at this distribution and picking the highest bar.
Well, we've already got a Decision Tree.
It's only got a root node.
The root node says 'Don't bother asking any questions! Just predict /k/.'
For my training data set, that prediction will be correct about 68% of the time.
That's not bad, for not asking any questions, but we can do better.
We'll put that at the root of a tree.
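That one-node tree is just a majority-class baseline. A sketch, using a made-up handful of values to stand in for the real ~16,500 training points ('S' stands for /ʃ/, 'tS' for /tʃ/):

```python
from collections import Counter

# Toy stand-in for the predictee values of the training data
# (in the real data, about 68% of the points were /k/).
phonemes = ["k", "k", "k", "s", "k", "tS", "k", "s", "k", "S"]

counts = Counter(phonemes)
prediction, count = counts.most_common(1)[0]
print(prediction)                 # 'k' -> what the root node predicts
print(count / len(phonemes))      # 0.6 -> fraction of training points it gets right
```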
We're now going to try partitioning this training data into two partitions that have lower entropy than this distribution here.
The questions, of course, have to be about the predictors: about the letters.
We have 4 predictors and each of them can take 1 of 27 possible values: the 26 letters, or the underscore used for padding.
4 x 27 = 108.
So there are 108 possible questions we could ask.
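As a sketch of enumerating those candidate questions, assuming the predictors are a tuple of four letters as in the extraction sketch above:

```python
import string

# 4 predictor positions (-2, -1, +1, +2 relative to 'c') and 27 possible letter values
# (a-z plus '_' for padding) give 4 x 27 = 108 candidate yes/no questions.
positions = range(4)                                   # indices into the predictor tuple
values = list(string.ascii_lowercase) + ["_"]

questions = [(p, v) for p in positions for v in values]
print(len(questions))                                  # 108
```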
Let's try one.
Let's ask 'Is the next letter 'e' ?'
If it is, some of the data goes this way.
We could look at the distribution of that data: it looks like this.
We predict /s/ as the pronunciation with quite a lot of certainty.
If the next letter isn't 'e', all the rest of the data goes this way.
We can look at its distribution and we'll continue to pronounce this letter 'c' as /k/.
Just visually, we can see that this distribution obviously has fairly low entropy: it's dominated by a single value.
This one has a little bit lower entropy than the distribution over all of the training data.
So the combined, size-weighted entropy of these two partitions is lower than the entropy of the original data.
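Here's a sketch of scoring a candidate question by the size-weighted entropy of the two partitions it creates, reusing the entropy function sketched earlier; the toy data points are illustrative:

```python
def split_entropy(data_points, question):
    """Size-weighted total entropy of the two partitions created by one yes/no question.
    data_points: list of (predictors, phoneme); question: (position, letter)."""
    position, letter = question
    yes = [ph for pred, ph in data_points if pred[position] == letter]
    no = [ph for pred, ph in data_points if pred[position] != letter]
    total = len(data_points)

    def weighted(part):
        # an empty partition contributes nothing (and avoids dividing by zero)
        return (len(part) / total) * entropy(part) if part else 0.0

    return weighted(yes) + weighted(no)

# Toy data points in the (predictors, phoneme) format from the extraction sketch:
data = [(("_", "_", "a", "t"), "k"), (("_", "a", "e", "_"), "s"),
        (("_", "_", "h", "i"), "tS"), (("_", "i", "e", "_"), "s")]

before = entropy([ph for _, ph in data])
after = split_entropy(data, (2, "e"))        # "Is the next letter 'e'?"
print(before - after)                        # 1.0 -> this question reduces entropy by 1 bit
```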
But that was just one possible question.
We need to try all of the other 107 questions available to us.
Back up to the root, where we're predicting /k/ all the time.
I did try all 108 questions for this data (which is real data) and I found that this question 'Is the next letter 'h' ?' was the one that gave the best split.
It reduced entropy by more than any other question.
So I put it permanently into my tree.
I'm now permanently partitioning the training data into these two partitions.
The most common value on the left is /tʃ/ and the most common value on the right is /k/.
This is a complete Decision Tree, and we could stop here or we could continue.
To continue, we simply take this and recurse with the same procedure.
We take this as the training data.
We try partitioning with every possible question and seeing which question can further reduce the entropy.
Then we do the same on the right node.
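Putting those pieces together, here's a sketch of the whole greedy, recursive procedure. It reuses the Node class, entropy, and split_entropy sketches from above, and uses a deliberately crude stopping rule; it illustrates the idea rather than reproducing the exact procedure used to grow the tree in the video:

```python
from collections import Counter

def grow_tree(data_points, questions, min_points=2):
    """Greedily grow a Decision Tree: pick the question that reduces entropy the most,
    partition the data on it, then recurse on each partition."""
    phonemes = [ph for _, ph in data_points]
    prediction = Counter(phonemes).most_common(1)[0][0]

    # Crude stopping criteria: too few data points left, or the node is already pure.
    if len(data_points) < min_points or entropy(phonemes) == 0.0:
        return Node(prediction)

    # Try every candidate question; keep the one giving the lowest split entropy.
    best = min(questions, key=lambda q: split_entropy(data_points, q))
    if entropy(phonemes) - split_entropy(data_points, best) <= 0.0:
        return Node(prediction)                        # no question reduces entropy: stop

    position, letter = best
    yes_data = [d for d in data_points if d[0][position] == letter]
    no_data = [d for d in data_points if d[0][position] != letter]
    return Node(prediction, question=best,
                yes=grow_tree(yes_data, questions, min_points),
                no=grow_tree(no_data, questions, min_points))

# Grow a tree from the toy data and the 108 candidate questions defined above:
tree = grow_tree(data, questions)
print(predict(tree, ("_", "_", "h", "i")))             # -> 'tS'
```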
Here's what actually happens for this real data set.
I'm just going to now show the best question in each case.
That was found by trying all of the possible ones and measuring the entropy reduction.
I split this left node by the question 'Is the previous letter 's' ?'
That split the data like this.
I labelled the leaves with /ʃ/ and /tʃ/.
On the right, I asked the question 'Is the next letter 'e' ?'
That was the best one, chosen out of all the possible questions I could've asked.
If it was 'yes', it's /s/.
If it's no, it's /k/.
Then I kept going a bit.
I tried partitioning here.
I found that, if I asked 'Is the next letter 'i' ?', that gave this split into /s/ and /k/ and I stopped there.
I'm going to stop growing this tree now because I'm running out of space to draw it.
But in reality we need a better reason to stop!
We need a stopping criterion.
That criterion could be that we can't reduce the entropy by very much, or that there are too few data points left.
Remember that every time we split, we partition the data into smaller and smaller sets and eventually we'll run out of data.
Or we could simply stop when we've reached some certain depth of the tree.
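In the growing sketch above, such a stopping criterion would be a single check made before each split; for example (the thresholds are illustrative, and tracking the depth would mean passing it down through the recursion):

```python
def should_stop(data_points, best_entropy_reduction, depth,
                min_points=50, min_reduction=0.01, max_depth=6):
    """Illustrative stopping criterion combining the three ideas just described."""
    return (len(data_points) < min_points              # too few data points left to split reliably
            or best_entropy_reduction < min_reduction  # the best question barely reduces entropy
            or depth >= max_depth)                     # the tree has reached a chosen maximum depth
```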
We've reached the end of that sequence on Decision Trees.
Let's repeat this very important distinction between the model and the algorithm.
That's why the material is split across two topics.
In the topic 'Decision Tree', we introduced the model.
In this topic, 'Learning Decision Trees', we introduced the algorithm that is used to learn that model from data.
That's the most important message in this video in fact: this distinction in machine learning between model and algorithm.
It's something people confuse very easily.
So we've now got a general-purpose piece of machine learning - the Decision Tree - which can do classification or regression.
It asks questions about predictors.
In all of my examples, the predictors were categorical things like letters, but they don't have to be.
They could be continuous values that we could ask questions about: 'Are they above or below a certain value?', for example.
This is a very general-purpose piece of machine learning.
It can take predictors of any type and it can predict a predictee that's either categorical or continuous.
We'll find that that could be applied in lots of different places in speech synthesis, from predicting pronunciation to prosody.
