CART: stopping criterion
October 9, 2016 at 11:11 #5268
In the CART video that uses sweets, you end up with one leaf that has a distribution of labels (0.5 orange, 0.5 lime) rather than a single label. You say that you could continue splitting the tree, but that the stopping criterion prevents you from doing that.
What’s the stopping criterion (e.g., ‘stop after n splits’, ‘stop after n examples have been sorted’)? How would having a leaf with an evenly split distribution help sort all the unlabelled samples? Do you need the criterion in cases with huge sets of examples, where having no specified stopping point could result in an unmanageable number of splits?
October 9, 2016 at 13:27 #5271
The video is a little informal, and you spotted that I didn’t actually specify what stopping criterion I used.
What’s the stopping criterion?
Several different stopping criteria might have prevented me from splitting any further at that node (a code sketch of these checks follows the list). For example:
- the number of data points at this node fell below a fixed threshold
- none of the remaining questions gave a useful split of the data (i.e., none of them could decrease the entropy by much)
- any remaining questions that did actually reduce entropy also resulted in a split in which one of the partitions had a very small number of data points
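To make those three criteria concrete, here is a minimal sketch of how the checks might look in code. The function names, the thresholds (min_samples, min_gain, min_partition_size) and the data representation are all illustrative assumptions of mine, not the actual code behind the video:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def should_stop(labels, candidate_splits,
                min_samples=10, min_gain=0.01, min_partition_size=2):
    """True if every stopping criterion says to stop splitting this node.

    candidate_splits: one (left_labels, right_labels) pair per remaining question.
    All thresholds here are invented for illustration.
    """
    # Criterion 1: too few data points remain at this node.
    if len(labels) < min_samples:
        return True

    parent = entropy(labels)
    n = len(labels)
    for left, right in candidate_splits:
        # Criterion 3: reject splits that leave one partition nearly empty.
        if min(len(left), len(right)) < min_partition_size:
            continue
        # Weighted entropy of the data after this split.
        child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        # Criterion 2: does this question decrease entropy enough to be useful?
        if parent - child >= min_gain:
            return False  # at least one question gives a useful split
    return True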
Can you explain the reasoning behind each of these different stopping criteria? Can you think of any other stopping criteria?
How would having a leaf with an evenly split distribution help sort all the unlabelled samples?
The reasoning here is that unbalanced splits mean that the selected question applies to only a small number of data points, and so is less likely to also apply to the test data. Balanced splits result from questions that apply to “around half” of the data points, and so we can be much more confident that these questions will generalise to unseen test data points.
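To see why, here is a toy example (the sweet counts and both questions are invented for illustration). Both questions split the same 100 sweets, but the unbalanced one perfectly separates only 4 of them:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# 100 sweets: 50 orange, 50 lime, so the entropy before splitting is 1 bit.

# A balanced question: applies to about half of the data points.
balanced = (["orange"] * 45 + ["lime"] * 5, ["orange"] * 5 + ["lime"] * 45)

# An unbalanced question: perfectly separates just 4 data points.
unbalanced = (["orange"] * 4, ["orange"] * 46 + ["lime"] * 50)

for name, (left, right) in [("balanced", balanced), ("unbalanced", unbalanced)]:
    n = len(left) + len(right)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    print(f"{name:>10}: entropy falls from 1.00 to {after:.2f}; "
          f"smaller partition holds {min(len(left), len(right))} data points")
```

The balanced question reduces the entropy far more (1.00 to 0.47, versus 1.00 to 0.96) and is supported by far more evidence, so it is much more likely to generalise.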
Let’s also change your terminology: instead of “sort all the unlabelled samples” you should say “make accurate predictions of the labels of unseen test samples”.
Do you need the criterion in cases with huge sets of examples, where having no specified stopping point could result in an unmanageable number of splits?
It’s not really a problem to have a deep tree, so the number of splits can’t really become “unmanageable”.
Can you write down a formula that calculates the depth of the tree (average number of questions/splits/nodes between root and leaves), as a function of the number of leaves?
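As a hint for the easy case only: assume a perfectly balanced binary tree, so every leaf sits at the same depth and the number of leaves doubles at each level. A quick sketch:

```python
from math import log2

# In a perfectly balanced binary tree, leaves = 2 ** depth,
# so depth = log2(leaves): depth grows only logarithmically.
for leaves in [2, 8, 64, 1024, 1_000_000]:
    print(f"{leaves:>9} leaves -> about {log2(leaves):.0f} questions from root to leaf")
```

Even a million leaves needs only about 20 questions per path from root to leaf, which is why a deep tree is rarely a practical problem. The general case, where the tree need not be balanced, is left as the exercise above.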