The Effects of Training Set Size on Decision Tree Complexity
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:379-390, 1997.
This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances. This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John’s ROBUST-c4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared.