The Effects of Training Set Size on Decision Tree Complexity

Tim Oates, David Jensen
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:379-390, 1997.

Abstract

This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances. This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John’s ROBUST-c4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared.
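The paper's experiments use C4.5 with five pruning algorithms on 19 datasets; as a minimal sketch of the random data reduction baseline the abstract describes, the following uses scikit-learn's CART trees instead (the breast cancer dataset, the 50% subset fraction, and the ten repetitions are illustrative assumptions, not the paper's protocol).

# Sketch of the random data reduction baseline: compare a tree grown on
# the full training set against trees grown on randomly reduced subsets,
# tracking tree size (node count) and held-out accuracy.
# NOTE: dataset, subset fraction, and repeat count are assumptions for
# illustration; the paper itself uses C4.5 on 19 datasets.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def tree_stats(X_tr, y_tr):
    """Fit a tree and report its node count and test-set accuracy."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return clf.tree_.node_count, clf.score(X_test, y_test)

# Tree built on all available training instances.
full_size, full_acc = tree_stats(X_train, y_train)

# Random data reduction: discard half of the training instances at
# random, averaging over several draws to smooth out sampling noise.
rng = np.random.default_rng(0)
sizes, accs = [], []
for _ in range(10):
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    s, a = tree_stats(X_train[idx], y_train[idx])
    sizes.append(s)
    accs.append(a)

print(f"full data:  {full_size} nodes, accuracy {full_acc:.3f}")
print(f"random 50%: {np.mean(sizes):.1f} nodes (avg), "
      f"accuracy {np.mean(accs):.3f} (avg)")

To perform the decomposition the abstract proposes for a sophisticated data reduction method, one would additionally grow a tree on a random subset matched in size to the instances that method retains: the size reduction achieved by the matched random subset is the part attributable to training set size alone, and the remainder is due to how the method selects instances to discard.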

Cite this Paper


BibTeX
@InProceedings{pmlr-vR1-oates97b,
  title     = {The Effects of Training Set Size on Decision Tree Complexity},
  author    = {Oates, Tim and Jensen, David},
  booktitle = {Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics},
  pages     = {379--390},
  year      = {1997},
  editor    = {Madigan, David and Smyth, Padhraic},
  volume    = {R1},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--07 Jan},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/r1/oates97b/oates97b.pdf},
  url       = {https://proceedings.mlr.press/r1/oates97b.html},
  abstract  = {This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances. This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John’s ROBUST-c4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared.},
  note      = {Reissued by PMLR on 30 March 2021.}
}
Endnote
%0 Conference Paper
%T The Effects of Training Set Size on Decision Tree Complexity
%A Tim Oates
%A David Jensen
%B Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 1997
%E David Madigan
%E Padhraic Smyth
%F pmlr-vR1-oates97b
%I PMLR
%P 379--390
%U https://proceedings.mlr.press/r1/oates97b.html
%V R1
%X This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances. This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into two parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard. We perform this decomposition for one recent data reduction technique, John’s ROBUST-c4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set. We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared.
%Z Reissued by PMLR on 30 March 2021.
APA
Oates, T. & Jensen, D. (1997). The Effects of Training Set Size on Decision Tree Complexity. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R1:379-390. Available from https://proceedings.mlr.press/r1/oates97b.html. Reissued by PMLR on 30 March 2021.