A Distance Metric for Classification Trees

William D. Shannon, David Banks
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:457-464, 1997.

Abstract

CART is an unstable classifier resulting in significant changes in tree structure for small changes in the learning set (Breiman, Friedman et al. 1984; Breiman 1994). To address this problem, research into combining classifiers has increased significantly in the last few years (Breiman 1992; Wolpert 1992; Breiman 1994). These methods are of two basic types: concatenation uses the output from one classifier as input to the next classifier; parallel classifiers work on the same input data with the output from each classifier combined using regression or vote-counting techniques (Schurman 1996). These strategies greatly improve the predictive power of unstable classifiers. However, when the goal of the statistical analysis is to learn about the relationship between outcome and predictors, these strategies for combining classifiers are unacceptable since they produce a large number of trees, making interpretation difficult. We present a new method for combining classification trees which results in a single, interpretable tree. We begin by defining a distance metric between two trees based on the amount of rearrangement needed so that the structure of the two trees is identical. Using this distance metric, we develop the concept of the central, or median, tree structure and estimate it using a consensus rule. This tree is seen to be more centrally located than the tree fit to all the data. We finish by discussing future work including alternative methods for estimating the median tree, probability models, uses in data mining and meta-analysis, and performance measurements of the median tree.

Cite this Paper


BibTeX
@InProceedings{pmlr-vR1-shannon97a, title = {A Distance Metric for Classification Trees}, author = {Shannon, William D. and Banks, David}, booktitle = {Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics}, pages = {457--464}, year = {1997}, editor = {Madigan, David and Smyth, Padhraic}, volume = {R1}, series = {Proceedings of Machine Learning Research}, month = {04--07 Jan}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/r1/shannon97a/shannon97a.pdf}, url = {https://proceedings.mlr.press/r1/shannon97a.html}, abstract = {CART is an unstable classifier resulting in significant changes in tree structure for small changes in the learning set (Breiman, Friedman et al. 1984; Breiman 1994). To address this problem, research into combining classifiers has increased significantly in the last few years (Breiman 1992; Wolpert 1992; Breiman 1994). These methods are of two basic types: concatenation uses the output from one classifier as input to the next classifier; parallel classifiers work on the same input data with the output from each classifier combined using regression or vote-counting techniques (Schurman 1996). These strategies greatly improve the predictive power of unstable classifiers. However, when the goal of the statistical analysis is to learn about the relationship between outcome and predictors, these strategies for combining classifiers are unacceptable since they produce a large number of trees, making interpretation difficult. We present a new method for combining classification trees which results in a single, interpretable tree. We begin by defining a distance metric between two trees based on the amount of rearrangement needed so that the structure of the two trees is identical. Using this distance metric, we develop the concept of the central, or median, tree structure and estimate it using a consensus rule. This tree is seen to be more centrally located than the tree fit to all the data. We finish by discussing future work including alternative methods for estimating the median tree, probability models, uses in data mining and meta-analysis, and performance measurements of the median tree.}, note = {Reissued by PMLR on 30 March 2021.} }
Endnote
%0 Conference Paper %T A Distance Metric for Classification Trees %A William D. Shannon %A David Banks %B Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 1997 %E David Madigan %E Padhraic Smyth %F pmlr-vR1-shannon97a %I PMLR %P 457--464 %U https://proceedings.mlr.press/r1/shannon97a.html %V R1 %X CART is an unstable classifier resulting in significant changes in tree structure for small changes in the learning set (Breiman, Friedman et al. 1984; Breiman 1994). To address this problem, research into combining classifiers has increased significantly in the last few years (Breiman 1992; Wolpert 1992; Breiman 1994). These methods are of two basic types: concatenation uses the output from one classifier as input to the next classifier; parallel classifiers work on the same input data with the output from each classifier combined using regression or vote-counting techniques (Schurman 1996). These strategies greatly improve the predictive power of unstable classifiers. However, when the goal of the statistical analysis is to learn about the relationship between outcome and predictors, these strategies for combining classifiers are unacceptable since they produce a large number of trees, making interpretation difficult. We present a new method for combining classification trees which results in a single, interpretable tree. We begin by defining a distance metric between two trees based on the amount of rearrangement needed so that the structure of the two trees is identical. Using this distance metric, we develop the concept of the central, or median, tree structure and estimate it using a consensus rule. This tree is seen to be more centrally located than the tree fit to all the data. We finish by discussing future work including alternative methods for estimating the median tree, probability models, uses in data mining and meta-analysis, and performance measurements of the median tree. %Z Reissued by PMLR on 30 March 2021.
APA
Shannon, W.D. & Banks, D.. (1997). A Distance Metric for Classification Trees. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R1:457-464 Available from https://proceedings.mlr.press/r1/shannon97a.html. Reissued by PMLR on 30 March 2021.

Related Material