[edit]
A Distance Metric for Classification Trees
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:457-464, 1997.
Abstract
CART is an unstable classifier resulting in significant changes in tree structure for small changes in the learning set (Breiman, Friedman et al. 1984; Breiman 1994). To address this problem, research into combining classifiers has increased significantly in the last few years (Breiman 1992; Wolpert 1992; Breiman 1994). These methods are of two basic types: concatenation uses the output from one classifier as input to the next classifier; parallel classifiers work on the same input data with the output from each classifier combined using regression or vote-counting techniques (Schurman 1996). These strategies greatly improve the predictive power of unstable classifiers. However, when the goal of the statistical analysis is to learn about the relationship between outcome and predictors, these strategies for combining classifiers are unacceptable since they produce a large number of trees, making interpretation difficult. We present a new method for combining classification trees which results in a single, interpretable tree. We begin by defining a distance metric between two trees based on the amount of rearrangement needed so that the structure of the two trees is identical. Using this distance metric, we develop the concept of the central, or median, tree structure and estimate it using a consensus rule. This tree is seen to be more centrally located than the tree fit to all the data. We finish by discussing future work including alternative methods for estimating the median tree, probability models, uses in data mining and meta-analysis, and performance measurements of the median tree.