An Exact Probability Metric for Decision Tree Splitting and Stopping
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, PMLR R0:379-385, 1995.
ID3’s information gain heuristic is well known to be biased towards multi-valued attributes. This bias is only partially compensated by the gain ratio used in C4.5. Several alternatives have been proposed, notably orthogonality and Beta. Gain ratio and orthogonality are strongly correlated, and all of these metrics share a common bias towards splits with one or more small expected values, under circumstances where the split likely occurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the posterior probability of the null hypothesis. Both gain and the chi-squared significance test are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules in CART [3, pp. 59-66] and ID3 [20, pp. 36-37] are traced to coupling these biased approximations with one another or with arbitrary thresholds; these problems are overcome by the hypergeometric. Empirical results show that pre-pruning should be done: trees pruned in this way are simpler, more efficient, and no less accurate than unpruned trees. Average training time is reduced by up to 30%, and expensive post-pruning is avoided.
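The exact metric referred to above is the multiple hypergeometric probability of the split's contingency table given its fixed row and column totals, P = (∏ rᵢ!)(∏ cⱼ!)/(N! ∏ nᵢⱼ!). A minimal sketch of that computation (the function name `hypergeom_prob` and the table layout are illustrative, not from the paper):

```python
from math import factorial

def hypergeom_prob(table):
    """Exact multiple hypergeometric probability of a contingency table,
    conditioned on its row and column totals:
        P = (prod r_i!) * (prod c_j!) / (N! * prod n_ij!)
    `table` is a list of rows; cells are non-negative integer counts
    (e.g. rows = split branches, columns = classes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    numerator = 1
    for total in row_totals + col_totals:
        numerator *= factorial(total)

    denominator = factorial(n)
    for row in table:
        for cell in row:
            denominator *= factorial(cell)

    return numerator / denominator

# A perfectly separating 2x2 split on 4 examples:
# rows (2, 2), columns (2, 2), so P = 2!2!2!2! / (4! * 2!2!) = 1/6.
print(hypergeom_prob([[2, 0], [0, 2]]))
```

For a 2×2 table this reduces to the point probability used by Fisher's exact test; small values indicate a class/attribute association unlikely to have arisen by chance under the null hypothesis.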