[edit]

# Asessing and Improving Classification Rules

*Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics*, PMLR R1:243-254, 1997.

#### Abstract

The last few years have witnessed a resurgence of research effort aimed at developing improved techniques for supervised classification problems. In a large part this resurgence of interest has been stimulated by the novelty of multi-layer feedforward neural networks (Hertz et al, 1991; Ripley, 1996) and similar complex and flexible models such as MARS (Friedman, 1991), projection pursuit regression (Friedman and Stuetzle, 1981), and additive models in general (Hastie and Tibshirani, 1990)). The flexibility of these models is in striking contrast to the simplicity of models such as simple linear discriminant analysis, perceptrons, and logistic discriminant analysis, which assume highly restricted forms of decision surface. The merit of the flexibility of neural networks is countered by the dangers that they will overfit the design data. This relationship between model flexibility and the danger of overfitting has long been understood within the statistical community. For example, in the 1960s the optimistic bias of resubstitution error rate became widely recognised and it was replaced by the leave-one-out method as the method of choice. (Later, in the 1980s, an apparently large variance of the latter led to its being abandoned in favour of bootstrap methods, in particular the 632 bootstrap.) Early work on neural networks also fell into this trap, producing inflated claims of the performance of such models .derived from optimistic performance measures based on overfitting the design set. In recent years the risk has been recognised, and some sophisticated proposals have been made for overcoming the problem. They are based on ideas such as penalising the goodness of fit measure (by combining it with a measure of model complexity), restricting the form of the model (to few nodes in a network, for example), shrinking an overfitted model .(by weight decay, for example), or even by adding randomly perturbed replicates to the design set. The problem with all such methods is \emph{how} to strike the optimum compromise between modelling the design data and overfitting.