Which method learns most from the data? Methodological Issues in the Analysis of Comparative Studies
Pre-proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, PMLR R0:219-225, 1995.
Abstract
The mutual discovery of the statistical and artificial intelligence communities (see e.g. [Han93, CO94]) has resulted in many studies that compare the performance of statistical and machine learning methods on empirical data sets; examples are the StatLog project ([MST94]) and the Santa Fe Time Series Competition ([WG94]), as well as numerous journal articles ([KWR93, RABCK93, WHR90, TAF91, TK92, FG93]). What has struck us is the casual manner in which such comparisons are typically carried out in the literature. The ranking of $k$ preselected methods is performed by training (estimating, in statistical terminology) them on a single data set and estimating their respective mean prediction errors (MPE) from a hold-out sample. The methods are subsequently ranked according to their estimated MPEs. When the total number of observations is small, cross-validation rather than a hold-out sample is usually used to estimate the mean prediction errors. A more rigorous comparison of methods should include significance testing rather than a mere ranking based on the estimated MPEs. The statistical analysis of comparative studies, method ranking in particular, is addressed in this paper. Specifically, we address methodological issues of studies in which the performance of several regression or classification methods is compared on empirical data sets.
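To make the contrast concrete, the sketch below illustrates the practice the abstract describes and the remedy it advocates: ranking several regression methods by their hold-out MPE, then adding a paired significance test on the per-observation errors of the two best methods. The particular methods, the synthetic data set, and the choice of a paired t-test are illustrative assumptions, not the paper's own procedure.

```python
# Illustrative sketch (assumptions, not the paper's procedure): rank k methods
# by hold-out mean prediction error (MPE), then test whether the difference
# between the two best methods exceeds what sampling variation would explain.
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

methods = {
    "linear regression": LinearRegression(),
    "k-nearest neighbours": KNeighborsRegressor(n_neighbors=5),
    "regression tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}

# Squared prediction error per hold-out observation, for each method.
errors = {}
for name, model in methods.items():
    model.fit(X_train, y_train)
    errors[name] = (model.predict(X_test) - y_test) ** 2

# Naive ranking by estimated MPE alone -- the practice the paper questions.
ranking = sorted(errors, key=lambda name: errors[name].mean())
for name in ranking:
    print(f"{name}: estimated MPE = {errors[name].mean():.2f}")

# Paired test on the two top-ranked methods: is the observed MPE difference
# statistically significant, or attributable to sampling variation?
best, runner_up = ranking[0], ranking[1]
t_stat, p_value = stats.ttest_rel(errors[best], errors[runner_up])
print(f"paired t-test {best} vs. {runner_up}: t = {t_stat:.2f}, p = {p_value:.3f}")
```

A large p-value here would indicate that the hold-out sample does not support preferring the top-ranked method over the runner-up, which is precisely the kind of conclusion a ranking of point estimates alone cannot convey.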