Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:115-122, 1997.
Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.