Overfitting Explained

Paul R. Cohen, David Jensen
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:115-122, 1997.

Abstract

Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.
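The effect described in the abstract can be made concrete with a small simulation: under the null hypothesis, a threshold calibrated for a single variate is far too lenient when applied to the best of several candidates. The sketch below is illustrative only (it is not from the paper); the choice of ten candidates, standard normal scores, and a one-sided 5% threshold are all assumptions for the example.

```python
import random

random.seed(0)

K = 10          # number of candidate model components compared (assumed for illustration)
TRIALS = 10_000  # Monte Carlo trials
Z_CRIT = 1.645   # one-sided 5% critical value for a single standard normal variate

# False-positive rate when a single null variate is tested against Z_CRIT
single_hits = sum(random.gauss(0, 1) > Z_CRIT for _ in range(TRIALS))

# False-positive rate when the *best of K* null variates is tested
# against the same single-variate threshold -- the wrong reference distribution
max_hits = sum(
    max(random.gauss(0, 1) for _ in range(K)) > Z_CRIT
    for _ in range(TRIALS)
)

print(f"single variate: {single_hits / TRIALS:.3f}")  # close to 0.05
print(f"max of {K}:      {max_hits / TRIALS:.3f}")     # close to 1 - 0.95**10, about 0.40
```

Because the maximum of K independent variates exceeds the threshold whenever any one of them does, the true false-positive rate is 1 − 0.95^K, not 0.05, so a component can look significant purely because it won the comparison.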

Cite this Paper


BibTeX
@InProceedings{pmlr-vR1-cohen97a,
  title = {Overfitting Explained},
  author = {Cohen, Paul R. and Jensen, David},
  booktitle = {Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics},
  pages = {115--122},
  year = {1997},
  editor = {Madigan, David and Smyth, Padhraic},
  volume = {R1},
  series = {Proceedings of Machine Learning Research},
  month = {04--07 Jan},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/r1/cohen97a/cohen97a.pdf},
  url = {https://proceedings.mlr.press/r1/cohen97a.html},
  abstract = {Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.},
  note = {Reissued by PMLR on 30 March 2021.}
}
Endnote
%0 Conference Paper
%T Overfitting Explained
%A Paul R. Cohen
%A David Jensen
%B Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 1997
%E David Madigan
%E Padhraic Smyth
%F pmlr-vR1-cohen97a
%I PMLR
%P 115--122
%U https://proceedings.mlr.press/r1/cohen97a.html
%V R1
%X Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.
%Z Reissued by PMLR on 30 March 2021.
APA
Cohen, P.R. & Jensen, D. (1997). Overfitting Explained. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R1:115-122. Available from https://proceedings.mlr.press/r1/cohen97a.html. Reissued by PMLR on 30 March 2021.

Related Material