Cross-validated Likelihood for Model Selection in Unsupervised Learning

Padhraic Smyth
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:473-480, 1997.

Abstract

Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cross-validated classification error (Breiman et al., 1984) and variable selection in linear regression using cross-validated predictive squared error (Hjort, 1995). Cross-validation is less often used in unsupervised learning such as clustering. It is popular in kernel density estimation for choosing the smoothing parameter (the kernel bandwidth). However, it does not appear to have been used for determining cluster structure, i.e., for deciding how many clusters to fit to a given data set. This is the problem addressed in this paper. Cross-validated likelihood can be viewed as an appropriate metric for model selection in probabilistic clustering, in particular for finite mixture models. In this paper, the use of cross-validated likelihood for clustering is investigated and the method is applied to a real problem where "truth" is unknown, i.e., determining the number of intrinsic "regimes" in records of upper atmosphere pressure taken daily since 1948 over the Northern Hemisphere.
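As a rough illustration of the idea in the abstract, the sketch below scores candidate numbers of mixture components by held-out log-likelihood. It is not the paper's procedure: the 1-D Gaussian mixture, the plain EM fit, the 5-fold split, the synthetic two-cluster data, and all function names (`fit_mixture`, `cv_loglik`, and so on) are assumptions made for this example, and the abstract does not say which splitting scheme the paper uses.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def fit_mixture(data, k, n_iter=50, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative)."""
    ordered = sorted(data)
    n = len(ordered)
    # Deterministic start: means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    spread = max(sum((x - centre) ** 2 for x in data) / n, min_var)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means, and (floored) variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)
            weights[j] = nj / n
    return weights, means, variances


def heldout_loglik(data, params):
    """Total log-likelihood of data under a fitted mixture."""
    weights, means, variances = params
    return sum(
        logsumexp([math.log(w) + log_gauss(x, m, v)
                   for w, m, v in zip(weights, means, variances)])
        for x in data
    )


def cv_loglik(data, k, n_folds=5):
    """Held-out log-likelihood per point, pooled over the folds."""
    total = 0.0
    for f in range(n_folds):
        test = data[f::n_folds]
        train = [x for i, x in enumerate(data) if i % n_folds != f]
        total += heldout_loglik(test, fit_mixture(train, k))
    return total / len(data)


# Synthetic data with two well-separated clusters (an assumption for the demo).
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(150)]
data += [rng.gauss(3.0, 1.0) for _ in range(150)]
rng.shuffle(data)

scores = {k: cv_loglik(data, k) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: held-out log-likelihood per point = {s:.3f}")
print("selected k:", best_k)
```

Unlike the training likelihood, which cannot decrease as components are added, the held-out likelihood can fall once extra components start fitting noise, which is what makes it usable as a selection criterion.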

Cite this Paper


BibTeX
@InProceedings{pmlr-vR1-smyth97a,
  title = {Cross-validated Likelihood for Model Selection in Unsupervised Learning},
  author = {Smyth, Padhraic},
  booktitle = {Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics},
  pages = {473--480},
  year = {1997},
  editor = {Madigan, David and Smyth, Padhraic},
  volume = {R1},
  series = {Proceedings of Machine Learning Research},
  month = {04--07 Jan},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/r1/smyth97a/smyth97a.pdf},
  url = {https://proceedings.mlr.press/r1/smyth97a.html},
  abstract = {Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cross-validated classification error (Breiman et al., 1984) and variable selection in linear regression using cross-validated predictive squared error (Hjort, 1995). Cross-validation is less often used in \emph{unsupervised} learning such as clustering. It is popular in kernel density estimation for choosing the smoothing parameter (the kernel bandwidth). However, it does not appear to have been used for determining cluster structure, i.e., for deciding how many clusters to fit to a given data set. This is the problem addressed in this paper. Cross-validated likelihood can be viewed as an appropriate metric for model selection in probabilistic clustering, in particular for finite mixture models. In this paper, the use of cross-validated likelihood for clustering is investigated and the method is applied to a real problem where "truth" is unknown, i.e., determining the number of intrinsic "regimes" in records of upper atmosphere pressure taken daily since 1948 over the Northern Hemisphere.},
  note = {Reissued by PMLR on 30 March 2021.}
}
Endnote
%0 Conference Paper
%T Cross-validated Likelihood for Model Selection in Unsupervised Learning
%A Padhraic Smyth
%B Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 1997
%E David Madigan
%E Padhraic Smyth
%F pmlr-vR1-smyth97a
%I PMLR
%P 473--480
%U https://proceedings.mlr.press/r1/smyth97a.html
%V R1
%X Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cross-validated classification error (Breiman et al., 1984) and variable selection in linear regression using cross-validated predictive squared error (Hjort, 1995). Cross-validation is less often used in unsupervised learning such as clustering. It is popular in kernel density estimation for choosing the smoothing parameter (the kernel bandwidth). However, it does not appear to have been used for determining cluster structure, i.e., for deciding how many clusters to fit to a given data set. This is the problem addressed in this paper. Cross-validated likelihood can be viewed as an appropriate metric for model selection in probabilistic clustering, in particular for finite mixture models. In this paper, the use of cross-validated likelihood for clustering is investigated and the method is applied to a real problem where "truth" is unknown, i.e., determining the number of intrinsic "regimes" in records of upper atmosphere pressure taken daily since 1948 over the Northern Hemisphere.
%Z Reissued by PMLR on 30 March 2021.
APA
Smyth, P. (1997). Cross-validated Likelihood for Model Selection in Unsupervised Learning. Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R1:473-480. Available from https://proceedings.mlr.press/r1/smyth97a.html. Reissued by PMLR on 30 March 2021.
