Cross-validated Likelihood for Model Selection in Unsupervised Learning
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:473-480, 1997.
Abstract
Cross-validation is a well-known technique in supervised learning for selecting a model from a family of candidate models. Examples include selecting the best classification tree using cross-validated classification error (Breiman et al., 1984) and variable selection in linear regression using cross-validated predictive squared error (Hjort, 1995). Cross-validation is less commonly used in unsupervised learning such as clustering. It is popular in kernel density estimation for choosing the smoothing parameter (the kernel bandwidth). However, it does not appear to have been used for determining cluster structure in clustering problems, i.e., deciding how many clusters to fit to a given data set. This is the problem addressed in this paper. Cross-validated likelihood can be viewed as an appropriate metric for model selection in probabilistic clustering, in particular for finite mixture models. In this paper, the use of cross-validated likelihood for clustering is investigated and the method is applied to a real problem where "truth" is unknown, i.e., determining the number of intrinsic "regimes" in records of upper atmosphere pressure taken daily since 1948 over the Northern Hemisphere.
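
As a rough illustration of the idea, the sketch below scores mixture models with different numbers of components by their average held-out log-likelihood. It is not the paper's implementation: the abstract does not specify the component family, the number of folds, or the fitting procedure, so the one-dimensional Gaussian components, the 5-fold split, the quantile-based EM initialisation, the variance floor, and the synthetic two-cluster data are all assumptions made here for illustration. All function names are hypothetical.

```python
import math
import random


def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def logsumexp(vals):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def point_logs(x, params):
    """Per-component values of log(weight) + log density at x."""
    weights, mus, variances = params
    return [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, mus, variances)]


def fit_mixture(data, k, n_iter=100, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM (assumed setup)."""
    n = len(data)
    ordered = sorted(data)
    # Assumption: start the means at evenly spaced quantiles of the data.
    mus = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    mean = sum(data) / n
    var0 = max(sum((x - mean) ** 2 for x in data) / n, var_floor)
    params = ([1.0 / k] * k, mus, [var0] * k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            logs = point_logs(x, params)
            z = logsumexp(logs)
            resp.append([math.exp(l - z) for l in logs])
        # M-step: re-estimate weights, means, and floored variances.
        counts = [sum(r[j] for r in resp) + 1e-10 for j in range(k)]
        mus = [sum(r[j] * x for r, x in zip(resp, data)) / counts[j]
               for j in range(k)]
        variances = [
            max(sum(r[j] * (x - mus[j]) ** 2
                    for r, x in zip(resp, data)) / counts[j], var_floor)
            for j in range(k)
        ]
        total = sum(counts)
        params = ([c / total for c in counts], mus, variances)
    return params


def log_likelihood(data, params):
    """Total log-likelihood of data under the fitted mixture."""
    return sum(logsumexp(point_logs(x, params)) for x in data)


def cv_log_likelihood(data, k, n_folds=5, seed=0):
    """Average held-out log-likelihood per point over n_folds folds."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    total = 0.0
    for f in range(n_folds):
        test = shuffled[f::n_folds]
        train = [x for i, x in enumerate(shuffled) if i % n_folds != f]
        total += log_likelihood(test, fit_mixture(train, k))
    return total / len(shuffled)


# Synthetic toy data (not the atmospheric records): two separated clusters.
rng = random.Random(0)
data = ([rng.gauss(-5.0, 1.0) for _ in range(100)]
        + [rng.gauss(5.0, 1.0) for _ in range(100)])

scores = {k: cv_log_likelihood(data, k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because each model is scored only on data it was not fitted to, adding components stops paying off once the extra components model noise rather than structure, which is the property that makes held-out likelihood usable as a selection criterion. On the toy data, a single component should score clearly worse than two.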