Finding Overlapping Distributions with MML
Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, PMLR R1:23-30, 1997.
This paper considers an aspect of mixture modelling. Previous studies have shown minimum message length (MML) estimation to perform well in a wide variety of mixture modelling problems, including determining the number of components that best describes some data. In this paper, we focus on the difficult problem of overlapping components. An advantage of the probabilistic mixture modelling approach is its ability to identify models where the components overlap and data items can belong probabilistically to more than one component. Significantly overlapping distributions require more data for their parameters to be accurately estimated than well separated distributions. For example, two Gaussian distributions are considered to significantly overlap when their means are within three standard deviations of each other. If insufficient data is available, only a single component distribution will be estimated, although the data originates from two component distributions. In this paper, we quantify this difficulty in terms of the number of data items needed for the MML criterion to "discover" two overlapping components. First, we perform experiments which compare the MML criterion's performance with that of other Bayesian criteria based on MCMC sampling. Second, we make two alterations to the existing MML estimates in order to improve their performance on overlapping distributions. Experiments are performed with the new estimates to confirm that they are effective.
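To make the overlap rule in the abstract concrete, the following minimal Python sketch checks whether two Gaussian components count as significantly overlapping and draws data from such a two-component mixture. It is an illustration only, not the paper's method: the function names, the assumption that both components share one standard deviation, the equal default mixing weight, and the fixed seed are all hypothetical choices made for this example.

```python
import random


def significantly_overlap(mu1, mu2, sigma, k=3.0):
    """Return True when the two means lie within k standard deviations.

    Assumption: both components share the same standard deviation sigma.
    The default k=3 follows the three-standard-deviation rule of thumb.
    """
    return abs(mu1 - mu2) <= k * sigma


def sample_two_component_mixture(n, mu1, mu2, sigma, weight1=0.5, seed=0):
    """Draw n values from a two-component Gaussian mixture.

    Assumption: common sigma; weight1 is the probability of component 1.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    data = []
    for _ in range(n):
        # Pick a component, then sample from its Gaussian.
        mu = mu1 if rng.random() < weight1 else mu2
        data.append(rng.gauss(mu, sigma))
    return data


# Means 2 standard deviations apart: significantly overlapping.
close = significantly_overlap(0.0, 2.0, 1.0)
# Means 5 standard deviations apart: well separated.
far = significantly_overlap(0.0, 5.0, 1.0)
# A small sample of the kind from which two components are hard to recover.
sample = sample_two_component_mixture(50, 0.0, 2.0, 1.0)
```

Varying `n` in `sample_two_component_mixture` gives data sets of different sizes on which a model selection criterion could be tried, which is the kind of question the abstract poses about how many data items are needed.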