Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1143-1151, 2014.
Abstract
Mutual information is a widely used measure for comparing clusterings. Previous work has shown that it is beneficial to adjust this measure for chance by subtracting an expected value and normalizing via an upper bound. This yields the constant baseline property, which makes the measure more intuitive to interpret. In this paper, we argue that a further type of statistical adjustment for the mutual information is also beneficial: an adjustment to correct selection bias. This type of adjustment is useful when carrying out many clustering comparisons to select one or more preferred clusterings. It reduces the tendency of the mutual information to favor clustering solutions (i) with more clusters, or (ii) induced on fewer data points, relative to a reference clustering. We term our new adjusted measure the *standardized mutual information*. It requires computing the variance of the mutual information under a hypergeometric model of randomness, which is technically challenging. We derive an analytical formula for this variance and analyze its computational complexity. We then experimentally assess how our new measure addresses selection bias and also increases interpretability. We recommend using the standardized mutual information when making multiple clustering comparisons in situations where the number of records is small compared to the number of clusters considered.
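The abstract describes both adjustments as a subtraction of the null expectation followed by a normalization: the adjusted mutual information divides by an upper bound minus the expectation, while the standardized variant divides by the standard deviation under the hypergeometric (fixed-marginals) model. As a rough illustration only, the sketch below estimates that null mean and variance by Monte Carlo permutation rather than by the paper's analytical formula; the function name `standardized_mutual_info` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def standardized_mutual_info(labels_a, labels_b, n_perm=10_000, seed=0):
    """Monte Carlo sketch of the standardized mutual information (SMI).

    Shuffling one labeling keeps both sets of cluster sizes fixed, so the
    permutations sample from the hypergeometric (fixed-marginals) null
    model; the paper computes E[MI] and Var[MI] under this model
    analytically, which this simulation only approximates.
    """
    rng = np.random.default_rng(seed)
    mi = mutual_info_score(labels_a, labels_b)
    b = np.asarray(labels_b)
    null_mis = np.array(
        [mutual_info_score(labels_a, rng.permutation(b)) for _ in range(n_perm)]
    )
    # Standardize: subtract the null mean, divide by the null standard deviation.
    return (mi - null_mis.mean()) / null_mis.std(ddof=1)

# Example: score a candidate clustering against a reference labeling.
reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]
candidate = [0, 0, 1, 1, 2, 2, 0, 1, 2]
print(standardized_mutual_info(reference, candidate, n_perm=2000))
```

Permuting one labeling conditions on both sets of cluster sizes, matching the hypergeometric model's conditioning on the contingency-table margins; the paper's closed-form variance would replace the sampling loop.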