Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance

Simone Romano, James Bailey, Vinh Nguyen, Karin Verspoor
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1143-1151, 2014.

Abstract

Mutual information is a very popular measure for comparing clusterings. Previous work has shown that it is beneficial to adjust this measure for chance by subtracting an expected value and normalizing via an upper bound. This yields the constant baseline property that enhances intuitiveness. In this paper, we argue that a further type of statistical adjustment to the mutual information is also beneficial: an adjustment to correct selection bias. This type of adjustment is useful when carrying out many clustering comparisons in order to select one or more preferred clusterings. It reduces the tendency of the mutual information to favor clustering solutions that i) have more clusters, or ii) are induced on fewer data points, when compared against a reference clustering. We term our new adjusted measure the *standardized mutual information*. It requires computing the variance of the mutual information under a hypergeometric model of randomness, which is technically challenging. We derive an analytical formula for this variance and analyze its complexity. We then experimentally assess how our new measure addresses selection bias and also increases interpretability. We recommend using the standardized mutual information when making multiple clustering comparisons in situations where the number of records is small compared to the number of clusters considered.
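
As context for the abstract, the two adjustments can be written out explicitly. The following is a sketch using standard definitions: I(U,V) is the mutual information between clusterings U and V, H(·) is entropy, and the expectation and variance are taken under the hypergeometric (fixed-marginals) model of randomness. The max-entropy upper bound shown for AMI is one common choice among several, not necessarily the exact normalizer adopted in the paper.

```latex
% Adjusted MI: subtract the expected value and normalize by an upper
% bound (max{H(U),H(V)} is one common choice; others are also used).
\mathrm{AMI}(U,V) = \frac{I(U,V) - \mathbb{E}[I(U,V)]}
                         {\max\{H(U),H(V)\} - \mathbb{E}[I(U,V)]}

% Standardized MI: normalize the centered MI by its standard deviation
% under the same hypergeometric (fixed-marginals) model.
\mathrm{SMI}(U,V) = \frac{I(U,V) - \mathbb{E}[I(U,V)]}
                         {\sqrt{\mathrm{Var}(I(U,V))}}
```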
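For intuition about what the measure computes, here is a minimal Monte Carlo sketch, not the paper's analytical formula (the authors derive Var[MI] in closed form). It estimates E[MI] and Var[MI] by randomly permuting one clustering's labels, which keeps both clusterings' marginals fixed, matching the hypergeometric model. The function name and defaults are illustrative.

```python
# Monte Carlo estimate of the standardized mutual information (SMI).
# Permuting labels preserves both sets of cluster sizes exactly, so the
# permutation null matches the hypergeometric (fixed-marginals) model.
import numpy as np
from sklearn.metrics import mutual_info_score

def standardized_mutual_info(labels_u, labels_v, n_perm=10000, seed=0):
    """Return (MI - E[MI]) / sqrt(Var[MI]) estimated via permutations."""
    rng = np.random.default_rng(seed)
    mi = mutual_info_score(labels_u, labels_v)
    v = np.asarray(labels_v)
    perm_mis = np.array([mutual_info_score(labels_u, rng.permutation(v))
                         for _ in range(n_perm)])
    return (mi - perm_mis.mean()) / perm_mis.std(ddof=1)

# Example: compare a candidate clustering against a reference one.
reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]
candidate = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(standardized_mutual_info(reference, candidate, n_perm=2000))
```

Because the permutation estimate converges slowly, the paper's closed-form variance is preferable in practice; the sketch above is only for intuition.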

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-romano14,
  title     = {Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance},
  author    = {Romano, Simone and Bailey, James and Nguyen, Vinh and Verspoor, Karin},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {1143--1151},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/romano14.pdf},
  url       = {https://proceedings.mlr.press/v32/romano14.html}
}
APA
Romano, S., Bailey, J., Nguyen, V., & Verspoor, K. (2014). Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1143-1151. Available from https://proceedings.mlr.press/v32/romano14.html.