Fair and Fast k-Center Clustering for Data Summarization

Haris Angelidakis, Adam Kurpisz, Leon Sering, Rico Zenklusen
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:669-702, 2022.

Abstract

We consider two key issues faced by many clustering methods when used for data summarization, namely (a) an unfair representation of "demographic groups” and (b) distorted summarizations, where data points in the summary represent subsets of the original data of vastly different sizes. Previous work made important steps towards handling separately each of these two issues in the context of the fundamental k-Center clustering objective through the study of fast algorithms for natural models that address them. We show that it is possible to effectively address both (a) and (b) simultaneously by presenting a clustering procedure that works for a canonical combined model and (i) is fast, both in theory and practice, (ii) exhibits a worst-case constant-factor guarantee, and (iii) gives promising computational results showing that there can be significant benefits in addressing both issues together instead of sequentially.

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-angelidakis22a, title = {Fair and Fast k-Center Clustering for Data Summarization}, author = {Angelidakis, Haris and Kurpisz, Adam and Sering, Leon and Zenklusen, Rico}, booktitle = {Proceedings of the 39th International Conference on Machine Learning}, pages = {669--702}, year = {2022}, editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan}, volume = {162}, series = {Proceedings of Machine Learning Research}, month = {17--23 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v162/angelidakis22a/angelidakis22a.pdf}, url = {https://proceedings.mlr.press/v162/angelidakis22a.html}, abstract = {We consider two key issues faced by many clustering methods when used for data summarization, namely (a) an unfair representation of "demographic groups” and (b) distorted summarizations, where data points in the summary represent subsets of the original data of vastly different sizes. Previous work made important steps towards handling separately each of these two issues in the context of the fundamental k-Center clustering objective through the study of fast algorithms for natural models that address them. We show that it is possible to effectively address both (a) and (b) simultaneously by presenting a clustering procedure that works for a canonical combined model and (i) is fast, both in theory and practice, (ii) exhibits a worst-case constant-factor guarantee, and (iii) gives promising computational results showing that there can be significant benefits in addressing both issues together instead of sequentially.} }
Endnote
%0 Conference Paper %T Fair and Fast k-Center Clustering for Data Summarization %A Haris Angelidakis %A Adam Kurpisz %A Leon Sering %A Rico Zenklusen %B Proceedings of the 39th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2022 %E Kamalika Chaudhuri %E Stefanie Jegelka %E Le Song %E Csaba Szepesvari %E Gang Niu %E Sivan Sabato %F pmlr-v162-angelidakis22a %I PMLR %P 669--702 %U https://proceedings.mlr.press/v162/angelidakis22a.html %V 162 %X We consider two key issues faced by many clustering methods when used for data summarization, namely (a) an unfair representation of "demographic groups” and (b) distorted summarizations, where data points in the summary represent subsets of the original data of vastly different sizes. Previous work made important steps towards handling separately each of these two issues in the context of the fundamental k-Center clustering objective through the study of fast algorithms for natural models that address them. We show that it is possible to effectively address both (a) and (b) simultaneously by presenting a clustering procedure that works for a canonical combined model and (i) is fast, both in theory and practice, (ii) exhibits a worst-case constant-factor guarantee, and (iii) gives promising computational results showing that there can be significant benefits in addressing both issues together instead of sequentially.
APA
Angelidakis, H., Kurpisz, A., Sering, L. & Zenklusen, R.. (2022). Fair and Fast k-Center Clustering for Data Summarization. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:669-702 Available from https://proceedings.mlr.press/v162/angelidakis22a.html.

Related Material