Measures of diversity and space-filling designs for categorical data

Cedric Malherbe, Emilio Domı́nguez-Sánchez, Merwan Barlier, Igor Colin, Haitham Bou Ammar, Tom Diethe
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:34499-34528, 2024.

Abstract

Selecting a small subset of items that represent the diversity of a larger population lies at the heart of many data analysis and machine learning applications. However, when it comes to items described by discrete features, the lack of natural ordering and the combinatorial nature of the search space pose significant challenges to the current selection techniques and make existing methods ill-suited. In this paper, we propose to make a step in that direction by proposing novel methods to select subsets of diverse categorical data based on the advances in combinatorial optimization. First, we start to cast the subset selection problem through the lens of the optimization of three diversity metrics. We then provide novel bounds for this problem and present exact solvers that unfortunately come with a high computational cost. To overcome this bottleneck, we go on and show how to employ tools from linear programming and submodular optimization by introducing two computationally plausible methods that still present approximation guarantees about the diversity metrics. Finally, a numerical assessment is provided to illustrate the potential of the designs with respect to state-of-the-art methods.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-malherbe24a, title = {Measures of diversity and space-filling designs for categorical data}, author = {Malherbe, Cedric and Dom\'{\i}nguez-S\'{a}nchez, Emilio and Barlier, Merwan and Colin, Igor and Bou Ammar, Haitham and Diethe, Tom}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {34499--34528}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/malherbe24a/malherbe24a.pdf}, url = {https://proceedings.mlr.press/v235/malherbe24a.html}, abstract = {Selecting a small subset of items that represent the diversity of a larger population lies at the heart of many data analysis and machine learning applications. However, when it comes to items described by discrete features, the lack of natural ordering and the combinatorial nature of the search space pose significant challenges to the current selection techniques and make existing methods ill-suited. In this paper, we propose to make a step in that direction by proposing novel methods to select subsets of diverse categorical data based on the advances in combinatorial optimization. First, we start to cast the subset selection problem through the lens of the optimization of three diversity metrics. We then provide novel bounds for this problem and present exact solvers that unfortunately come with a high computational cost. To overcome this bottleneck, we go on and show how to employ tools from linear programming and submodular optimization by introducing two computationally plausible methods that still present approximation guarantees about the diversity metrics. Finally, a numerical assessment is provided to illustrate the potential of the designs with respect to state-of-the-art methods.} }
Endnote
%0 Conference Paper %T Measures of diversity and space-filling designs for categorical data %A Cedric Malherbe %A Emilio Domı́nguez-Sánchez %A Merwan Barlier %A Igor Colin %A Haitham Bou Ammar %A Tom Diethe %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-malherbe24a %I PMLR %P 34499--34528 %U https://proceedings.mlr.press/v235/malherbe24a.html %V 235 %X Selecting a small subset of items that represent the diversity of a larger population lies at the heart of many data analysis and machine learning applications. However, when it comes to items described by discrete features, the lack of natural ordering and the combinatorial nature of the search space pose significant challenges to the current selection techniques and make existing methods ill-suited. In this paper, we propose to make a step in that direction by proposing novel methods to select subsets of diverse categorical data based on the advances in combinatorial optimization. First, we start to cast the subset selection problem through the lens of the optimization of three diversity metrics. We then provide novel bounds for this problem and present exact solvers that unfortunately come with a high computational cost. To overcome this bottleneck, we go on and show how to employ tools from linear programming and submodular optimization by introducing two computationally plausible methods that still present approximation guarantees about the diversity metrics. Finally, a numerical assessment is provided to illustrate the potential of the designs with respect to state-of-the-art methods.
APA
Malherbe, C., Domı́nguez-Sánchez, E., Barlier, M., Colin, I., Bou Ammar, H. & Diethe, T.. (2024). Measures of diversity and space-filling designs for categorical data. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:34499-34528 Available from https://proceedings.mlr.press/v235/malherbe24a.html.

Related Material