Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data

Anish Acharya, Sujay Sanghavi, Alex Dimakis, Inderjit S Dhillon
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:372-419, 2025.

Abstract

Data pruning – the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median (GM) , a robust estimator with an optimal breakdown point of 1/2; to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings; making it a strong baseline for robust data pruning.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-acharya25b, title = {Geometric Median ({GM}) Matching for Robust k-Subset Selection from Noisy Data}, author = {Acharya, Anish and Sanghavi, Sujay and Dimakis, Alex and Dhillon, Inderjit S}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {372--419}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/acharya25b/acharya25b.pdf}, url = {https://proceedings.mlr.press/v267/acharya25b.html}, abstract = {Data pruning – the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median (GM) , a robust estimator with an optimal breakdown point of 1/2; to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings; making it a strong baseline for robust data pruning.} }
Endnote
%0 Conference Paper %T Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data %A Anish Acharya %A Sujay Sanghavi %A Alex Dimakis %A Inderjit S Dhillon %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-acharya25b %I PMLR %P 372--419 %U https://proceedings.mlr.press/v267/acharya25b.html %V 267 %X Data pruning – the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median (GM) , a robust estimator with an optimal breakdown point of 1/2; to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings; making it a strong baseline for robust data pruning.
APA
Acharya, A., Sanghavi, S., Dimakis, A. & Dhillon, I.S.. (2025). Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:372-419 Available from https://proceedings.mlr.press/v267/acharya25b.html.

Related Material