Robust Sparsification via Sensitivity
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:26446-26463, 2025.
Abstract
Robustness to outliers is important in machine learning. Many classical problems, including subspace embedding, clustering, and low-rank approximation, lack scalable, outlier-resilient algorithms. This paper considers machine learning problems of the form $\min_{x\in \mathbb{R}^d} F(x)$, where $F(x)=\sum_{i=1}^n F_i(x)$, and their robust counterparts $\min_{x\in\mathbb{R}^d} F^{(m)}(x)$, where $F^{(m)}(x)$ denotes the sum of all but the $m$ largest $F_i(x)$ values. We develop a general framework for constructing $\epsilon$-coresets for such robust problems, where an $\epsilon$-coreset is a weighted subset of the functions $\{F_1(x),\ldots,F_n(x)\}$ that provides a $(1+\epsilon)$-approximation to $F(x)$. Specifically, if the original problem $F$ has total sensitivity $T$ and admits a vanilla $\epsilon$-coreset of size $S$, our algorithm constructs an $\epsilon$-coreset of size $\tilde{O}(\frac{mT}{\epsilon})+S$ for the robust objective $F^{(m)}$. This coreset size can be shown to be near-tight for $\ell_2$ subspace embedding. Our coreset algorithm has scalable running time and leads to new or improved algorithms for the robust optimization problems. Empirical evaluations demonstrate that our coresets outperform uniform sampling on real-world data sets.
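As a minimal illustration of the trimmed objective $F^{(m)}$ described above (a sketch for intuition, not code from the paper), the following evaluates the sum of all but the $m$ largest per-point losses at a fixed candidate $x$; the list of losses and the function name are hypothetical:

```python
def robust_objective(losses, m):
    """Evaluate F^(m)(x): the sum of all but the m largest losses.

    `losses` holds the values F_1(x), ..., F_n(x) at a fixed x;
    the m largest entries (e.g., contributions of outliers) are
    discarded before summing.
    """
    if not 0 <= m <= len(losses):
        raise ValueError("m must lie in [0, n]")
    # Sort ascending and keep the n - m smallest values.
    return sum(sorted(losses)[: len(losses) - m])
```

For example, `robust_objective([1.0, 2.0, 100.0], 1)` drops the outlier loss `100.0` and returns `3.0`, whereas the plain objective $F(x)$ would be dominated by it.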