Optimal bounds for ℓp sensitivity sampling via ℓ2 augmentation
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36769-36796, 2024.
Abstract
Data subsampling is one of the most natural methods to approximate a massively large data set by a small representative proxy. In particular, sensitivity sampling has received a lot of attention: it samples points proportionally to an individual importance measure called sensitivity. In very general settings, this framework reduces the size of the data to roughly the VC dimension d times the total sensitivity S while providing strong (1±ε) guarantees on the quality of the approximation. The recent work of Woodruff & Yasuda (2023c) improved substantially over the general Õ(ε^{-2} S d) bound for the important problem of ℓp subspace embeddings, obtaining Õ(ε^{-2} S^{2/p}) for p ∈ [1,2]. Their result was subsumed by an earlier Õ(ε^{-2} S d^{1-p/2}) bound which was implicitly given in the work of Chen & Derezinski (2021). We show that this earlier bound is tight when sampling according to plain ℓp sensitivities. We observe that by augmenting the ℓp sensitivities with ℓ2 sensitivities, we obtain better bounds, improving over the aforementioned results to the optimal linear Õ(ε^{-2}(S + d)) = Õ(ε^{-2} d) sampling complexity for all p ∈ [1,2]. In particular, this resolves an open question of Woodruff & Yasuda (2023c) in the affirmative for p ∈ [1,2] and brings sensitivity subsampling into the regime that was previously only known to be possible using Lewis weights (Cohen & Peng, 2015). As an application of our main result, we also obtain an Õ(ε^{-2} μ d) sensitivity sampling bound for logistic regression, where μ is a natural complexity measure for this problem. This improves over the previous Õ(ε^{-2} μ^2 d) bound of Mai et al. (2021), which was based on Lewis weight subsampling.
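To make the sampling scheme described above concrete, the following is a minimal, hypothetical sketch (not the authors' exact algorithm): it assumes precomputed upper bounds on the ℓp sensitivities are available as `lp_sens`, uses ℓ2 leverage scores as the ℓ2 sensitivities, samples rows proportionally to the augmented scores, and reweights the sampled rows so that the subsampled ℓp norm is unbiased for the original one. The function name and parameters are illustrative only.

```python
import numpy as np

def augmented_sensitivity_sample(A, lp_sens, p=1.0, m=1000, rng=None):
    """Illustrative sketch of lp sensitivity sampling augmented by l2 sensitivities.

    A       : (n, d) data matrix.
    lp_sens : (n,) assumed precomputed upper bounds on the lp sensitivities of the rows.
    p       : norm parameter in [1, 2].
    m       : number of rows to sample (with replacement).
    """
    rng = np.random.default_rng(rng)

    # l2 sensitivities of the rows are the leverage scores; compute them via a thin QR.
    Q, _ = np.linalg.qr(A)
    leverage = np.sum(Q**2, axis=1)

    # Augmented sampling distribution: proportional to lp sensitivity + l2 sensitivity.
    scores = np.asarray(lp_sens) + leverage
    probs = scores / scores.sum()

    # Sample m row indices i.i.d. according to the augmented distribution.
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)

    # Rescale sampled rows so that E[||(sampled, reweighted) x||_p^p] = ||A x||_p^p.
    weights = (1.0 / (m * probs[idx])) ** (1.0 / p)
    return A[idx] * weights[:, None], idx
```

The reweighting makes each sampled term |a_i x|^p / (m p_i) an unbiased estimator of its contribution to ‖Ax‖_p^p; the paper's analysis concerns how large m must be so that this estimate is within a (1±ε) factor simultaneously for all x.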