Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis; Vincent Cohen-Addad; Monika Henzinger; Sammy Jerome; Vahab Mirrokni; David Saulpic; David Woodruff; Michael Wunder

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:2086-2107, 2024.

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of “typical” $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
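The abstract names the two ingredients (k-means clustering and sensitivity sampling) but does not give the sampling probabilities. The sketch below is only an illustration under stated assumptions: it uses a classic coreset-style score, a point's share of the k-means cost plus the inverse of its cluster size, which is not necessarily the score used in the paper. The function names `kmeans` and `sensitivity_sample` and the parameter `m` (sample size) are hypothetical, and the clustering is a plain Lloyd's iteration written with NumPy.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def sensitivity_sample(X, k, m, seed=0):
    """Sample m indices with probability proportional to an assumed score.

    Assumed score (not taken from the paper):
        dist(x, center)^2 / Phi_k + 1 / |cluster(x)|
    Returns (indices, weights) where sum(weights * f[indices]) is an
    unbiased estimate of sum(f) for any fixed per-point quantity f.
    """
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X, k, seed=seed)
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)
    phi_k = d2.sum()  # k-means cost of the embeddings
    cluster_sizes = np.bincount(labels, minlength=k)
    scores = 1.0 / cluster_sizes[labels]
    if phi_k > 0:
        scores = scores + d2 / phi_k
    probs = scores / scores.sum()
    # Sample with replacement and reweight by inverse probability.
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

With embeddings `X` of shape `(n, d)`, calling `idx, w = sensitivity_sample(X, k, m)` returns sampled indices and importance weights; for per-example losses `loss`, the quantity `(w * loss[idx]).sum() / n` then estimates the average loss over the full dataset. Inverse-probability weights keep that estimate unbiased whatever scores are plugged in, so the choice of score affects only the variance.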

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-axiotis24a,
  title = 	 {Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond},
  author =       {Axiotis, Kyriakos and Cohen-Addad, Vincent and Henzinger, Monika and Jerome, Sammy and Mirrokni, Vahab and Saulpic, David and Woodruff, David and Wunder, Michael},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {2086--2107},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/axiotis24a/axiotis24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/axiotis24a.html},
  abstract = 	 {We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Holder continuous, our approach provably allows selecting a set of “typical” $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Holder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.}
}

Endnote

%0 Conference Paper
%T Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
%A Kyriakos Axiotis
%A Vincent Cohen-Addad
%A Monika Henzinger
%A Sammy Jerome
%A Vahab Mirrokni
%A David Saulpic
%A David Woodruff
%A Michael Wunder
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-axiotis24a
%I PMLR
%P 2086--2107
%U https://proceedings.mlr.press/v235/axiotis24a.html
%V 235
%X We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Holder continuous, our approach provably allows selecting a set of “typical” $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Holder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.

APA


Axiotis, K., Cohen-Addad, V., Henzinger, M., Jerome, S., Mirrokni, V., Saulpic, D., Woodruff, D., & Wunder, M. (2024). Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:2086-2107. Available from https://proceedings.mlr.press/v235/axiotis24a.html.
