Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Jiachen T. Wang; Tianji Yang; James Zou; Yongchan Kwon; Ruoxi Jia

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:52033-52063, 2024.

Abstract

Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley’s performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley’s effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-wang24cg,
  title = 	 {Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits},
  author =       {Wang, Jiachen T. and Yang, Tianji and Zou, James and Kwon, Yongchan and Jia, Ruoxi},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {52033--52063},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wang24cg/wang24cg.pdf},
  url = 	 {https://proceedings.mlr.press/v235/wang24cg.html},
  abstract = 	 {Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley’s performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley’s effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.}
}

Endnote

%0 Conference Paper
%T Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits
%A Jiachen T. Wang
%A Tianji Yang
%A James Zou
%A Yongchan Kwon
%A Ruoxi Jia
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wang24cg
%I PMLR
%P 52033--52063
%U https://proceedings.mlr.press/v235/wang24cg.html
%V 235
%X Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley’s performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley’s effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.

APA

Wang, J.T., Yang, T., Zou, J., Kwon, Y. & Jia, R. (2024). Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:52033-52063. Available from https://proceedings.mlr.press/v235/wang24cg.html.
