Distributionally Robust Data Valuation

Xiaoqiang Lin, Xinyi Xu, Zhaoxuan Wu, See-Kiong Ng, Bryan Kian Hsiang Low
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:30362-30391, 2024.

Abstract

Data valuation quantifies the contribution of each data point to the performance of a machine learning model. Existing works typically define the value of data by its improvement of the validation performance of the trained model. However, this approach can be impractical to apply in collaborative machine learning and data marketplace since it is difficult for the parties/buyers to agree on a common validation dataset or determine the exact validation distribution a priori. To address this, we propose a distributionally robust data valuation approach to perform data valuation without known/fixed validation distributions. Our approach defines the value of data by its improvement of the distributionally robust generalization error (DRGE), thus providing a worst-case performance guarantee without a known/fixed validation distribution. However, since computing DRGE directly is infeasible, we propose using model deviation as a proxy for the marginal improvement of DRGE (for kernel regression and neural networks) to compute data values. Furthermore, we identify a notion of uniqueness where low uniqueness characterizes low-value data. We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction).

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-lin24t, title = {Distributionally Robust Data Valuation}, author = {Lin, Xiaoqiang and Xu, Xinyi and Wu, Zhaoxuan and Ng, See-Kiong and Low, Bryan Kian Hsiang}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {30362--30391}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/lin24t/lin24t.pdf}, url = {https://proceedings.mlr.press/v235/lin24t.html}, abstract = {Data valuation quantifies the contribution of each data point to the performance of a machine learning model. Existing works typically define the value of data by its improvement of the validation performance of the trained model. However, this approach can be impractical to apply in collaborative machine learning and data marketplace since it is difficult for the parties/buyers to agree on a common validation dataset or determine the exact validation distribution a priori. To address this, we propose a distributionally robust data valuation approach to perform data valuation without known/fixed validation distributions. Our approach defines the value of data by its improvement of the distributionally robust generalization error (DRGE), thus providing a worst-case performance guarantee without a known/fixed validation distribution. However, since computing DRGE directly is infeasible, we propose using model deviation as a proxy for the marginal improvement of DRGE (for kernel regression and neural networks) to compute data values. Furthermore, we identify a notion of uniqueness where low uniqueness characterizes low-value data. We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction).} }
Endnote
%0 Conference Paper %T Distributionally Robust Data Valuation %A Xiaoqiang Lin %A Xinyi Xu %A Zhaoxuan Wu %A See-Kiong Ng %A Bryan Kian Hsiang Low %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-lin24t %I PMLR %P 30362--30391 %U https://proceedings.mlr.press/v235/lin24t.html %V 235 %X Data valuation quantifies the contribution of each data point to the performance of a machine learning model. Existing works typically define the value of data by its improvement of the validation performance of the trained model. However, this approach can be impractical to apply in collaborative machine learning and data marketplace since it is difficult for the parties/buyers to agree on a common validation dataset or determine the exact validation distribution a priori. To address this, we propose a distributionally robust data valuation approach to perform data valuation without known/fixed validation distributions. Our approach defines the value of data by its improvement of the distributionally robust generalization error (DRGE), thus providing a worst-case performance guarantee without a known/fixed validation distribution. However, since computing DRGE directly is infeasible, we propose using model deviation as a proxy for the marginal improvement of DRGE (for kernel regression and neural networks) to compute data values. Furthermore, we identify a notion of uniqueness where low uniqueness characterizes low-value data. We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction).
APA
Lin, X., Xu, X., Wu, Z., Ng, S. & Low, B.K.H.. (2024). Distributionally Robust Data Valuation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:30362-30391 Available from https://proceedings.mlr.press/v235/lin24t.html.

Related Material