2D-Shapley: A Framework for Fragmented Data Valuation

Zhihong Liu, Hoang Anh Just, Xiangyu Chang, Xi Chen, Ruoxi Jia
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:21730-21755, 2023.

Abstract

Data valuation, which quantifies the contribution of individual data sources to certain predictive behaviors of a model, is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on valuing data sources that share the same feature space or the same sample space. How to value fragmented data sources, each of which contains only a subset of the features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on this counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, interpreting sample-wise data values, and diagnosing fine-grained data issues.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-liu23s,
  title = {2{D}-Shapley: A Framework for Fragmented Data Valuation},
  author = {Liu, Zhihong and Just, Hoang Anh and Chang, Xiangyu and Chen, Xi and Jia, Ruoxi},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {21730--21755},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/liu23s/liu23s.pdf},
  url = {https://proceedings.mlr.press/v202/liu23s.html},
  abstract = {Data valuation---quantifying the contribution of individual data sources to certain predictive behaviors of a model---is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources with the shared feature or sample space. How to valuate fragmented data sources of which each only contains partial features and samples remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.}
}
Endnote
%0 Conference Paper
%T 2D-Shapley: A Framework for Fragmented Data Valuation
%A Zhihong Liu
%A Hoang Anh Just
%A Xiangyu Chang
%A Xi Chen
%A Ruoxi Jia
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-liu23s
%I PMLR
%P 21730--21755
%U https://proceedings.mlr.press/v202/liu23s.html
%V 202
%X Data valuation—quantifying the contribution of individual data sources to certain predictive behaviors of a model—is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources with the shared feature or sample space. How to valuate fragmented data sources of which each only contains partial features and samples remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
APA
Liu, Z., Just, H. A., Chang, X., Chen, X., & Jia, R. (2023). 2D-Shapley: A Framework for Fragmented Data Valuation. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:21730-21755. Available from https://proceedings.mlr.press/v202/liu23s.html.
