Domain-wise Data Acquisition to Improve Performance under Distribution Shift

Yue He, Dongbai Li, Pengfei Tian, Han Yu, Jiashuo Liu, Hao Zou, Peng Cui
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:17934-17945, 2024.

Abstract

Despite notable progress in enhancing the capability of machine learning against distribution shifts, training data quality remains a bottleneck for cross-distribution generalization. Recently, from a data-centric perspective, there have been considerable efforts to improve model performance by refining the preparation of training data. Inspired by realistic scenarios, this paper addresses the practical requirement of acquiring training samples from various domains on a limited budget to facilitate model generalization to a target test domain under distribution shift. Our empirical evidence indicates that advances in data acquisition can significantly benefit model performance on shifted data. Additionally, by leveraging unlabeled test domain data, we introduce a Domain-wise Active Acquisition framework. This framework iteratively optimizes the data acquisition strategy as training samples are accumulated, theoretically ensuring an effective approximation of the test distribution. Extensive real-world experiments demonstrate our proposal’s advantages in machine learning applications. The code is available at https://github.com/dongbaili/DAA.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-he24f,
  title = {Domain-wise Data Acquisition to Improve Performance under Distribution Shift},
  author = {He, Yue and Li, Dongbai and Tian, Pengfei and Yu, Han and Liu, Jiashuo and Zou, Hao and Cui, Peng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {17934--17945},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/he24f/he24f.pdf},
  url = {https://proceedings.mlr.press/v235/he24f.html},
  abstract = {Despite notable progress in enhancing the capability of machine learning against distribution shifts, training data quality remains a bottleneck for cross-distribution generalization. Recently, from a data-centric perspective, there have been considerable efforts to improve model performance through refining the preparation of training data. Inspired by realistic scenarios, this paper addresses a practical requirement of acquiring training samples from various domains on a limited budget to facilitate model generalization to target test domain with distribution shift. Our empirical evidence indicates that the advance in data acquisition can significantly benefit the model performance on shifted data. Additionally, by leveraging unlabeled test domain data, we introduce a Domain-wise Active Acquisition framework. This framework iteratively optimizes the data acquisition strategy as training samples are accumulated, theoretically ensuring the effective approximation of test distribution. Extensive real-world experiments demonstrate our proposal’s advantages in machine learning applications. The code is available at https://github.com/dongbaili/DAA.}
}
Endnote
%0 Conference Paper
%T Domain-wise Data Acquisition to Improve Performance under Distribution Shift
%A Yue He
%A Dongbai Li
%A Pengfei Tian
%A Han Yu
%A Jiashuo Liu
%A Hao Zou
%A Peng Cui
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-he24f
%I PMLR
%P 17934--17945
%U https://proceedings.mlr.press/v235/he24f.html
%V 235
%X Despite notable progress in enhancing the capability of machine learning against distribution shifts, training data quality remains a bottleneck for cross-distribution generalization. Recently, from a data-centric perspective, there have been considerable efforts to improve model performance through refining the preparation of training data. Inspired by realistic scenarios, this paper addresses a practical requirement of acquiring training samples from various domains on a limited budget to facilitate model generalization to target test domain with distribution shift. Our empirical evidence indicates that the advance in data acquisition can significantly benefit the model performance on shifted data. Additionally, by leveraging unlabeled test domain data, we introduce a Domain-wise Active Acquisition framework. This framework iteratively optimizes the data acquisition strategy as training samples are accumulated, theoretically ensuring the effective approximation of test distribution. Extensive real-world experiments demonstrate our proposal’s advantages in machine learning applications. The code is available at https://github.com/dongbaili/DAA.
APA
He, Y., Li, D., Tian, P., Yu, H., Liu, J., Zou, H. & Cui, P. (2024). Domain-wise Data Acquisition to Improve Performance under Distribution Shift. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:17934-17945. Available from https://proceedings.mlr.press/v235/he24f.html.
