Adaptive Data Collection for Robust Learning Across Multiple Distributions

Chengbo Zang, Mehmet Kerem Turkcan, Gil Zussman, Zoran Kostic, Javad Ghaderi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:73974-73994, 2025.

Abstract

We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection and updates the model parameters accordingly. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.
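To make the round-by-round procedure described in the abstract concrete, the following is a minimal, hypothetical sketch of a UCB-style source-selection rule combined with an online gradient descent update, written against a fixed budget of T rounds over K sources. The function names, the loss/gradient interface, the confidence-bonus constant c, and the step-size schedule are illustrative assumptions and are not taken from the paper.

# Hypothetical sketch of a UCB + OGD loop: pick the source whose
# upper-confidence loss estimate is largest, collect one sample from it,
# and take an online gradient step. All names and constants are illustrative.
import numpy as np

def ucb_ogd(sources, loss_and_grad, theta0, T, eta=0.1, c=1.0):
    """sources: list of K callables, each returning one sampled data point.
    loss_and_grad(theta, x): returns (loss, gradient) of the model on sample x.
    theta0: initial model parameters; T: total data collection budget (rounds)."""
    K = len(sources)
    theta = np.array(theta0, dtype=float)
    counts = np.zeros(K)        # number of times each source has been sampled
    mean_loss = np.zeros(K)     # running average loss observed per source

    for t in range(1, T + 1):
        if t <= K:
            k = t - 1           # sample each source once to initialize estimates
        else:
            bonus = c * np.sqrt(np.log(t) / counts)
            k = int(np.argmax(mean_loss + bonus))   # UCB: favor high-loss sources

        x = sources[k]()                            # collect and annotate one sample from source k
        loss, grad = loss_and_grad(theta, x)
        theta -= eta / np.sqrt(t) * grad            # OGD step with decaying step size

        counts[k] += 1
        mean_loss[k] += (loss - mean_loss[k]) / counts[k]
    return theta

In this sketch, the UCB bonus steers sampling toward the distribution with the highest estimated loss, which is one plausible way to target the worst-case source that the minimax objective cares about; the actual selection index and update rule of the paper's UCB-OGD algorithm may differ.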

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zang25a,
  title     = {Adaptive Data Collection for Robust Learning Across Multiple Distributions},
  author    = {Zang, Chengbo and Turkcan, Mehmet Kerem and Zussman, Gil and Kostic, Zoran and Ghaderi, Javad},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {73974--73994},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zang25a/zang25a.pdf},
  url       = {https://proceedings.mlr.press/v267/zang25a.html},
  abstract  = {We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection and updates the model parameters accordingly. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.}
}
Endnote
%0 Conference Paper
%T Adaptive Data Collection for Robust Learning Across Multiple Distributions
%A Chengbo Zang
%A Mehmet Kerem Turkcan
%A Gil Zussman
%A Zoran Kostic
%A Javad Ghaderi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zang25a
%I PMLR
%P 73974--73994
%U https://proceedings.mlr.press/v267/zang25a.html
%V 267
%X We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection and updates the model parameters accordingly. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.
APA
Zang, C., Turkcan, M.K., Zussman, G., Kostic, Z. & Ghaderi, J. (2025). Adaptive Data Collection for Robust Learning Across Multiple Distributions. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:73974-73994. Available from https://proceedings.mlr.press/v267/zang25a.html.