Optimal Multi-Distribution Learning

Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S Du, Jason D Lee
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:5220-5223, 2024.

Abstract

Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $\frac{d+k}{\varepsilon^2}$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory have been further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, accessing the hypothesis class solely through an empirical risk minimization (ERM) oracle. Additionally, we establish the necessity of randomization, unveiling a large sample size barrier when only deterministic hypotheses are permitted. These findings successfully resolve three open problems presented in COLT 2023 (i.e., Problems 1, 3 and 4 of Awasthi et al. 2023).
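
To make the objective concrete, here is a minimal formalization consistent with the abstract (the risk functionals $R_i$ and the loss $\ell$ are shorthand introduced here, not notation taken from the paper): given data distributions $\mathcal{D}_1, \ldots, \mathcal{D}_k$ and a hypothesis class $\mathcal{H}$ of VC dimension $d$, MDL seeks a randomized hypothesis, i.e., a distribution $p$ over $\mathcal{H}$, solving

$$\min_{p \in \Delta(\mathcal{H})} \; \max_{1 \le i \le k} \; \mathbb{E}_{h \sim p}\big[R_i(h)\big], \qquad R_i(h) := \mathbb{E}_{(x,y) \sim \mathcal{D}_i}\big[\ell(h(x), y)\big].$$

A randomized hypothesis $\hat{p}$ is $\varepsilon$-optimal if its worst-case risk $\max_{i} \mathbb{E}_{h \sim \hat{p}}[R_i(h)]$ exceeds this minimax value by at most $\varepsilon$; per the abstract, the proposed algorithm produces such a $\hat{p}$ from on the order of $\frac{d+k}{\varepsilon^2}$ samples (up to logarithmic factors), drawn adaptively across the $k$ distributions.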

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-zhang24b,
  title     = {Optimal Multi-Distribution Learning},
  author    = {Zhang, Zihan and Zhan, Wenhao and Chen, Yuxin and Du, Simon S and Lee, Jason D},
  booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages     = {5220--5223},
  year      = {2024},
  editor    = {Agrawal, Shipra and Roth, Aaron},
  volume    = {247},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--03 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v247/zhang24b/zhang24b.pdf},
  url       = {https://proceedings.mlr.press/v247/zhang24b.html}
}
Endnote
%0 Conference Paper
%T Optimal Multi-Distribution Learning
%A Zihan Zhang
%A Wenhao Zhan
%A Yuxin Chen
%A Simon S Du
%A Jason D Lee
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth
%F pmlr-v247-zhang24b
%I PMLR
%P 5220--5223
%U https://proceedings.mlr.press/v247/zhang24b.html
%V 247
APA
Zhang, Z., Zhan, W., Chen, Y., Du, S.S. & Lee, J.D. (2024). Optimal Multi-Distribution Learning. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:5220-5223. Available from https://proceedings.mlr.press/v247/zhang24b.html.