Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget (extended abstract)

Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:3184-3184, 2026.

Abstract

Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities—for example, health markers, demographics, or political affiliations—and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. Moreover, while group proportions are often known at the source and population levels, individual group membership may only be revealed after data collection. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to “match” the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size—the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratified estimator, which is able to leverage large but systematically biased datasets, and upper bound its risk. We provide lower bounds with exactly matching leading terms, proving that our approach achieves the budgeted minimax-optimal risk up to additive lower order terms. Our techniques also extend to prediction problems when minimizing the excess risk. In this setting, we pair the effective-sample-size-maximizing sampling plan with a weighted empirical risk minimzer and upper bound its risk. A key contribution to this end is the development of a general information-theoretic lower bound framework for prediction problems under possibly differing source and target distributions, which may be of independent interest outside of this work. We apply this framework to the case of binary classification to establish a lower bound which matches the upper bound of our proposed approach up to a $\sqrt{K/q_{\min}}$-factor, where $K$ is the number of groups and $q_{\min}$ is the smallest group identity probability under the target distribution $q\,$. Our framework enables us to respect the information geometry of the problem via dependence on $D_{\chi^2}(q\mid\mid\overline{p})$ that matches the upper bound, which was not possible with existing techniques.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-harding26a, title = {Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget (extended abstract)}, author = {Harding, Michael O. and Singh, Vikas and Kandasamy, Kirthevasan}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {3184--3184}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/harding26a/harding26a.pdf}, url = {https://proceedings.mlr.press/v336/harding26a.html}, abstract = {Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities—for example, health markers, demographics, or political affiliations—and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. Moreover, while group proportions are often known at the source and population levels, individual group membership may only be revealed after data collection. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to “match” the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size—the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratified estimator, which is able to leverage large but systematically biased datasets, and upper bound its risk. We provide lower bounds with exactly matching leading terms, proving that our approach achieves the budgeted minimax-optimal risk up to additive lower order terms. Our techniques also extend to prediction problems when minimizing the excess risk. In this setting, we pair the effective-sample-size-maximizing sampling plan with a weighted empirical risk minimzer and upper bound its risk. A key contribution to this end is the development of a general information-theoretic lower bound framework for prediction problems under possibly differing source and target distributions, which may be of independent interest outside of this work. We apply this framework to the case of binary classification to establish a lower bound which matches the upper bound of our proposed approach up to a $\sqrt{K/q_{\min}}$-factor, where $K$ is the number of groups and $q_{\min}$ is the smallest group identity probability under the target distribution $q\,$. Our framework enables us to respect the information geometry of the problem via dependence on $D_{\chi^2}(q\mid\mid\overline{p})$ that matches the upper bound, which was not possible with existing techniques.} }
Endnote
%0 Conference Paper %T Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget (extended abstract) %A Michael O. Harding %A Vikas Singh %A Kirthevasan Kandasamy %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-harding26a %I PMLR %P 3184--3184 %U https://proceedings.mlr.press/v336/harding26a.html %V 336 %X Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities—for example, health markers, demographics, or political affiliations—and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. Moreover, while group proportions are often known at the source and population levels, individual group membership may only be revealed after data collection. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to “match” the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size—the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratified estimator, which is able to leverage large but systematically biased datasets, and upper bound its risk. We provide lower bounds with exactly matching leading terms, proving that our approach achieves the budgeted minimax-optimal risk up to additive lower order terms. Our techniques also extend to prediction problems when minimizing the excess risk. In this setting, we pair the effective-sample-size-maximizing sampling plan with a weighted empirical risk minimzer and upper bound its risk. A key contribution to this end is the development of a general information-theoretic lower bound framework for prediction problems under possibly differing source and target distributions, which may be of independent interest outside of this work. We apply this framework to the case of binary classification to establish a lower bound which matches the upper bound of our proposed approach up to a $\sqrt{K/q_{\min}}$-factor, where $K$ is the number of groups and $q_{\min}$ is the smallest group identity probability under the target distribution $q\,$. Our framework enables us to respect the information geometry of the problem via dependence on $D_{\chi^2}(q\mid\mid\overline{p})$ that matches the upper bound, which was not possible with existing techniques.
APA
Harding, M.O., Singh, V. & Kandasamy, K.. (2026). Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget (extended abstract). Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:3184-3184 Available from https://proceedings.mlr.press/v336/harding26a.html.

Related Material