Aggregating Data for Optimal Learning

Sushant Agarwal, Yukti Makhija, Rishi Saket, Aravindan Raghuveer
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1-30, 2025.

Abstract

Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are useful learning frameworks, where the training data is partitioned into disjoint sets or bags, and only an aggregate label, i.e., a bag-label for each bag, is available to the learner. In the case of MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag's labels. In this paper, we study, for various loss functions in MIR and LLP, the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We theoretically provide utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors and/or the labels with respect to natural objectives such as $k$-means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.
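
To make the aggregation setup concrete: in LLP, the bag-label of a bag $B$ is the mean $\frac{1}{|B|}\sum_{i \in B} y_i$, whereas in MIR it is the label $y_j$ of a single undisclosed instance $j \in B$. The snippet below is a minimal, hypothetical sketch (not the paper's algorithm or its guarantees): it forms bags by $k$-means clustering of the feature vectors on synthetic linear data, releases only LLP-style bag-labels, and fits a linear regressor on the bag-level aggregates. The data sizes, number of bags, and scikit-learn calls are illustrative assumptions.

# Illustrative sketch only: k-means bagging with LLP-style label aggregation
# on synthetic data, followed by bag-level linear regression.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: n instances, d features, noisy linear labels (hypothetical setup).
n, d, k = 1000, 5, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Bagging strategy in the spirit of the abstract: cluster the feature
# vectors with k-means and treat each cluster as a bag.
bags = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# LLP aggregation: each bag reveals only the mean of its instances' labels.
X_bag = np.vstack([X[bags == b].mean(axis=0) for b in range(k)])
y_bag = np.array([y[bags == b].mean() for b in range(k)])

# Downstream task: fit a linear regressor on the aggregated bag-level data
# and compare the recovered parameters to the ground truth.
model = LinearRegression().fit(X_bag, y_bag)
print("parameter error:", np.linalg.norm(model.coef_ - w_true))
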

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-agarwal25a,
  title = {Aggregating Data for Optimal Learning},
  author = {Agarwal, Sushant and Makhija, Yukti and Saket, Rishi and Raghuveer, Aravindan},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages = {1--30},
  year = {2025},
  editor = {Chiappa, Silvia and Magliacane, Sara},
  volume = {286},
  series = {Proceedings of Machine Learning Research},
  month = {21--25 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/agarwal25a/agarwal25a.pdf},
  url = {https://proceedings.mlr.press/v286/agarwal25a.html},
  abstract = {Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are useful learning frameworks, where the training data is partitioned into disjoint sets or bags, and only an aggregate label, i.e., bag-label for each bag is available to the learner. In the case of MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag’s labels. In this paper, we study for various loss functions in MIR and LLP, what is the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We theoretically provide utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors and/or the labels with respect to natural objectives such as $k$-means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.}
}
Endnote
%0 Conference Paper
%T Aggregating Data for Optimal Learning
%A Sushant Agarwal
%A Yukti Makhija
%A Rishi Saket
%A Aravindan Raghuveer
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-agarwal25a
%I PMLR
%P 1--30
%U https://proceedings.mlr.press/v286/agarwal25a.html
%V 286
%X Multiple Instance Regression (MIR) and Learning from Label Proportions (LLP) are useful learning frameworks, where the training data is partitioned into disjoint sets or bags, and only an aggregate label, i.e., bag-label for each bag is available to the learner. In the case of MIR, the bag-label is the label of an undisclosed instance from the bag, while in LLP, the bag-label is the mean of the bag’s labels. In this paper, we study for various loss functions in MIR and LLP, what is the optimal way to partition the dataset into bags such that the utility for downstream tasks like linear regression is maximized. We theoretically provide utility guarantees, and show that in each case, the optimal bagging strategy (approximately) reduces to finding an optimal clustering of the feature vectors and/or the labels with respect to natural objectives such as $k$-means. We also show that our bagging mechanisms can be made label-differentially private, incurring an additional utility error. We then generalize our results to the setting of Generalized Linear Models (GLMs). Finally, we experimentally validate our theoretical results.
APA
Agarwal, S., Makhija, Y., Saket, R. & Raghuveer, A. (2025). Aggregating Data for Optimal Learning. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1-30. Available from https://proceedings.mlr.press/v286/agarwal25a.html.
