Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data

Esther Rolf, Theodora T Worledge, Benjamin Recht, Michael Jordan
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9040-9051, 2021.

Abstract

Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performances, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-rolf21a,
  title     = {Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data},
  author    = {Rolf, Esther and Worledge, Theodora T and Recht, Benjamin and Jordan, Michael},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {9040--9051},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/rolf21a/rolf21a.pdf},
  url       = {https://proceedings.mlr.press/v139/rolf21a.html},
  abstract  = {Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performances, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.}
}
Endnote
%0 Conference Paper
%T Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data
%A Esther Rolf
%A Theodora T Worledge
%A Benjamin Recht
%A Michael Jordan
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-rolf21a
%I PMLR
%P 9040--9051
%U https://proceedings.mlr.press/v139/rolf21a.html
%V 139
%X Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performances, but also to achieving population-level objectives. Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
APA
Rolf, E., Worledge, T.T., Recht, B. & Jordan, M. (2021). Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:9040-9051. Available from https://proceedings.mlr.press/v139/rolf21a.html.