Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

Mayee Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Re
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3286-3294, 2021.

Abstract

Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios—well-specified, misspecified, and corrected models—to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
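The labeled-versus-unlabeled comparison in the abstract can be illustrated with a minimal synthetic sketch. This is not code from the paper: it uses the standard conditional-independence method-of-moments identity (the "triplet" trick common in weak supervision), in which the pairwise agreement rates of three independent weak sources identify each source's accuracy without labels; the labeled estimator simply averages agreement with the true label directly. All names and the data-generating process here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.choice([-1, 1], size=n)            # latent binary label, balanced
accs = np.array([0.9, 0.8, 0.7])           # P(source i agrees with y), assumed
# Each weak source outputs y with probability accs[i], else -y,
# independently across sources given y (the model's key assumption).
L = np.where(rng.random((n, 3)) < accs, y[:, None], -y[:, None])

# Labeled estimate: directly average agreement l_i * y,
# which estimates m_i = E[l_i y] = 2*acc_i - 1.
labeled_est = (L * y[:, None]).mean(axis=0)

# Unlabeled (method-of-moments) estimate: conditional independence gives
# E[l_i l_j] = m_i m_j, so each m_i is recovered from pairwise moments alone.
M = (L.T @ L) / n
mom_est = np.array([
    np.sqrt(M[0, 1] * M[0, 2] / M[1, 2]),
    np.sqrt(M[0, 1] * M[1, 2] / M[0, 2]),
    np.sqrt(M[0, 2] * M[1, 2] / M[0, 1]),
])

print(labeled_est)  # both should be near 2*accs - 1 = [0.8, 0.6, 0.4]
print(mom_est)
```

If the conditional-independence assumption is violated (the misspecified case the abstract discusses), the pairwise-moment identity no longer holds exactly and the unlabeled estimator picks up the additional bias, while the labeled estimator does not.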

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-chen21g,
  title     = {Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation},
  author    = {Chen, Mayee and Cohen-Wang, Benjamin and Mussmann, Stephen and Sala, Frederic and Re, Christopher},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3286--3294},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/chen21g/chen21g.pdf},
  url       = {http://proceedings.mlr.press/v130/chen21g.html},
  abstract  = {Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios—well-specified, misspecified, and corrected models—to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.}
}
Endnote
%0 Conference Paper
%T Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation
%A Mayee Chen
%A Benjamin Cohen-Wang
%A Stephen Mussmann
%A Frederic Sala
%A Christopher Re
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-chen21g
%I PMLR
%P 3286--3294
%U http://proceedings.mlr.press/v130/chen21g.html
%V 130
%X Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification in method-of-moments latent variable estimation. Our core result is a bias-variance decomposition of the generalization error, which shows that the unlabeled-only approach incurs additional bias under misspecification. We then introduce a correction that provably removes this bias in certain cases. We apply our decomposition framework to three scenarios—well-specified, misspecified, and corrected models—to 1) choose between labeled and unlabeled data and 2) learn from their combination. We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points. With misspecification, however, their relative value is higher due to the additional bias but can be reduced with correction. We also apply our approach to study real-world weak supervision techniques for dataset construction.
APA
Chen, M., Cohen-Wang, B., Mussmann, S., Sala, F. & Re, C. (2021). Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3286-3294. Available from http://proceedings.mlr.press/v130/chen21g.html.