Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods

Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, Christopher Re
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:3280-3291, 2020.

Abstract

Weak supervision is a popular method for building machine learning models without relying on ground truth annotations. Instead, it generates probabilistic training labels by estimating the accuracies of multiple noisy labeling sources (e.g., heuristics, crowd workers). Existing approaches use latent variable estimation to model the noisy sources, but these methods can be computationally expensive, scaling superlinearly in the data. In this work, we show that, for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters, obviating the need for iterative solutions like stochastic gradient descent (SGD). We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches and requires fewer assumptions. In particular, we prove bounds on generalization error without assuming that the latent variable model can exactly parameterize the underlying data distribution. Empirically, we validate FlyingSquid on benchmark weak supervision datasets and find that it achieves the same or higher quality compared to previous approaches without the need to tune an SGD procedure, recovers model parameters 170 times faster on average, and enables new video analysis and online learning applications.
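
As an illustration of what a closed-form estimate can look like, the sketch below uses a triplet identity for binary sources: if votes λ_i in {-1, +1} are conditionally independent given a hidden label Y in {-1, +1}, then E[λ_i λ_j] = E[λ_i Y] E[λ_j Y], so each source's accuracy magnitude follows from three pairwise agreement rates with no iterative optimization. This is a minimal sketch under those assumptions, not the FlyingSquid API; the function names, the simulated data, and the assumption that every source is better than random (which resolves the sign) are all hypothetical choices made for the example.

```python
import math
import random


def simulate_votes(accuracies, n, seed=0):
    """Simulate n rows of votes in {-1, +1} (hypothetical test data).

    Each row has a hidden balanced label Y; source i reports Y with
    probability accuracies[i], independently of the other sources.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows


def mean_product(rows, i, j):
    """Empirical estimate of E[lambda_i * lambda_j]; needs no labels."""
    return sum(r[i] * r[j] for r in rows) / len(rows)


def triplet_accuracy(rows, i, j, k):
    """Closed-form estimate of |E[lambda_i * Y]| from sources i, j, k.

    Uses E[l_i l_j] * E[l_i l_k] / E[l_j l_k] = E[l_i Y]^2, which holds
    under conditional independence given Y.
    """
    m_ij = mean_product(rows, i, j)
    m_ik = mean_product(rows, i, k)
    m_jk = mean_product(rows, j, k)
    return math.sqrt(abs(m_ij * m_ik / m_jk))


true_p = [0.9, 0.8, 0.7]  # assumed per-source probability of a correct vote
rows = simulate_votes(true_p, n=50_000)

# Convert E[lambda_i * Y] = 2 * p_i - 1 back to p_i, assuming p_i > 0.5.
est_p = [
    (1 + triplet_accuracy(rows, i, (i + 1) % 3, (i + 2) % 3)) / 2
    for i in range(3)
]
print([round(p, 3) for p in est_p])
```

The estimate needs only one pass over the votes to compute the pairwise means, which is the kind of cost profile that makes avoiding SGD attractive. The division fails if two sources are uncorrelated (m_jk near zero), so a practical version would need to guard that case.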

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-fu20a,
  title = {Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods},
  author = {Fu, Daniel and Chen, Mayee and Sala, Frederic and Hooper, Sarah and Fatahalian, Kayvon and Re, Christopher},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {3280--3291},
  year = {2020},
  editor = {Daumé III, Hal and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/fu20a/fu20a.pdf},
  url = {http://proceedings.mlr.press/v119/fu20a.html}
}
Endnote
%0 Conference Paper
%T Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods
%A Daniel Fu
%A Mayee Chen
%A Frederic Sala
%A Sarah Hooper
%A Kayvon Fatahalian
%A Christopher Re
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-fu20a
%I PMLR
%P 3280--3291
%U http://proceedings.mlr.press/v119/fu20a.html
%V 119
APA
Fu, D., Chen, M., Sala, F., Hooper, S., Fatahalian, K. & Re, C. (2020). Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:3280-3291. Available from http://proceedings.mlr.press/v119/fu20a.html.
