Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees

Alessio Mazzetto, Dylan Sam, Andrew Park, Eli Upfal, Stephen Bach
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3196-3204, 2021.

Abstract

We develop a novel method that provides theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).
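
To make the subset-selection idea concrete, below is a minimal sketch in Python. It is not the paper's algorithm and carries none of its distribution-free guarantees: it simply brute-forces small odd-size subsets of weak labelers and keeps the one whose majority vote has the lowest empirical error on the small labeled sample. The helper names (majority_vote, select_subset), the synthetic labeler accuracies, and the fixed subset size of 3 are all illustrative assumptions.

# Minimal sketch (not the authors' method): pick a small odd subset of weak
# labelers whose majority vote minimizes error on a small labeled sample.
# Labels are in {-1, +1}; `votes` has shape (n_labelers, n_examples).
from itertools import combinations
import numpy as np

def majority_vote(votes):
    """Majority vote across labelers; ties (impossible for odd subsets) go to +1."""
    return np.where(votes.sum(axis=0) >= 0, 1, -1)

def select_subset(votes_labeled, y_labeled, subset_size=3):
    """Brute-force all subsets of the given odd size and return the one with
    the lowest empirical majority-vote error on the labeled sample."""
    n_labelers = votes_labeled.shape[0]
    best_subset, best_err = None, np.inf
    for subset in combinations(range(n_labelers), subset_size):
        pred = majority_vote(votes_labeled[list(subset)])
        err = np.mean(pred != y_labeled)
        if err < best_err:
            best_subset, best_err = subset, err
    return best_subset, best_err

# Example usage with synthetic weak labelers of varying accuracy.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=500)                    # true labels
accs = np.array([0.9, 0.85, 0.6, 0.55, 0.8])         # per-labeler accuracies
correct = rng.random((5, 500)) < accs[:, None]
votes = np.where(correct, y, -y)                     # weak labeler outputs
subset, err = select_subset(votes[:, :100], y[:100]) # 100 labeled examples
pred = majority_vote(votes[list(subset)])            # aggregate on all data

Note that restricting the search to small subsets keeps the brute force cheap: with n labelers and subsets of size 3 there are only n-choose-3 candidates, which is polynomial in n. This sketch also ignores the dependence structure between labelers, which is precisely what the paper's guarantees account for.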

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-mazzetto21a,
  title     = {Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees},
  author    = {Mazzetto, Alessio and Sam, Dylan and Park, Andrew and Upfal, Eli and Bach, Stephen},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3196--3204},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/mazzetto21a/mazzetto21a.pdf},
  url       = {https://proceedings.mlr.press/v130/mazzetto21a.html},
  abstract  = {We develop a novel method that provides theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).}
}
Endnote
%0 Conference Paper
%T Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees
%A Alessio Mazzetto
%A Dylan Sam
%A Andrew Park
%A Eli Upfal
%A Stephen Bach
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-mazzetto21a
%I PMLR
%P 3196--3204
%U https://proceedings.mlr.press/v130/mazzetto21a.html
%V 130
%X We develop a novel method that provides theoretical guarantees for learning from weak labelers without the (mostly unrealistic) assumption that the errors of the weak labelers are independent or come from a particular family of distributions. We show a rigorous technique for efficiently selecting small subsets of the labelers so that a majority vote from such subsets has a provably low error rate. We explore several extensions of this method and provide experimental results over a range of labeled data set sizes on 45 image classification tasks. Our performance-guaranteed methods consistently match the best performing alternative, which varies based on problem difficulty. On tasks with accurate weak labelers, our methods are on average 3 percentage points more accurate than the state-of-the-art adversarial method. On tasks with inaccurate weak labelers, our methods are on average 15 percentage points more accurate than the semi-supervised Dawid-Skene model (which assumes independence).
APA
Mazzetto, A., Sam, D., Park, A., Upfal, E. & Bach, S. (2021). Semi-Supervised Aggregation of Dependent Weak Supervision Sources With Performance Guarantees. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3196-3204. Available from https://proceedings.mlr.press/v130/mazzetto21a.html.