Are labels informative in semi-supervised learning? Estimating and leveraging the missing-data mechanism.

Aude Sportisse, Hugo Schmutz, Olivier Humbert, Charles Bouveyron, Pierre-Alexandre Mattei
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:32521-32539, 2023.

Abstract

Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of “informative” labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
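As a rough illustration of why class-dependent ("informative") labeling biases estimates and how inverse propensity weighting can correct it, the sketch below simulates two balanced classes where one class is labeled far more often than the other. It is a generic, self-normalized IPW example and not the paper's estimator: the function name `ipw_class_proportions`, the simulated data, and the labeling probabilities are all hypothetical, and the propensities are assumed known here, whereas the paper estimates the missing-data mechanism.

```python
import numpy as np

def ipw_class_proportions(y, labeled, propensity):
    """Self-normalized IPW estimate of class proportions.

    y          : integer class of each sample (only entries with labeled=True are used)
    labeled    : boolean mask, True where the label is observed
    propensity : per-class probability of being labeled (assumed known in this sketch)
    """
    n_classes = len(propensity)
    # Weight 1/propensity for labeled samples, 0 for unlabeled ones.
    weights = labeled / propensity[y]
    counts = np.bincount(y, weights=weights, minlength=n_classes)
    return counts / counts.sum()

rng = np.random.default_rng(0)
n = 100_000
y = rng.choice(2, size=n, p=[0.5, 0.5])      # balanced classes (hypothetical data)
propensity = np.array([0.8, 0.1])            # class 0 is labeled much more often
labeled = rng.random(n) < propensity[y]      # missing-not-at-random labeling

# Naive estimate: class frequencies among labeled samples only (biased).
naive = np.bincount(y[labeled], minlength=2) / labeled.sum()
# IPW estimate: reweights labeled samples to undo the labeling bias.
ipw = ipw_class_proportions(y, labeled, propensity)

print("naive:", naive)
print("ipw:  ", ipw)
```

The naive estimate over-represents the frequently labeled class, while the weighted estimate stays close to the balanced proportions. The same reweighting idea can, in principle, be applied to the labeled part of a training loss instead of to class counts.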

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-sportisse23a,
  title = {Are labels informative in semi-supervised learning? {E}stimating and leveraging the missing-data mechanism.},
  author = {Sportisse, Aude and Schmutz, Hugo and Humbert, Olivier and Bouveyron, Charles and Mattei, Pierre-Alexandre},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {32521--32539},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/sportisse23a/sportisse23a.pdf},
  url = {https://proceedings.mlr.press/v202/sportisse23a.html},
  abstract = {Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of “informative” labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.}
}
Endnote
%0 Conference Paper
%T Are labels informative in semi-supervised learning? Estimating and leveraging the missing-data mechanism.
%A Aude Sportisse
%A Hugo Schmutz
%A Olivier Humbert
%A Charles Bouveyron
%A Pierre-Alexandre Mattei
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-sportisse23a
%I PMLR
%P 32521--32539
%U https://proceedings.mlr.press/v202/sportisse23a.html
%V 202
%X Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of “informative” labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
APA
Sportisse, A., Schmutz, H., Humbert, O., Bouveyron, C. & Mattei, P.-A. (2023). Are labels informative in semi-supervised learning? Estimating and leveraging the missing-data mechanism. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:32521-32539. Available from https://proceedings.mlr.press/v202/sportisse23a.html.
