In all likelihoods: robust selection of pseudo-labeled data

Julian Rodemann, Christoph Jansen, Georg Schollmeyer, Thomas Augustin
Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications, PMLR 215:412-425, 2023.

Abstract

Self-training is a simple yet effective method within semi-supervised learning. Self-training’s rationale is to iteratively enhance the training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (PLS). In this paper, we render PLS more robust with respect to the involved modeling assumptions. To this end, we treat PLS as a decision problem, which allows us to introduce a generalized utility function. The idea is to select pseudo-labeled data that maximize a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors, and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian $\alpha$-cut updating rule for credal sets. We spotlight the application of three of our robust extensions on simulated data as well as on three real-world data sets. In a benchmarking study, we compare these extensions to traditional PLS methods. Results suggest that robustness with regard to model choice can lead to substantial accuracy gains.
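
To make the selection idea concrete, below is a minimal, hypothetical Python sketch of robust PLS with respect to model choice: each unlabeled candidate is scored under several plausible models, and the candidate with the largest worst-case score is added next. The model list, the confidence-based utility, and all variable names are illustrative assumptions; the paper's generalized multi-objective utility is more general than this.

```python
# Hypothetical sketch: robust (max-min over models) selection of one
# pseudo-labeled point. The utility here is simply the predicted-class
# probability; this is an assumption for illustration, not the paper's
# exact generalized utility function.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 2))
y_lab = (X_lab[:, 0] + X_lab[:, 1] > 0).astype(int)   # toy labeled data
X_unl = rng.normal(size=(200, 2))                      # unlabeled pool

candidate_models = [LogisticRegression(), GaussianNB(),
                    DecisionTreeClassifier(max_depth=3)]

# Utility of pseudo-labeling point i under model m: its top class probability.
utilities = []
for m in candidate_models:
    m.fit(X_lab, y_lab)
    utilities.append(m.predict_proba(X_unl).max(axis=1))
utilities = np.vstack(utilities)          # shape (n_models, n_unlabeled)

# Multi-objective aggregation: worst case over models (robust to model choice).
robust_utility = utilities.min(axis=0)
best = int(np.argmax(robust_utility))

# Pseudo-label the chosen point with the model most confident about it.
m_star = candidate_models[int(np.argmax(utilities[:, best]))]
pseudo_label = int(m_star.predict(X_unl[best:best + 1])[0])
print(f"add point {best} with pseudo-label {pseudo_label}")
```

The $\alpha$-cut idea mentioned in the abstract can be sketched in the same spirit: before taking the worst case, discard models whose likelihood on the labeled data falls below a fraction $\alpha$ of the best model's likelihood. Again, this is a loose illustration of an $\alpha$-cut-style pruning, not the paper's exact generalized Bayesian updating rule for credal sets.

```python
# Hypothetical alpha-cut pruning: keep only models whose relative likelihood
# on the labeled data is at least alpha, then aggregate over the retained set.
from sklearn.metrics import log_loss

alpha = 0.5
log_liks = np.array([-log_loss(y_lab, m.predict_proba(X_lab), labels=[0, 1]) * len(y_lab)
                     for m in candidate_models])
keep = np.exp(log_liks - log_liks.max()) >= alpha
robust_utility_cut = utilities[keep].min(axis=0)
```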

Cite this Paper


BibTeX
@InProceedings{pmlr-v215-rodemann23a,
  title     = {In all likelihoods: robust selection of pseudo-labeled data},
  author    = {Rodemann, Julian and Jansen, Christoph and Schollmeyer, Georg and Augustin, Thomas},
  booktitle = {Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications},
  pages     = {412--425},
  year      = {2023},
  editor    = {Miranda, Enrique and Montes, Ignacio and Quaeghebeur, Erik and Vantaggi, Barbara},
  volume    = {215},
  series    = {Proceedings of Machine Learning Research},
  month     = {11--14 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v215/rodemann23a/rodemann23a.pdf},
  url       = {https://proceedings.mlr.press/v215/rodemann23a.html},
  abstract  = {Self-training is a simple yet effective method within semi-supervised learning. Self-training’s rationale is to iteratively enhance training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (PLS). In this paper, we render PLS more robust towards the involved modeling assumptions. To this end, we treat PLS as a decision problem, which allows us to introduce a generalized utility function. The idea is to select pseudo-labeled data that maximize a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian $\alpha$-cut updating rule for credal sets. We spotlight the application of three of our robust extensions on both simulated and three real-world data sets. In a benchmarking study, we compare these extensions to traditional PLS methods. Results suggest that robustness with regard to model choice can lead to substantial accuracy gains.}
}
Endnote
%0 Conference Paper
%T In all likelihoods: robust selection of pseudo-labeled data
%A Julian Rodemann
%A Christoph Jansen
%A Georg Schollmeyer
%A Thomas Augustin
%B Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications
%C Proceedings of Machine Learning Research
%D 2023
%E Enrique Miranda
%E Ignacio Montes
%E Erik Quaeghebeur
%E Barbara Vantaggi
%F pmlr-v215-rodemann23a
%I PMLR
%P 412--425
%U https://proceedings.mlr.press/v215/rodemann23a.html
%V 215
%X Self-training is a simple yet effective method within semi-supervised learning. Self-training’s rationale is to iteratively enhance training data by adding pseudo-labeled data. Its generalization performance heavily depends on the selection of these pseudo-labeled data (PLS). In this paper, we render PLS more robust towards the involved modeling assumptions. To this end, we treat PLS as a decision problem, which allows us to introduce a generalized utility function. The idea is to select pseudo-labeled data that maximize a multi-objective utility function. We demonstrate that the latter can be constructed to account for different sources of uncertainty and explore three examples: model selection, accumulation of errors and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian $\alpha$-cut updating rule for credal sets. We spotlight the application of three of our robust extensions on both simulated and three real-world data sets. In a benchmarking study, we compare these extensions to traditional PLS methods. Results suggest that robustness with regard to model choice can lead to substantial accuracy gains.
APA
Rodemann, J., Jansen, C., Schollmeyer, G. & Augustin, T. (2023). In all likelihoods: robust selection of pseudo-labeled data. Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications, in Proceedings of Machine Learning Research 215:412-425. Available from https://proceedings.mlr.press/v215/rodemann23a.html.