A Theoretical Characterization of Semi-supervised Learning with Self-training for Gaussian Mixture Models

Samet Oymak, Talha Cihad Gulcu
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3601-3609, 2021.

Abstract

Self-training is a classical approach in semi-supervised learning that has been successfully applied to a variety of machine learning problems. Self-training algorithms generate pseudo-labels for the unlabeled examples and progressively refine these pseudo-labels, which hopefully come to coincide with the actual labels. This work provides theoretical insights into self-training algorithms with a focus on linear classifiers. First, we provide a sample complexity analysis for Gaussian mixture models with two components. This is established via a sharp non-asymptotic characterization of the self-training iterations, which captures the evolution of the model accuracy in terms of a fixed-point iteration. Our analysis reveals the provable benefits of rejecting samples with low confidence and demonstrates how self-training iterations can gracefully improve the model accuracy. Second, we study a generalized GMM where the component means follow a distribution. We demonstrate that ridge regularization and class margin (i.e., the separation between the component means) are crucial for success, and that a lack of regularization may prevent self-training from identifying the core features in the data.
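
To make the procedure described above concrete, the following is a minimal sketch of a self-training loop for a two-component Gaussian mixture with a ridge-regularized linear classifier and confidence-based rejection of pseudo-labels. It illustrates the general idea only, not the paper's exact algorithm or analysis: the class separation, ridge penalty, confidence threshold, and iteration count below are placeholder values.

# Minimal self-training sketch on a two-component GMM (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)

# Two-component GMM: labels y in {-1, +1}, means +/- mu, isotropic Gaussian noise.
d, n_labeled, n_unlabeled = 20, 30, 2000
mu = np.zeros(d)
mu[0] = 1.5  # class separation along one coordinate (placeholder value)

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    return X, y

X_lab, y_lab = sample(n_labeled)
X_unl, y_unl = sample(n_unlabeled)  # y_unl is used only to measure accuracy

def fit_linear(X, y, ridge=1e-2):
    # Ridge-regularized least squares: w = (X^T X + lambda I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

# Initialize the classifier on the labeled set only.
w = fit_linear(X_lab, y_lab)

tau = 0.5  # confidence threshold for accepting pseudo-labels (placeholder value)
for t in range(10):
    scores = X_unl @ w
    keep = np.abs(scores) > tau          # reject low-confidence samples
    pseudo = np.sign(scores[keep])       # pseudo-labels for confident samples
    # Refit on labeled data plus confidently pseudo-labeled data.
    X_all = np.vstack([X_lab, X_unl[keep]])
    y_all = np.concatenate([y_lab, pseudo])
    w = fit_linear(X_all, y_all)
    acc = np.mean(np.sign(X_unl @ w) == y_unl)
    print(f"iteration {t}: accuracy on unlabeled pool {acc:.3f}")

In this toy setting, accuracy on the unlabeled pool typically improves across iterations as confidently pseudo-labeled points are folded back into the fit, loosely mirroring the fixed-point behavior discussed in the abstract.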

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-oymak21a,
  title     = {A Theoretical Characterization of Semi-supervised Learning with Self-training for Gaussian Mixture Models},
  author    = {Oymak, Samet and Cihad Gulcu, Talha},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3601--3609},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/oymak21a/oymak21a.pdf},
  url       = {https://proceedings.mlr.press/v130/oymak21a.html}
}
Endnote
%0 Conference Paper
%T A Theoretical Characterization of Semi-supervised Learning with Self-training for Gaussian Mixture Models
%A Samet Oymak
%A Talha Cihad Gulcu
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-oymak21a
%I PMLR
%P 3601--3609
%U https://proceedings.mlr.press/v130/oymak21a.html
%V 130
APA
Oymak, S. & Cihad Gulcu, T. (2021). A Theoretical Characterization of Semi-supervised Learning with Self-training for Gaussian Mixture Models. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3601-3609. Available from https://proceedings.mlr.press/v130/oymak21a.html.
