Continuous Soft Pseudo-Labeling in ASR

Tatiana Likhomanenko; Ronan Collobert; Navdeep Jaitly; Samy Bengio

Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, PMLR 187:66-84, 2023.

Abstract

Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that the student model being trained must mimic. Interestingly, however, PL strategies generally use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution over sequences (aka soft-labels) as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard-labels because the training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
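
The contrast between soft-label and hard-label targets can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the per-frame cross-entropy, the random teacher and student posteriors, and the names `hard_targets` and `frame_cross_entropy` are all assumptions made for illustration, and the hard-label case is reduced to a per-frame argmax, whereas the abstract describes hard-labels as whole transcripts. The sketch only shows that a frame-level soft-label loss is trivially minimized when teacher and student agree on the same one-hot token for every frame, which is one way a degenerate solution could look attractive to such a loss.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_targets(teacher_probs):
    """Simplified hard labels: one-hot of the per-frame argmax token.
    (Assumption: the paper's hard-labels are transcripts, not frame labels.)"""
    idx = teacher_probs.argmax(axis=-1)
    return np.eye(teacher_probs.shape[-1])[idx]

def frame_cross_entropy(targets, student_probs, eps=1e-12):
    """Mean over frames of -sum_k targets[t, k] * log(student_probs[t, k])."""
    return float(-(targets * np.log(student_probs + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, V = 5, 4  # hypothetical sizes: 5 frames, 4 tokens

teacher = softmax(rng.normal(size=(T, V)))  # toy teacher posteriors
student = softmax(rng.normal(size=(T, V)))  # toy student posteriors

# Soft-labels: the full teacher distribution is the target for each frame.
soft_loss = frame_cross_entropy(teacher, student)

# Hard-labels (simplified): only the teacher's most likely token per frame.
hard_loss = frame_cross_entropy(hard_targets(teacher), student)

# Degenerate case: teacher and student both put all mass on token 0 in every
# frame. The frame-level soft-label loss is then (numerically) zero.
collapsed = np.tile(np.eye(V)[0], (T, 1))
collapsed_loss = frame_cross_entropy(collapsed, collapsed)

print(soft_loss, hard_loss, collapsed_loss)
```

In this toy setting nothing in the frame-level loss penalizes the collapsed solution; a loss computed against a whole transcript, by contrast, ties the frames together, which is in line with the sequence-level consistency argument made in the abstract.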

Cite this Paper

BibTeX


@InProceedings{pmlr-v187-likhomanenko23a,
  title = 	 {Continuous Soft Pseudo-Labeling in ASR},
  author =       {Likhomanenko, Tatiana and Collobert, Ronan and Jaitly, Navdeep and Bengio, Samy},
  booktitle = 	 {Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops},
  pages = 	 {66--84},
  year = 	 {2023},
  editor = 	 {Antorán, Javier and Blaas, Arno and Feng, Fan and Ghalebikesabi, Sahra and Mason, Ian and Pradier, Melanie F. and Rohde, David and Ruiz, Francisco J. R. and Schein, Aaron},
  volume = 	 {187},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {03 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v187/likhomanenko23a/likhomanenko23a.pdf},
  url = 	 {https://proceedings.mlr.press/v187/likhomanenko23a.html},
  abstract = 	 {Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model.  PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution.  In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.}
}

Endnote

%0 Conference Paper
%T Continuous Soft Pseudo-Labeling in ASR
%A Tatiana Likhomanenko
%A Ronan Collobert
%A Navdeep Jaitly
%A Samy Bengio
%B Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops
%C Proceedings of Machine Learning Research
%D 2023
%E Javier Antorán
%E Arno Blaas
%E Fan Feng
%E Sahra Ghalebikesabi
%E Ian Mason
%E Melanie F. Pradier
%E David Rohde
%E Francisco J. R. Ruiz
%E Aaron Schein
%F pmlr-v187-likhomanenko23a
%I PMLR
%P 66--84
%U https://proceedings.mlr.press/v187/likhomanenko23a.html
%V 187
%X Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model.  PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution.  In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.

APA


Likhomanenko, T., Collobert, R., Jaitly, N., & Bengio, S. (2023). Continuous Soft Pseudo-Labeling in ASR. Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, in Proceedings of Machine Learning Research 187:66-84. Available from https://proceedings.mlr.press/v187/likhomanenko23a.html.
