Probability Distillation: A Caveat and Alternatives

Chin-Wei Huang, Faruk Ahmed, Kundan Kumar, Alexandre Lacoste, Aaron Courville
Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, PMLR 115:1212-1221, 2020.

Abstract

Due to Van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based "evaluative" training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an "instructive" signal, and show that it is possible to achieve qualitatively better results than with KL minimization.
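
For context, the distillation objective referred to in the abstract is the reverse KL divergence between a parallel student and a frozen autoregressive teacher. A minimal statement of that standard setup (the symbols $q_\theta$ for the student and $p$ for the teacher are chosen here for illustration and are not notation from the paper) is

$$
\min_\theta \; D_{\mathrm{KL}}\!\left(q_\theta \,\|\, p\right)
\;=\; \mathbb{E}_{x \sim q_\theta}\!\big[\log q_\theta(x) - \log p(x)\big],
$$

where the expectation is estimated from the student's own samples. Because the teacher only scores samples the student already produces, the training signal is "evaluative" in the sense contrasted with an "instructive" signal in the abstract.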

Cite this Paper


BibTeX
@InProceedings{pmlr-v115-huang20c,
  title     = {Probability Distillation: A Caveat and Alternatives},
  author    = {Huang, Chin-Wei and Ahmed, Faruk and Kumar, Kundan and Lacoste, Alexandre and Courville, Aaron},
  booktitle = {Proceedings of The 35th Uncertainty in Artificial Intelligence Conference},
  pages     = {1212--1221},
  year      = {2020},
  editor    = {Adams, Ryan P. and Gogate, Vibhav},
  volume    = {115},
  series    = {Proceedings of Machine Learning Research},
  month     = {22--25 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v115/huang20c/huang20c.pdf},
  url       = {https://proceedings.mlr.press/v115/huang20c.html},
  abstract  = {Due to Van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based "evaluative" training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an "instructive" signal, and show that it is possible to achieve qualitatively better results than with KL minimization.}
}
Endnote
%0 Conference Paper
%T Probability Distillation: A Caveat and Alternatives
%A Chin-Wei Huang
%A Faruk Ahmed
%A Kundan Kumar
%A Alexandre Lacoste
%A Aaron Courville
%B Proceedings of The 35th Uncertainty in Artificial Intelligence Conference
%C Proceedings of Machine Learning Research
%D 2020
%E Ryan P. Adams
%E Vibhav Gogate
%F pmlr-v115-huang20c
%I PMLR
%P 1212--1221
%U https://proceedings.mlr.press/v115/huang20c.html
%V 115
%X Due to Van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based "evaluative" training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an "instructive" signal, and show that it is possible to achieve qualitatively better results than with KL minimization.
APA
Huang, C., Ahmed, F., Kumar, K., Lacoste, A. & Courville, A. (2020). Probability Distillation: A Caveat and Alternatives. Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, in Proceedings of Machine Learning Research 115:1212-1221. Available from https://proceedings.mlr.press/v115/huang20c.html.