On the Learning of Non-Autoregressive Transformers

Fei Huang, Tianhua Tao, Hao Zhou, Lei Li, Minlie Huang
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9356-9376, 2022.

Abstract

The non-autoregressive Transformer (NAT) is a family of text generation models that reduce decoding latency by predicting all tokens of a sentence in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, which makes NAT learning very challenging. In this paper, we present theoretical and empirical analyses that reveal the challenges of NAT learning and propose a unified perspective for understanding existing successes. First, we show that training NAT by simply maximizing the likelihood yields an approximation of the marginal distributions but drops all dependencies between tokens, and that the dropped information can be measured by the dataset’s conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be explained as maximizing the likelihood on a proxy distribution, which reduces the information loss. Empirical studies show that our perspective can explain the phenomena observed in NAT learning and guide the design of new training methods.
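
As a rough illustrative sketch (the notation below is ours, not quoted from the paper): for a source X and a target sentence Y = (y_1, ..., y_n), a fully parallel decoder can at best match the per-position marginals p(y_i | X), and the dependency information it drops corresponds to the conditional total correlation

    C(Y \mid X) = \sum_{i=1}^{n} H(y_i \mid X) - H(Y \mid X),

i.e., the gap between the sum of per-token conditional entropies and the joint conditional entropy. This quantity is zero exactly when the target tokens are conditionally independent given the source, so the more strongly the tokens depend on one another, the more information a marginal-matching NAT loses.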

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-huang22k,
  title     = {On the Learning of Non-Autoregressive Transformers},
  author    = {Huang, Fei and Tao, Tianhua and Zhou, Hao and Li, Lei and Huang, Minlie},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {9356--9376},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/huang22k/huang22k.pdf},
  url       = {https://proceedings.mlr.press/v162/huang22k.html},
  abstract  = {Non-autoregressive Transformer (NAT) is a family of text generation models, which aims to reduce the decoding latency by predicting the whole sentences in parallel. However, such latency reduction sacrifices the ability to capture left-to-right dependencies, thereby making NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective to understand existing successes. First, we show that simply training NAT by maximizing the likelihood can lead to an approximation of marginal distributions but drops all dependencies between tokens, where the dropped information can be measured by the dataset’s conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be concluded as maximizing the likelihood on a proxy distribution, leading to a reduced information loss. Empirical studies show that our perspective can explain the phenomena in NAT learning and guide the design of new training methods.}
}
Endnote
%0 Conference Paper
%T On the Learning of Non-Autoregressive Transformers
%A Fei Huang
%A Tianhua Tao
%A Hao Zhou
%A Lei Li
%A Minlie Huang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-huang22k
%I PMLR
%P 9356--9376
%U https://proceedings.mlr.press/v162/huang22k.html
%V 162
%X Non-autoregressive Transformer (NAT) is a family of text generation models, which aims to reduce the decoding latency by predicting the whole sentences in parallel. However, such latency reduction sacrifices the ability to capture left-to-right dependencies, thereby making NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective to understand existing successes. First, we show that simply training NAT by maximizing the likelihood can lead to an approximation of marginal distributions but drops all dependencies between tokens, where the dropped information can be measured by the dataset’s conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be concluded as maximizing the likelihood on a proxy distribution, leading to a reduced information loss. Empirical studies show that our perspective can explain the phenomena in NAT learning and guide the design of new training methods.
APA
Huang, F., Tao, T., Zhou, H., Li, L. & Huang, M. (2022). On the Learning of Non-Autoregressive Transformers. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9356-9376. Available from https://proceedings.mlr.press/v162/huang22k.html.
