Directed Acyclic Transformer for Non-Autoregressive Machine Translation

Fei Huang; Hao Zhou; Yang Liu; Hang Li; Minlie Huang

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9410-9428, 2022.

Abstract

Non-autoregressive Transformers (NATs) significantly reduce the decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between the tokens for generating multiple possible translations. In this paper, we propose Directed Acyclic Transformer (DA-Transformer), which represents the hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast predictions in a non-autoregressive fashion. Experiments on the raw training data of WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, which is the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation.
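
To make the phrase "each path of the DAG corresponds to a specific translation" concrete, here is a toy sketch in Python. It is not the DA-Transformer itself: the abstract describes hidden states arranged in a DAG, whereas this example assumes a fixed token per vertex and a hand-written edge list. The names TOKENS, EDGES, enumerate_paths, and path_to_sentence are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical toy data: each vertex carries one fixed token.
# (Assumption for illustration; the abstract speaks of hidden states, not fixed tokens.)
TOKENS: Dict[int, str] = {
    0: "<bos>", 1: "I", 2: "like", 3: "love", 4: "tea", 5: "<eos>",
}

# Every edge points from a lower index to a higher index, so the graph is acyclic.
EDGES: Dict[int, List[int]] = {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}


def enumerate_paths(start: int, end: int) -> List[List[int]]:
    """Return every vertex path from start to end using depth-first search."""
    if start == end:
        return [[end]]
    paths: List[List[int]] = []
    for nxt in EDGES[start]:
        for tail in enumerate_paths(nxt, end):
            paths.append([start] + tail)
    return paths


def path_to_sentence(path: List[int]) -> str:
    """Join the tokens along a path, dropping the boundary markers."""
    words = [TOKENS[v] for v in path if TOKENS[v] not in ("<bos>", "<eos>")]
    return " ".join(words)


# One sentence per path: a single graph holds several candidate translations.
translations = [path_to_sentence(p) for p in enumerate_paths(0, 5)]
for sentence in translations:
    print(sentence)
```

The two paths through this graph share the vertices for "I" and "tea" and differ only in the middle vertex, which is the sense in which one DAG can hold multiple translations at once.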

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-huang22m,
  title = 	 {Directed Acyclic Transformer for Non-Autoregressive Machine Translation},
  author =       {Huang, Fei and Zhou, Hao and Liu, Yang and Li, Hang and Huang, Minlie},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {9410--9428},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/huang22m/huang22m.pdf},
  url = 	 {https://proceedings.mlr.press/v162/huang22m.html},
  abstract = 	 {Non-autoregressive Transformers (NATs) significantly reduce the decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between the tokens for generating multiple possible translations. In this paper, we propose Directed Acyclic Transformer (DA-Transformer), which represents the hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast predictions in a non-autoregressive fashion. Experiments on the raw training data of WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, which is the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation.}
}

Endnote

%0 Conference Paper
%T Directed Acyclic Transformer for Non-Autoregressive Machine Translation
%A Fei Huang
%A Hao Zhou
%A Yang Liu
%A Hang Li
%A Minlie Huang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-huang22m
%I PMLR
%P 9410--9428
%U https://proceedings.mlr.press/v162/huang22m.html
%V 162
%X Non-autoregressive Transformers (NATs) significantly reduce the decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between the tokens for generating multiple possible translations. In this paper, we propose Directed Acyclic Transformer (DA-Transformer), which represents the hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast predictions in a non-autoregressive fashion. Experiments on the raw training data of WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, which is the first NAT model that achieves competitive results with autoregressive Transformers without relying on knowledge distillation.

APA


Huang, F., Zhou, H., Liu, Y., Li, H., & Huang, M. (2022). Directed Acyclic Transformer for Non-Autoregressive Machine Translation. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9410-9428. Available from https://proceedings.mlr.press/v162/huang22m.html.
