Non-autoregressive Machine Translation with Disentangled Context Transformer

Jungo Kasai, James Cross, Marjan Ghazvininejad, Jiatao Gu
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5144-5155, 2020.

Abstract

State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel. We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. The DisCo transformer is trained to predict every output token given an arbitrary subset of the other reference tokens. We also develop the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations. Our extensive experiments on 7 translation directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
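
The training setup described in the abstract, predicting each output token from an arbitrary subset of the other reference tokens, can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the function name `sample_disentangled_mask`, the uniform choice of context size, and the list-of-lists mask layout are assumptions made here purely for illustration.

```python
import random


def sample_disentangled_mask(length, rng):
    """Build a hypothetical per-position context mask.

    mask[i][j] is True when the prediction at position i is allowed to
    attend to the reference token at position j. Each position gets its
    own randomly sampled subset of the *other* positions, so no token
    ever sees itself.
    """
    mask = []
    for i in range(length):
        others = [j for j in range(length) if j != i]
        # Assumption: the context size is drawn uniformly from 0..len(others).
        k = rng.randint(0, len(others))
        visible = set(rng.sample(others, k))
        mask.append([j in visible for j in range(length)])
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    for row in sample_disentangled_mask(5, rng):
        # '1' marks a visible context token, '.' a hidden one.
        print("".join("1" if v else "." for v in row))
```

Every row of the mask has a hidden entry on the diagonal, which reflects the constraint that a token is predicted only from the other tokens; because each row is sampled independently, all positions can be predicted in one pass under different contexts.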

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-kasai20a,
  title = {Non-autoregressive Machine Translation with Disentangled Context Transformer},
  author = {Kasai, Jungo and Cross, James and Ghazvininejad, Marjan and Gu, Jiatao},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {5144--5155},
  year = {2020},
  editor = {Daumé III, Hal and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/kasai20a/kasai20a.pdf},
  url = {http://proceedings.mlr.press/v119/kasai20a.html}
}
Endnote
%0 Conference Paper
%T Non-autoregressive Machine Translation with Disentangled Context Transformer
%A Jungo Kasai
%A James Cross
%A Marjan Ghazvininejad
%A Jiatao Gu
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-kasai20a
%I PMLR
%P 5144--5155
%U http://proceedings.mlr.press/v119/kasai20a.html
%V 119
APA
Kasai, J., Cross, J., Ghazvininejad, M., & Gu, J. (2020). Non-autoregressive Machine Translation with Disentangled Context Transformer. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:5144-5155. Available from http://proceedings.mlr.press/v119/kasai20a.html.
