Sequence Generation with Mixed Representations

Lijun Wu, Shufang Xie, Yingce Xia, Yang Fan, Jian-Huang Lai, Tao Qin, Tieyan Liu
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:10388-10398, 2020.

Abstract

Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding (BPE), which can greatly reduce the vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage mixed representations from different tokenization methods for sequence generation tasks, in order to boost model performance by exploiting the unique characteristics and advantages of individual tokenization methods. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation (NMT) tasks with six language pairs (e.g., English$\leftrightarrow$German, English$\leftrightarrow$Romanian), as well as an abstractive summarization task.
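
To make the role of subword tokenization concrete, below is a minimal, illustrative Python sketch of byte-pair encoding: it learns merge rules from a toy word-frequency table and then segments an unseen word into known subwords instead of treating it as out-of-vocabulary. The corpus, merge count, and function names are illustrative only and are not taken from the paper or its implementation.

# Toy BPE sketch (illustrative, not the paper's code): learn merge rules from a
# small word-frequency table, then segment words, including unseen ones, into
# subword units.
from collections import Counter

def learn_bpe(word_freqs, num_merges=10):
    """Learn BPE merge rules from a {word: frequency} dictionary."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a single word using previously learned merge rules."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

if __name__ == "__main__":
    corpus = {"lower": 5, "low": 6, "newest": 3, "widest": 2}
    merges = learn_bpe(corpus, num_merges=10)
    # "lowest" never appears in the toy corpus, yet it is segmented into known
    # subwords rather than becoming an out-of-vocabulary token.
    print(apply_bpe("lowest", merges))

Different tokenizers (e.g., character-level, BPE with different merge counts, word-level) would segment the same sentence into different sequences; the diversity across such segmentations is what the paper's mixed-representation architecture and co-teaching algorithm aim to exploit.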

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-wu20e,
  title     = {Sequence Generation with Mixed Representations},
  author    = {Wu, Lijun and Xie, Shufang and Xia, Yingce and Fan, Yang and Lai, Jian-Huang and Qin, Tao and Liu, Tieyan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {10388--10398},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/wu20e/wu20e.pdf},
  url       = {https://proceedings.mlr.press/v119/wu20e.html},
  abstract  = {Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding (BPE), which can greatly reduce the vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage mixed representations from different tokenization methods for sequence generation tasks, in order to boost model performance by exploiting the unique characteristics and advantages of individual tokenization methods. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation (NMT) tasks with six language pairs (e.g., English$\leftrightarrow$German, English$\leftrightarrow$Romanian), as well as an abstractive summarization task.}
}
Endnote
%0 Conference Paper
%T Sequence Generation with Mixed Representations
%A Lijun Wu
%A Shufang Xie
%A Yingce Xia
%A Yang Fan
%A Jian-Huang Lai
%A Tao Qin
%A Tieyan Liu
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-wu20e
%I PMLR
%P 10388--10398
%U https://proceedings.mlr.press/v119/wu20e.html
%V 119
%X Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding (BPE), which can greatly reduce the vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage mixed representations from different tokenization methods for sequence generation tasks, in order to boost model performance by exploiting the unique characteristics and advantages of individual tokenization methods. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation (NMT) tasks with six language pairs (e.g., English$\leftrightarrow$German, English$\leftrightarrow$Romanian), as well as an abstractive summarization task.
APA
Wu, L., Xie, S., Xia, Y., Fan, Y., Lai, J., Qin, T. & Liu, T. (2020). Sequence Generation with Mixed Representations. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:10388-10398. Available from https://proceedings.mlr.press/v119/wu20e.html.