XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Junjie Hu; Sebastian Ruder; Aditya Siddhant; Graham Neubig; Orhan Firat; Melvin Johnson

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4411-4421, 2020.

Abstract

Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We will release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-hu20b,
  title = 	 {{XTREME}: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation},
  author =       {Hu, Junjie and Ruder, Sebastian and Siddhant, Aditya and Neubig, Graham and Firat, Orhan and Johnson, Melvin},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {4411--4421},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/hu20b/hu20b.pdf},
  url = 	 {https://proceedings.mlr.press/v119/hu20b.html},
  abstract = 	 {Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We will release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.}
}

Endnote

%0 Conference Paper
%T XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation
%A Junjie Hu
%A Sebastian Ruder
%A Aditya Siddhant
%A Graham Neubig
%A Orhan Firat
%A Melvin Johnson
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh	
%F pmlr-v119-hu20b
%I PMLR
%P 4411--4421
%U https://proceedings.mlr.press/v119/hu20b.html
%V 119
%X Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We will release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.

APA

Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O. & Johnson, M.. (2020). XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4411-4421 Available from https://proceedings.mlr.press/v119/hu20b.html.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Abstract

Cite this Paper

Related Material