Bootstrapping a high quality multilingual multimodal dataset for Bletchley

Owais Khan Mohammed, Kriti Aggarwal, Qiang Liu, Saksham Singhal, Johan Bjorck, Subhojit Som
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:738-753, 2023.

Abstract

Vision-language models have recently made impressive strides, primarily driven by large-scale training on web data. While pioneering works such as CLIP and ALIGN show significant improvements, they focus on English data, which is easy to source from the web. Towards serving non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using only English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to create a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that this dataset, used to train Bletchley, results in a strong multimodal and multilingual model that performs well across several multilingual zero-shot tasks. In particular, Bletchley achieves state-of-the-art results on the multilingual COCO, Multi30k, IGLUE WIT and xFlickr&CO datasets.
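
As a rough illustration of the bootstrapping step described above, the sketch below scores candidate (image, non-English caption) pairs with a dual-encoder vision-language model and keeps only well-aligned pairs. The encoder functions, the similarity threshold, and the pair format are illustrative assumptions, not Bletchley's actual filtering code, which the paper does not expose.

# Hedged sketch of the bootstrapping filter: score each candidate
# (image, non-English caption) pair with a dual-encoder model trained on
# English captions plus text-only translation pairs, then keep only
# pairs whose image-text similarity clears a threshold.
# encode_image / encode_text are hypothetical stand-ins for real encoders.

import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

def encode_image(image_path: str) -> np.ndarray:
    # Placeholder: a real implementation would run the vision encoder.
    return rng.standard_normal(EMBED_DIM)

def encode_text(caption: str) -> np.ndarray:
    # Placeholder: a real implementation would run the multilingual text encoder.
    return rng.standard_normal(EMBED_DIM)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def bootstrap_filter(pairs, threshold=0.3):
    """Keep only (image, caption) pairs the model scores as well aligned."""
    kept = []
    for image_path, caption in pairs:
        score = cosine_similarity(encode_image(image_path), encode_text(caption))
        if score >= threshold:
            kept.append((image_path, caption, score))
    return kept

# Example: noisy web-mined multilingual captions; boilerplate text should score low.
noisy_pairs = [("img_001.jpg", "Ein Hund spielt im Park"),
               ("img_002.jpg", "Click here to subscribe")]
clean_pairs = bootstrap_filter(noisy_pairs)
print(f"kept {len(clean_pairs)} of {len(noisy_pairs)} pairs")

The threshold value here is arbitrary; in practice it would be tuned against held-out human-verified caption pairs.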

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-mohammed23a,
  title     = {Bootstrapping a high quality multilingual multimodal dataset for Bletchley},
  author    = {Mohammed, Owais Khan and Aggarwal, Kriti and Liu, Qiang and Singhal, Saksham and Bjorck, Johan and Som, Subhojit},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages     = {738--753},
  year      = {2023},
  editor    = {Khan, Emtiyaz and Gonen, Mehmet},
  volume    = {189},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--14 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v189/mohammed23a/mohammed23a.pdf},
  url       = {https://proceedings.mlr.press/v189/mohammed23a.html},
  abstract  = {Vision-language models have recently made impressive strides, primarily driven by large-scale training on web data. While pioneering works such as CLIP and ALIGN show significant improvements, they focus on English data, which is easy to source from the web. Towards serving non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using only English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to create a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that this dataset, used to train Bletchley, results in a strong multimodal and multilingual model that performs well across several multilingual zero-shot tasks. In particular, Bletchley achieves state-of-the-art results on the multilingual COCO, Multi30k, IGLUE WIT and xFlickr&CO datasets.}
}
Endnote
%0 Conference Paper
%T Bootstrapping a high quality multilingual multimodal dataset for Bletchley
%A Owais Khan Mohammed
%A Kriti Aggarwal
%A Qiang Liu
%A Saksham Singhal
%A Johan Bjorck
%A Subhojit Som
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-mohammed23a
%I PMLR
%P 738--753
%U https://proceedings.mlr.press/v189/mohammed23a.html
%V 189
%X Vision-language models have recently made impressive strides, primarily driven by large-scale training on web data. While pioneering works such as CLIP and ALIGN show significant improvements, they focus on English data, which is easy to source from the web. Towards serving non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using only English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to create a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that this dataset, used to train Bletchley, results in a strong multimodal and multilingual model that performs well across several multilingual zero-shot tasks. In particular, Bletchley achieves state-of-the-art results on the multilingual COCO, Multi30k, IGLUE WIT and xFlickr&CO datasets.
APA
Mohammed, O.K., Aggarwal, K., Liu, Q., Singhal, S., Bjorck, J., & Som, S. (2023). Bootstrapping a high quality multilingual multimodal dataset for Bletchley. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:738-753. Available from https://proceedings.mlr.press/v189/mohammed23a.html.
