Bootstrapping a high quality multilingual multimodal dataset for Bletchley

Owais Khan Mohammed; Kriti Aggarwal; Qiang Liu; Saksham Singhal; Johan Bjorck; Subhojit Som

Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:738-753, 2023.

Abstract

Vision-language models have recently made impressive strides, primarily driven by large-scale training on web data. While pioneering works such as CLIP and ALIGN show significant improvements, they focus on English data, which is easy to source from the web. Towards serving non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using just English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to create a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that this dataset, which was used to train Bletchley, results in a strong multimodal and multilingual model that performs well across several multilingual zero-shot tasks. Specifically, Bletchley achieves state-of-the-art results on the multilingual COCO, Multi30k, IGLUE WIT, and xFlickr&CO datasets.
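The abstract does not say how the bootstrapped model is used to clean the collected captions. As a minimal sketch, assuming the cleaning step keeps only image-caption pairs whose embeddings from that model are sufficiently similar, the filter might look like the following. The function names, the cosine-similarity criterion, and the threshold value are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    image_embeddings: Sequence[Sequence[float]],
    caption_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.3,  # hypothetical cutoff, not a value from the paper
) -> List[int]:
    """Return the indices of image-caption pairs scoring at or above threshold.

    The embeddings are assumed to come from an already trained multilingual
    vision-language model; here they are plain lists of floats.
    """
    if len(image_embeddings) != len(caption_embeddings):
        raise ValueError("expected one caption embedding per image embedding")
    return [
        i
        for i, (img, cap) in enumerate(zip(image_embeddings, caption_embeddings))
        if cosine_similarity(img, cap) >= threshold
    ]


# Toy embeddings: pairs 0 and 2 are well aligned, pair 1 is orthogonal (noise).
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.9]]
kept = filter_pairs(images, captions, threshold=0.5)
```

In practice the cutoff would need tuning per language, since the similarity scores of a bootstrapped model may be calibrated differently for languages it saw only through translation pairs.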

Cite this Paper

BibTeX


@InProceedings{pmlr-v189-mohammed23a,
  title = 	 {Bootstrapping a high quality multilingual multimodal
 dataset for Bletchley},
  author =       {Mohammed, Owais Khan and Aggarwal, Kriti and Liu, Qiang and Singhal, Saksham and Bjorck, Johan and Som, Subhojit},
  booktitle = 	 {Proceedings of The 14th Asian Conference on Machine
 Learning},
  pages = 	 {738--753},
  year = 	 {2023},
  editor = 	 {Khan, Emtiyaz and Gonen, Mehmet},
  volume = 	 {189},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v189/mohammed23a/mohammed23a.pdf},
  url = 	 {https://proceedings.mlr.press/v189/mohammed23a.html},
  abstract = 	 {Vision-language models have recently made impressive
 strides, primarily driven by large-scale training on
 web data. While pioneering works such as CLIP and
 ALIGN show significant improvements, these are
 focused on English data as it is easy to source them
 from the web. Towards serving non-English-speaking
 demographics, we consider various methods for
 generating multilingual data and find that a simple
 bootstrapping mechanism works surprisingly
 well. Specifically, just using English image
 captions data and text-only multilingual translation
 pairs we train a fairly strong multilingual
 vision-language model and then leverage it to create
 a much cleaner version of the multilingual image
 captions dataset we collected. We demonstrate that
 this dataset, which was used to train Bletchley,
 results in a strong multi-modal and multilingual
 model which reaches strong performance across
 several multilingual zero-shot tasks. Specifically,
 Bletchley achieves state-of-the-art results on
 multilingual COCO, Multi30k sets, IGLUE WIT and
 xFlickr\&CO datasets.}
}

Endnote

%0 Conference Paper
%T Bootstrapping a high quality multilingual multimodal
 dataset for Bletchley
%A Owais Khan Mohammed
%A Kriti Aggarwal
%A Qiang Liu
%A Saksham Singhal
%A Johan Bjorck
%A Subhojit Som
%B Proceedings of The 14th Asian Conference on Machine
 Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-mohammed23a
%I PMLR
%P 738--753
%U https://proceedings.mlr.press/v189/mohammed23a.html
%V 189
%X Vision-language models have recently made impressive
 strides, primarily driven by large-scale training on
 web data. While pioneering works such as CLIP and
 ALIGN show significant improvements, these are
 focused on English data as it is easy to source them
 from the web. Towards serving non-English-speaking
 demographics, we consider various methods for
 generating multilingual data and find that a simple
 bootstrapping mechanism works surprisingly
 well. Specifically, just using English image
 captions data and text-only multilingual translation
 pairs we train a fairly strong multilingual
 vision-language model and then leverage it to create
 a much cleaner version of the multilingual image
 captions dataset we collected. We demonstrate that
 this dataset, which was used to train Bletchley,
 results in a strong multi-modal and multilingual
 model which reaches strong performance across
 several multilingual zero-shot tasks. Specifically,
 Bletchley achieves state-of-the-art results on
 multilingual COCO, Multi30k sets, IGLUE WIT and
 xFlickr&CO datasets.

APA


Mohammed, O.K., Aggarwal, K., Liu, Q., Singhal, S., Bjorck, J. & Som, S. (2023). Bootstrapping a high quality multilingual multimodal dataset for Bletchley. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:738-753. Available from https://proceedings.mlr.press/v189/mohammed23a.html.
