Bootstrapping a high quality multilingual multimodal dataset for Bletchley
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:738-753, 2023.
Vision-language models have recently made impressive strides, driven primarily by large-scale training on web data. While pioneering works such as CLIP and ALIGN show significant improvements, they focus on English data because it is easy to source from the web. To serve non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using only English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to produce a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that training Bletchley on this dataset results in a strong multimodal, multilingual model that performs well across several multilingual zero-shot tasks. In particular, Bletchley achieves state-of-the-art results on multilingual COCO, the Multi30k sets, and the IGLUE WIT and xFlickr&CO datasets.
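To make the bootstrapping step concrete, the sketch below shows one plausible way the filtering could work, assuming the stage-1 model is a CLIP-style dual encoder that exposes `encode_image` and `encode_text` methods; the interface, function name, and similarity threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of bootstrapped dataset filtering, assuming a CLIP-style
# dual encoder. The model interface and the 0.3 threshold are hypothetical.
from typing import Iterable, List, Tuple

import torch


def filter_multilingual_captions(
    model,  # assumed stage-1 model trained on English captions + translation pairs
    pairs: Iterable[Tuple[torch.Tensor, str]],  # (preprocessed image, multilingual caption)
    threshold: float = 0.3,  # illustrative similarity cutoff
) -> List[Tuple[torch.Tensor, str]]:
    """Keep only (image, caption) pairs that the stage-1 model scores highly."""
    kept = []
    with torch.no_grad():
        for image, caption in pairs:
            # Assumed encoder interface: returns one embedding per input.
            img_emb = model.encode_image(image.unsqueeze(0))
            txt_emb = model.encode_text([caption])
            # Cosine similarity between L2-normalized embeddings.
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
            score = (img_emb @ txt_emb.T).item()
            if score >= threshold:
                kept.append((image, caption))
    return kept
```

The cleaned pairs returned by such a filter would then form the higher-quality multilingual caption dataset used for the final training stage.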