Bootstrapping a high quality multilingual multimodal dataset for Bletchley
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:738-753, 2023.
Abstract
Vision-language models have recently made impressive
strides, primarily driven by large-scale training on
web data. While pioneering works such as CLIP and
ALIGN show significant improvements, they focus on English data, which is easy to source from the web. Towards serving non-English-speaking demographics, we consider various methods for generating multilingual data and find that a simple bootstrapping mechanism works surprisingly well. Specifically, using only English image-caption data and text-only multilingual translation pairs, we train a fairly strong multilingual vision-language model and then leverage it to create a much cleaner version of the multilingual image-caption dataset we collected. We demonstrate that this dataset, used to train Bletchley, results in a strong multimodal and multilingual model that performs well across several multilingual zero-shot tasks. Specifically, Bletchley achieves state-of-the-art results on the multilingual COCO, Multi30k, IGLUE WIT, and xFlickr&CO datasets.
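
The abstract does not spell out the filtering step, but a minimal sketch of the bootstrapping-based cleaning it describes might look like the following. All names here are illustrative assumptions: we assume the intermediate multilingual vision-language model exposes image and text encoders producing fixed-size embeddings, and that pairs are kept or dropped by a single cosine-similarity threshold.

```python
# Hypothetical sketch of cleaning a noisy multilingual image-caption set with
# a bootstrapped vision-language model. The encoders are stubbed out with
# random embeddings; in practice they would come from the trained model.
import numpy as np


def filter_caption_pairs(image_embs: np.ndarray,
                         text_embs: np.ndarray,
                         threshold: float = 0.3) -> np.ndarray:
    """Return indices of (image, caption) pairs whose cosine similarity
    under the bootstrapped model is at least `threshold`."""
    # Normalise so the row-wise dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(image_embs * text_embs, axis=1)
    return np.nonzero(sims >= threshold)[0]


# Example with random embeddings standing in for the real encoders.
rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 512))   # placeholder image embeddings
txt = rng.normal(size=(1000, 512))   # placeholder caption embeddings
kept = filter_caption_pairs(img, txt, threshold=0.05)
print(f"kept {len(kept)} of 1000 noisy web pairs")
```

The threshold value and the use of a single global cutoff are assumptions for illustration; the paper's actual cleaning criterion may differ.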