Parrot: Multilingual Visual Instruction Tuning
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:57984-58007, 2025.
Abstract
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often erodes the LLM's ability to handle multiple languages as training progresses. We empirically observe that imbalanced, largely English-centric SFT datasets degrade performance on non-English languages because visual tokens fail to align with multilingual textual tokens during SFT. To address this, we propose Parrot, a novel approach that leverages textual guidance to drive visual token alignment at the language level. Parrot conditions visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to align multilingual tokens: cross-attention between the initial visual features and the textual embeddings selects the most relevant experts, which convert the visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Parrot achieves state-of-the-art performance on multilingual benchmarks as well as a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot.
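For intuition, here is a minimal PyTorch sketch of the mechanism the abstract describes: cross-attention between the initial visual features and the textual embeddings produces a routing signal, an MoE router scores the experts, and the expert outputs convert visual tokens into language-specific representations. All module names, dimensions, the soft routing over all experts, and the residual connection are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of Parrot-style language-level visual token alignment.
# Assumptions: MLP experts, mean-pooled cross-attention output as the routing
# signal, softmax-weighted expert mixture, and a residual connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageMoEProjector(nn.Module):
    def __init__(self, dim: int = 1024, num_experts: int = 6, num_heads: int = 8):
        super().__init__()
        # Cross-attention: visual tokens (queries) attend to textual embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Router maps the pooled cross-attention output to expert logits.
        self.router = nn.Linear(dim, num_experts)
        # One lightweight expert per language (assumed MLP form).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, dim) initial visual tokens from the vision encoder
        # text:   (B, Nt, dim) textual embeddings of the (multilingual) prompt
        attended, _ = self.cross_attn(query=visual, key=text, value=text)
        # Pool over visual positions to get one routing vector per sample.
        gate_logits = self.router(attended.mean(dim=1))   # (B, E)
        weights = F.softmax(gate_logits, dim=-1)          # (B, E)
        # Mix expert outputs by router probability, yielding
        # language-specific visual tokens.
        expert_out = torch.stack([e(visual) for e in self.experts], dim=1)  # (B, E, Nv, dim)
        out = (weights[:, :, None, None] * expert_out).sum(dim=1)
        # Residual keeps the original visual information intact (assumption).
        return visual + out

# Usage: 256 visual tokens and a 32-token prompt, both in a 1024-d space.
proj = LanguageMoEProjector()
tokens = proj(torch.randn(2, 256, 1024), torch.randn(2, 32, 1024))  # (2, 256, 1024)
```

The six experts here mirror the six languages covered by MMMB; the actual expert count, routing strategy, and expert architecture used by Parrot should be taken from the released code rather than this sketch.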