[edit]
Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:27732-27751, 2024.
Abstract
Previous efforts using frozen Large Language Models (LLMs) for visual understanding, via image captioning or image-text retrieval tasks, face challenges when dealing with complex multimodal scenarios. In order to enhance the capabilities of Multimodal Large Language Models (MLLM) in comprehending the context of vision and language, we introduce Multimodal Composition Learning (MCL) for the purpose of mapping or aligning the vision and language input. In particular, we introduce two tasks: Multimodal-Context Captioning (MC-Cap) and Multimodal-Context Retrieval (MC-Ret) to guide a frozen LLM in comprehending the vision and language context. These specialized tasks are crafted to improve the LLM’s capacity for efficient processing and utilization of multimodal inputs, thereby enhancing its proficiency in generating more accurate text or visual representations. Extensive experiments on both retrieval tasks (i.e., zero-shot composed image retrieval, visual storytelling image retrieval and visual dialog image retrieval) and text generation tasks (i.e., visual question answering) demonstrate the effectiveness of the proposed method. The code is available at: https://github.com/dhg-wei/MCL.