DCoT: Dual Chain-of-Thought Prompting for Large Multimodal Models
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:1064-1079, 2025.
Abstract
Inference augmentation techniques such as Chain-of-Thought have already made their mark in Large Language Models (LLMs), but transferring these advances to Large Multimodal Models (LMMs) presents greater challenges. Drawing inspiration from human cognitive processes, this paper proposes Dual Chain-of-Thought (DCoT), a plug-and-play strategy: a novel pipeline that combines visual and textual guidance to improve the performance of LMMs on complex multimodal tasks. DCoT uses a dual guidance mechanism. On the visual side, bounding-box markers direct the model's attention to the image regions relevant to the query, providing fine-grained image guidance. On the textual side, we propose a Fast In-Context Retrieval Framework (FICRF) that dynamically and automatically retrieves the most suitable examples from a pre-built cluster of demonstration examples as context guidance for the current question. This bimodal approach, leveraging both visual and textual guidance, enhances the inference capabilities of LMMs. Extensive experiments across different LMMs and benchmark datasets validate its effectiveness, opening a new path in multimodal inference: the synergistic combination of visual and textual instructions takes the performance of these models to new heights, while demonstrating the potential of Chain-of-Thought and In-Context Learning as a superior alternative to fine-tuning LMMs.
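The abstract describes the pipeline but not its implementation; the following is a minimal Python sketch of the two guidance channels under stated assumptions. The helper names (mark_region, retrieve_demos, build_dcot_prompt), the box coordinates, and the precomputed embedding matrix are illustrative inventions, and FICRF's demonstration-cluster retrieval is simplified here to flat cosine-similarity nearest neighbors.

```python
import numpy as np
from PIL import Image, ImageDraw

def mark_region(image: Image.Image, box: tuple) -> Image.Image:
    """Visual guidance (sketch): overlay a bounding box so the LMM
    attends to the query-relevant region. Coordinates are assumed
    to come from an upstream grounding step not shown here."""
    marked = image.copy()
    ImageDraw.Draw(marked).rectangle(box, outline="red", width=3)
    return marked

def retrieve_demos(query_emb: np.ndarray,
                   demo_embs: np.ndarray,
                   demos: list[str],
                   k: int = 2) -> list[str]:
    """Textual guidance (sketch): pick the k demonstrations whose
    embeddings are most cosine-similar to the query embedding,
    standing in for FICRF's clustered demonstration store."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [demos[i] for i in np.argsort(-sims)[:k]]

def build_dcot_prompt(question: str, demos: list[str]) -> str:
    """Assemble the retrieved demonstrations and the question into a
    chain-of-thought prompt to pair with the marked image."""
    context = "\n\n".join(demos)
    return f"{context}\n\nQuestion: {question}\nLet's think step by step."
```

In use, the marked image and the assembled prompt would be passed together to the LMM, so that the visual and textual guidance channels act on the same query.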