PatchPrune: Reducing Hallucinations in Vision Language Models by Pruning Redundant Image Patches
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:298-304, 2025.
Abstract
Large language models (LLMs) have advanced significantly in natural language processing, and vision language models (VLMs) have extended this progress to tasks such as image captioning and visual question answering (VQA). Despite this success, VLMs often generate hallucinated or factually inconsistent content. Traditional methods focus on improving model reasoning by modifying the inference procedure; we instead propose a new approach, PatchPrune, which dynamically prunes redundant or uninformative image patches using a composite importance score based on activation magnitude and feature entropy. By reducing input noise, PatchPrune enables the model to focus on relevant features, improving the accuracy and reliability of its outputs. Experimental results show that PatchPrune enhances multimodal reasoning and mitigates hallucinations effectively.
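The abstract only names the two signals behind the importance score, so the following is a minimal sketch of how such a scoring-and-pruning step might look, assuming per-patch embeddings from the vision encoder. The weighting `alpha`, the `keep_ratio`, the sign given to the entropy term, and the softmax-based entropy formulation are illustrative assumptions, not the paper's definitions.

```python
import torch

def patch_importance(patch_feats: torch.Tensor, alpha: float = 0.5,
                     eps: float = 1e-8) -> torch.Tensor:
    """Score each image patch from activation magnitude and feature entropy.

    patch_feats: (num_patches, dim) patch embeddings from the vision encoder.
    Returns a (num_patches,) score; higher means the patch is kept longer.
    """
    # Activation magnitude: L2 norm of each patch embedding.
    magnitude = patch_feats.norm(dim=-1)

    # Feature entropy: treat each patch's softmaxed feature vector as a
    # distribution; a flat (high-entropy) profile is assumed here to signal
    # an uninformative patch, so entropy is subtracted in the composite score.
    probs = torch.softmax(patch_feats, dim=-1)
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)

    # Composite score: a simple weighted combination (alpha is a hypothetical knob).
    return alpha * magnitude - (1.0 - alpha) * entropy


def prune_patches(patch_feats: torch.Tensor, keep_ratio: float = 0.6) -> torch.Tensor:
    """Keep only the top-scoring patches before they reach the language model."""
    scores = patch_importance(patch_feats)
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    keep_idx = scores.topk(k).indices.sort().values  # preserve spatial order
    return patch_feats[keep_idx]
```

In this sketch the pruned patch sequence would simply replace the full sequence as visual input to the language model; how the actual method selects the pruning ratio or integrates with the VLM pipeline is not specified in the abstract.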