SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:48027-48040, 2025.

Abstract

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information across scales, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing approaches.
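
The contrastive step described in the abstract is easiest to picture at the logit level. Below is a minimal, hypothetical PyTorch sketch of a single contrast between two visual scales; the function name, the alpha/beta hyperparameters, and the plausibility cutoff follow the generic contrastive-decoding recipe and are illustrative assumptions, not the authors' implementation, which additionally performs iterative, object-centric scale selection.

import torch

def contrastive_next_token_logits(
    logits_fine: torch.Tensor,    # logits conditioned on a selected high-resolution view
    logits_coarse: torch.Tensor,  # logits conditioned on the coarse full-image view
    alpha: float = 1.0,           # contrast strength (assumed hyperparameter)
    beta: float = 0.1,            # plausibility cutoff (assumed hyperparameter)
) -> torch.Tensor:
    """Boost tokens the fine-scale view supports more strongly than the coarse view."""
    # Plausibility constraint: keep only tokens whose fine-scale probability
    # is within a factor of beta of the most likely token.
    probs_fine = logits_fine.softmax(dim=-1)
    cutoff = beta * probs_fine.max(dim=-1, keepdim=True).values
    plausible = probs_fine >= cutoff

    # Contrast: amplify fine-scale evidence, subtract coarse-scale evidence.
    contrast = (1 + alpha) * logits_fine - alpha * logits_coarse
    return contrast.masked_fill(~plausible, float("-inf"))

# Toy usage over a 5-token vocabulary: the fine view alone would pick token 1,
# but contrasting away the coarse view's shared preference flips the choice to token 0.
fine = torch.tensor([1.5, 1.6, -1.0, 0.1, 0.0])
coarse = torch.tensor([0.0, 1.5, -1.0, 0.1, 0.0])
print(contrastive_next_token_logits(fine, coarse).argmax(-1))  # tensor(0)

In this sketch the coarse view plays the "amateur" role of contrastive decoding: predictions it shares with the fine view are discounted, so only evidence grounded in the selected fine-scale region survives, which is one plausible reading of how contrasting across scales suppresses perceptual hallucination.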

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-park25c,
  title     = {{SECOND}: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding},
  author    = {Park, Woohyeon and Kim, Woojin and Kim, Jaeik and Do, Jaeyoung},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {48027--48040},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/park25c/park25c.pdf},
  url       = {https://proceedings.mlr.press/v267/park25c.html}
}
Endnote
%0 Conference Paper
%T SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
%A Woohyeon Park
%A Woojin Kim
%A Jaeik Kim
%A Jaeyoung Do
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-park25c
%I PMLR
%P 48027--48040
%U https://proceedings.mlr.press/v267/park25c.html
%V 267
APA
Park, W., Kim, W., Kim, J. & Do, J. (2025). SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:48027-48040. Available from https://proceedings.mlr.press/v267/park25c.html.
