LEMoN: Label Error Detection using Multimodal Neighbors

Haoran Zhang, Aparna Balagopalan, Nassim Oufattole, Hyewon Jeong, Yan Wu, Jiacheng Zhu, Marzyeh Ghassemi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74447-74489, 2025.

Abstract

Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. To improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. Yet beyond filtering based on image-caption embedding similarity, no prior work has proposed other methods to filter noisy multimodal data or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.
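
As a rough illustration of the neighbor-based idea sketched in the abstract, the snippet below scores each image-caption pair by combining its direct embedding distance with the distances from its caption (image) to the captions (images) of its nearest cross-modal neighbors. This is a minimal sketch, not the paper's actual LEMoN scoring function: the cosine-distance metric, the neighborhood size k, the weights alpha and beta, and the random vectors standing in for real CLIP embeddings are all assumptions made for demonstration. The intuition is that a correct caption should lie close to the captions of visually similar images, so a mislabeled pair tends to score high on both the direct and the neighbor terms.

# Minimal sketch of neighbor-based label-error scoring for image-caption
# pairs. Illustrates the general idea only; NOT the paper's exact LEMoN
# score. Real usage would replace the random vectors with CLIP image and
# text embeddings of the same dataset; k, alpha, beta are illustrative.
import numpy as np

def cosine_distance(a, b):
    # Pairwise cosine distances between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def neighbor_label_scores(img_emb, txt_emb, k=5, alpha=1.0, beta=1.0):
    # Higher score = more likely the caption does not match the image.
    n = img_emb.shape[0]
    d_pair = np.diag(cosine_distance(img_emb, txt_emb))  # direct mismatch

    d_img = cosine_distance(img_emb, img_emb)
    d_txt = cosine_distance(txt_emb, txt_emb)
    np.fill_diagonal(d_img, np.inf)  # exclude self-matches
    np.fill_diagonal(d_txt, np.inf)
    img_nbrs = np.argsort(d_img, axis=1)[:, :k]  # k most similar images
    txt_nbrs = np.argsort(d_txt, axis=1)[:, :k]  # k most similar captions

    # A correct caption should sit close to the captions of visually
    # similar images, and the image close to the images of similar captions.
    nbr_txt = np.array([d_txt[i, img_nbrs[i]].mean() for i in range(n)])
    nbr_img = np.array([d_img[i, txt_nbrs[i]].mean() for i in range(n)])
    return d_pair + alpha * nbr_txt + beta * nbr_img

# Toy usage with random stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 64))
txt = img + 0.1 * rng.normal(size=(100, 64))  # mostly matched pairs
txt[:5] = rng.normal(size=(5, 64))            # corrupt five captions
scores = neighbor_label_scores(img, txt)
print("top suspects:", np.argsort(-scores)[:5])  # should flag indices 0-4

In practice, the embeddings would come from a contrastively pretrained model such as CLIP, and the score threshold used for filtering would be tuned on a validation set.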

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25b,
  title     = {{LEM}o{N}: Label Error Detection using Multimodal Neighbors},
  author    = {Zhang, Haoran and Balagopalan, Aparna and Oufattole, Nassim and Jeong, Hyewon and Wu, Yan and Zhu, Jiacheng and Ghassemi, Marzyeh},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {74447--74489},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25b/zhang25b.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25b.html}
}
Endnote
%0 Conference Paper
%T LEMoN: Label Error Detection using Multimodal Neighbors
%A Haoran Zhang
%A Aparna Balagopalan
%A Nassim Oufattole
%A Hyewon Jeong
%A Yan Wu
%A Jiacheng Zhu
%A Marzyeh Ghassemi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25b
%I PMLR
%P 74447--74489
%U https://proceedings.mlr.press/v267/zhang25b.html
%V 267
APA
Zhang, H., Balagopalan, A., Oufattole, N., Jeong, H., Wu, Y., Zhu, J. & Ghassemi, M. (2025). LEMoN: Label Error Detection using Multimodal Neighbors. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:74447-74489. Available from https://proceedings.mlr.press/v267/zhang25b.html.