MEFN: A Multi-scale Entropy-aware Fusion Network For Image-Text Retrieval

Jinjin Liu, Changchang Fan
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:768-778, 2025.

Abstract

Image-Text Retrieval (ITR), a crucial task in multi-modal learning, aims to achieve cross-modal information retrieval through semantic alignment and matching between images and text. With the advancement of deep learning, significant progress has been made in the accuracy and efficiency of ITR methods. However, existing approaches still face challenges such as modality heterogeneity, information redundancy, and insufficient multi-scale feature alignment between images and text. To address these issues, this paper proposes an image-text retrieval method based on a Multi-scale Entropy-aware Fusion Network (MEFN). By introducing entropy-aware modeling and multi-scale attention mechanisms, the method strengthens the correlation between image and text features and thereby improves cross-modal semantic matching. Specifically, MEFN first guides the fusion of image and text features through an entropy-aware model, then models multi-scale features in fine detail with local and global attention mechanisms to produce efficient image-text fusion representations. Experimental results on benchmark datasets such as Flickr30K and MSCOCO demonstrate that MEFN significantly improves the accuracy and robustness of image-text retrieval over mainstream methods, performing particularly well in fine-grained object matching and complex scenes. This study offers a new perspective on image-text retrieval and holds promise for extension to multilingual image-text retrieval and video-text retrieval.
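
To make the two-stage pipeline described in the abstract concrete, below is a minimal sketch of the stages it names: an entropy-aware fusion step followed by local (windowed) and global attention over the fused features. The paper's actual architecture, layer sizes, and gating formula are not given on this page, so every module name, dimension, and the specific entropy gate here are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: the abstract does not specify the implementation,
# so all names, shapes, and the entropy-gating rule below are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def softmax_entropy(logits, dim=-1, eps=1e-8):
    """Shannon entropy of a softmax distribution, one value per row."""
    p = F.softmax(logits, dim=dim)
    return -(p * (p + eps).log()).sum(dim=dim)


class EntropyAwareFusion(nn.Module):
    """Gate image-to-text fusion by the entropy of the cross-attention weights.

    Regions with confident (low-entropy) alignments to text tokens contribute
    more to the fused feature; diffuse (high-entropy) alignments are damped.
    """

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, img, txt):  # img: (B, R, D) region feats, txt: (B, T, D) token feats
        attn_logits = self.q(img) @ self.k(txt).transpose(1, 2) / img.size(-1) ** 0.5
        attended = F.softmax(attn_logits, dim=-1) @ self.v(txt)      # (B, R, D)
        ent = softmax_entropy(attn_logits)                           # (B, R)
        gate = 1.0 - ent / math.log(txt.size(1))                     # in [0, 1]; assumes T > 1
        return img + gate.unsqueeze(-1) * attended


class MultiScaleAttention(nn.Module):
    """Local (windowed) plus global self-attention over the fused sequence."""

    def __init__(self, dim, heads=4, window=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, N, D), N divisible by window
        B, N, D = x.shape
        # Local scale: attention restricted to non-overlapping windows.
        w = x.reshape(B * N // self.window, self.window, D)
        local, _ = self.local_attn(w, w, w)
        local = local.reshape(B, N, D)
        # Global scale: attention over the full sequence.
        global_out, _ = self.global_attn(x, x, x)
        return x + local + global_out


if __name__ == "__main__":
    B, R, T, D = 2, 8, 12, 256
    img = torch.randn(B, R, D)   # e.g. region features from an image encoder
    txt = torch.randn(B, T, D)   # e.g. token features from a text encoder
    fused = EntropyAwareFusion(D)(img, txt)                 # (B, R, D)
    out = MultiScaleAttention(D, heads=4, window=4)(fused)  # (B, R, D)
    print(out.shape)

In this sketch the gate down-weights image regions whose attention over text tokens is diffuse (high entropy) and preserves confident alignments, while the windowed and full-sequence attention branches stand in for fine-grained and scene-level interactions; the paper's actual realization of entropy-aware fusion and multi-scale attention may differ.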

Cite this Paper


BibTeX
@InProceedings{pmlr-v278-liu25f,
  title     = {MEFN: A Multi-scale Entropy-aware Fusion Network For Image-Text Retrieval},
  author    = {Liu, Jinjin and Fan, Changchang},
  booktitle = {Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing},
  pages     = {768--778},
  year      = {2025},
  editor    = {Zeng, Nianyin and Pachori, Ram Bilas and Wang, Dongshu},
  volume    = {278},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v278/main/assets/liu25f/liu25f.pdf},
  url       = {https://proceedings.mlr.press/v278/liu25f.html}
}
Endnote
%0 Conference Paper
%T MEFN: A Multi-scale Entropy-aware Fusion Network For Image-Text Retrieval
%A Jinjin Liu
%A Changchang Fan
%B Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing
%C Proceedings of Machine Learning Research
%D 2025
%E Nianyin Zeng
%E Ram Bilas Pachori
%E Dongshu Wang
%F pmlr-v278-liu25f
%I PMLR
%P 768--778
%U https://proceedings.mlr.press/v278/liu25f.html
%V 278
APA
Liu, J. & Fan, C. (2025). MEFN: A Multi-scale Entropy-aware Fusion Network For Image-Text Retrieval. Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, in Proceedings of Machine Learning Research 278:768-778. Available from https://proceedings.mlr.press/v278/liu25f.html.
