HD-MF: Hierarchical Dynamic-aware Multimodal Fusion for Fine-Grained Bird Recognition

Junjing Li, Xing Liu, Jiu Luo
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:604-613, 2025.

Abstract

Fine-grained bird recognition plays a crucial role in biodiversity monitoring. Its primary challenge lies in identifying subtle inter-class visual differences and overcoming the inherent limitations of unimodal information. Audio provides crucial complementary cues, yet audiovisual fusion still faces challenges such as the semantic gap. To address these challenges, this paper proposes a hierarchical dynamic-aware multimodal fusion (HD-MF) architecture. This architecture captures locally aligned cross-modal features via its Cross-modal Spatial Interaction Module, extracts global high-order cross-modal correlations using the Factorized Bilinear Fusion Module, and dynamically integrates the outputs of these two fusion approaches through a Dynamically Adaptive Gated Fusion Unit. Evaluated on AViS, a paired audiovisual dataset constructed for this study, HD-MF achieved state-of-the-art performance. Experimental results demonstrated that HD-MF effectively integrates audiovisual complementary information, providing a novel and effective approach for enhancing fine-grained bird recognition performance.
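
The abstract does not give equations for the fusion modules, so the following is only a minimal sketch of the two general techniques it names, factorized bilinear fusion and gated fusion, written in plain Python. Every name here (`factorized_bilinear`, `gated_fusion`, the matrices `U`, `V`, `Wg` and the bias `bg`) is a hypothetical stand-in, and the specific formulas (a Hadamard product of two linear projections, and a sigmoid gate over the concatenated inputs) are assumptions, not the authors' implementation.

```python
import math


def sigmoid(v):
    """Logistic function, used here to produce gate values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def factorized_bilinear(x, y, U, V):
    """Assumed form of factorized bilinear fusion.

    Each modality is projected into a shared space (U for x, V for y)
    and the projections are multiplied element-wise, which approximates
    a full bilinear interaction with far fewer parameters.
    """
    return [a * b for a, b in zip(matvec(U, x), matvec(V, y))]


def gated_fusion(local_feat, global_feat, Wg, bg):
    """Assumed form of a gated fusion unit.

    A per-dimension gate is computed from the concatenated inputs and
    then used to blend the two feature vectors: g * local + (1 - g) * global.
    """
    concat = local_feat + global_feat  # list concatenation
    gate = [sigmoid(s + b) for s, b in zip(matvec(Wg, concat), bg)]
    return [g * l + (1.0 - g) * h
            for g, l, h in zip(gate, local_feat, global_feat)]


if __name__ == "__main__":
    # Toy 2-dimensional "visual" and "audio" features (made-up numbers).
    visual, audio = [1.0, 2.0], [3.0, 4.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    global_feat = factorized_bilinear(visual, audio, identity, identity)

    # With zero gate weights the gate is 0.5 everywhere, so the output
    # is the plain average of the two inputs.
    zero_gate = [[0.0] * 4, [0.0] * 4]
    print(gated_fusion(visual, global_feat, zero_gate, [0.0, 0.0]))
```

In a trained model the gate weights would be learned, so the blend between the locally aligned and the global features could differ per sample and per feature dimension, which is one plausible reading of "dynamically integrates" in the abstract.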

Cite this Paper


BibTeX
@InProceedings{pmlr-v278-li25k,
  title     = {HD-MF: Hierarchical Dynamic-aware Multimodal Fusion for Fine-Grained Bird Recognition},
  author    = {Li, Junjing and Liu, Xing and Luo, Jiu},
  booktitle = {Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing},
  pages     = {604--613},
  year      = {2025},
  editor    = {Zeng, Nianyin and Pachori, Ram Bilas and Wang, Dongshu},
  volume    = {278},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v278/main/assets/li25k/li25k.pdf},
  url       = {https://proceedings.mlr.press/v278/li25k.html},
  abstract  = {Fine-grained bird recognition plays a crucial role in biodiversity monitoring. Its primary challenge lies in identifying subtle inter-class visual differences and overcoming the inherent limitations of unimodal information. Audio provides crucial complementary cues, yet audiovisual fusion still faces challenges such as the semantic gap. To address these challenges, this paper proposed a hierarchical dynamic-aware multimodal fusion (HD-MF) architecture. This architecture captures locally aligned cross-modal features via its Cross-modal Spatial Interaction Module, extracts global high-order cross-modal correlations using the Factorized Bilinear Fusion Module, and dynamically integrates the outputs of these two fusion approaches through a Dynamically Adaptive Gated Fusion Unit. Evaluated on AViS, a paired audiovisual dataset constructed for this study, HD-MF achieved state-of-the-art performance. Experimental results demonstrated that HD-MF effectively integrates audiovisual complementary information, providing a novel and effective approach for enhancing fine-grained bird recognition performance.}
}
Endnote
%0 Conference Paper
%T HD-MF: Hierarchical Dynamic-aware Multimodal Fusion for Fine-Grained Bird Recognition
%A Junjing Li
%A Xing Liu
%A Jiu Luo
%B Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing
%C Proceedings of Machine Learning Research
%D 2025
%E Nianyin Zeng
%E Ram Bilas Pachori
%E Dongshu Wang
%F pmlr-v278-li25k
%I PMLR
%P 604--613
%U https://proceedings.mlr.press/v278/li25k.html
%V 278
%X Fine-grained bird recognition plays a crucial role in biodiversity monitoring. Its primary challenge lies in identifying subtle inter-class visual differences and overcoming the inherent limitations of unimodal information. Audio provides crucial complementary cues, yet audiovisual fusion still faces challenges such as the semantic gap. To address these challenges, this paper proposed a hierarchical dynamic-aware multimodal fusion (HD-MF) architecture. This architecture captures locally aligned cross-modal features via its Cross-modal Spatial Interaction Module, extracts global high-order cross-modal correlations using the Factorized Bilinear Fusion Module, and dynamically integrates the outputs of these two fusion approaches through a Dynamically Adaptive Gated Fusion Unit. Evaluated on AViS, a paired audiovisual dataset constructed for this study, HD-MF achieved state-of-the-art performance. Experimental results demonstrated that HD-MF effectively integrates audiovisual complementary information, providing a novel and effective approach for enhancing fine-grained bird recognition performance.
APA
Li, J., Liu, X. & Luo, J. (2025). HD-MF: Hierarchical Dynamic-aware Multimodal Fusion for Fine-Grained Bird Recognition. Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, in Proceedings of Machine Learning Research 278:604-613. Available from https://proceedings.mlr.press/v278/li25k.html.
