HD-MF: Hierarchical Dynamic-aware Multimodal Fusion for Fine-Grained Bird Recognition
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:604-613, 2025.
Abstract
Fine-grained bird recognition plays a crucial role in biodiversity monitoring. Its primary challenges are distinguishing subtle inter-class visual differences and overcoming the inherent limitations of unimodal information. Audio provides complementary cues, yet audiovisual fusion still faces difficulties such as the semantic gap between modalities. To address these challenges, this paper proposes a hierarchical dynamic-aware multimodal fusion (HD-MF) architecture. The architecture captures locally aligned cross-modal features with a Cross-modal Spatial Interaction Module, extracts global high-order cross-modal correlations with a Factorized Bilinear Fusion Module, and dynamically integrates the outputs of these two fusion branches through a Dynamically Adaptive Gated Fusion Unit. Evaluated on AViS, a paired audiovisual dataset constructed for this study, HD-MF achieves state-of-the-art performance. The experimental results demonstrate that HD-MF effectively integrates complementary audiovisual information, offering an effective approach to fine-grained bird recognition.
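As a rough illustration of the two ideas named in the abstract, the sketch below combines a generic low-rank (factorized) bilinear pooling step with a sigmoid gate that blends two fused feature vectors. It is not the authors' implementation: the function names, dimensions, random weights, and the stand-in `local_feat` vector (a placeholder for the output of a local cross-modal interaction branch) are all assumptions made for this example.

```python
import numpy as np

def factorized_bilinear(x, y, U, V, k):
    """Generic low-rank bilinear pooling (illustrative, not the paper's module).

    Projects both modalities, multiplies them elementwise, then sum-pools
    every k consecutive factors into one output dimension.
    """
    joint = (x @ U) * (y @ V)                    # shape (o * k,)
    pooled = joint.reshape(-1, k).sum(axis=1)    # shape (o,)
    # Signed square root and L2 normalization, commonly applied to bilinear features.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return pooled / (np.linalg.norm(pooled) + 1e-12)

def gated_fusion(local_feat, global_feat, W_g, b_g):
    """Per-dimension sigmoid gate that blends two feature vectors (assumed form)."""
    logits = np.concatenate([local_feat, global_feat]) @ W_g + b_g
    gate = 1.0 / (1.0 + np.exp(-logits))         # each value lies in (0, 1)
    fused = gate * local_feat + (1.0 - gate) * global_feat
    return fused, gate

# Hypothetical sizes: image dim, audio dim, output dim, number of factors.
rng = np.random.default_rng(0)
d_img, d_aud, o, k = 8, 6, 4, 3

img = rng.normal(size=d_img)                     # stand-in image embedding
aud = rng.normal(size=d_aud)                     # stand-in audio embedding
U = rng.normal(size=(d_img, o * k))
V = rng.normal(size=(d_aud, o * k))

global_feat = factorized_bilinear(img, aud, U, V, k)
local_feat = rng.normal(size=o)                  # placeholder for a local-interaction branch
W_g = rng.normal(size=(2 * o, o))
b_g = np.zeros(o)

fused, gate = gated_fusion(local_feat, global_feat, W_g, b_g)
print(fused.shape, gate.round(2))
```

Because the gate is computed from both inputs, the blend can lean toward either branch on a per-dimension, per-sample basis, which is the general sense in which such a fusion is "dynamic"; in a trained model the weights would be learned rather than sampled at random.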