Multi-Modal Object Re-identification via Sparse Mixture-of-Experts

Yingying Feng, Jie Li, Chi Xie, Lei Tan, Jiayi Ji
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16796-16807, 2025.

Abstract

We present MFRNet, a novel network for multi-modal object re-identification that integrates multi-modal data features to effectively retrieve specific objects across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with gains of 8.4% mAP and 6.9% accuracy on RGBNT201 at negligible additional parameter cost.
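
The title names sparse Mixture-of-Experts routing as the underlying mechanism, although the abstract does not detail where it is placed in the network. As a point of reference only, the sketch below shows generic top-k sparse MoE routing in PyTorch; it is not the authors' MFRNet code, and the expert MLP shape, num_experts, and top_k values are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Generic sparse top-k MoE layer: each token is routed to its top_k experts."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router producing one logit per expert for every token.
        self.gate = nn.Linear(dim, num_experts)
        # Experts are small feed-forward blocks (illustrative shape).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. fused multi-modal token features.
        logits = self.gate(x)                              # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)     # keep only the top_k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalize over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = idx == e                                 # (B, T, top_k): where expert e was selected
            token_mask = hit.any(dim=-1)                   # (B, T): tokens routed to expert e
            if token_mask.any():
                w = (weights * hit).sum(dim=-1)[token_mask].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out

feats = torch.randn(2, 16, 256)                            # toy input
print(SparseMoE(dim=256)(feats).shape)                     # torch.Size([2, 16, 256])

The key property illustrated here is that each token activates only top_k of the num_experts sub-networks, so representational capacity grows with the number of experts while per-token compute and added parameters stay small.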

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-feng25i,
  title     = {Multi-Modal Object Re-identification via Sparse Mixture-of-Experts},
  author    = {Feng, Yingying and Li, Jie and Xie, Chi and Tan, Lei and Ji, Jiayi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {16796--16807},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/feng25i/feng25i.pdf},
  url       = {https://proceedings.mlr.press/v267/feng25i.html},
  abstract  = {We present MFRNet, a novel network for multi-modal object re-identification that integrates multi-modal data features to effectively retrieve specific objects across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with gains of 8.4% mAP and 6.9% accuracy on RGBNT201 at negligible additional parameter cost.}
}
Endnote
%0 Conference Paper
%T Multi-Modal Object Re-identification via Sparse Mixture-of-Experts
%A Yingying Feng
%A Jie Li
%A Chi Xie
%A Lei Tan
%A Jiayi Ji
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-feng25i
%I PMLR
%P 16796--16807
%U https://proceedings.mlr.press/v267/feng25i.html
%V 267
%X We present MFRNet, a novel network for multi-modal object re-identification that integrates multi-modal data features to effectively retrieve specific objects across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with gains of 8.4% mAP and 6.9% accuracy on RGBNT201 at negligible additional parameter cost.
APA
Feng, Y., Li, J., Xie, C., Tan, L., & Ji, J. (2025). Multi-Modal Object Re-identification via Sparse Mixture-of-Experts. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16796-16807. Available from https://proceedings.mlr.press/v267/feng25i.html.