SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval

Minyoung Kim
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:9167-9190, 2023.

Abstract

We tackle the cross-modal retrieval problem, where learning is supervised only by the relevant multi-modal pairs in the data. Although contrastive learning is the most popular approach for this task, it makes the potentially wrong assumption that instances from different pairs are automatically irrelevant. To address this issue, we propose a novel loss function based on self-labeling of the unknown semantic classes. Specifically, we predict class labels for the data instances in each modality and assign those labels to the corresponding instances in the other modality (i.e., we swap the pseudo-labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We test our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. On all these tasks, our method achieves significant performance improvements over contrastive learning.
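
The abstract describes the mechanism only at a high level. The sketch below illustrates one plausible form of a swapped-assignment loss for two modalities, assuming shared learnable class prototypes; plain softmax pseudo-labels stand in for whatever self-labeling procedure the paper actually uses, and all names, shapes, and hyperparameters here are hypothetical rather than the paper's implementation.

    # Minimal sketch of a swapped-assignment loss for two modalities.
    # Assumption: both encoders map into a shared space, and class scores
    # come from a shared set of learnable prototype vectors.
    import torch
    import torch.nn.functional as F

    def swapped_assignment_loss(z_a, z_b, prototypes, temperature=0.1):
        """z_a, z_b: L2-normalized embeddings of paired instances from the
        two modalities, shape (batch, dim); prototypes: (num_classes, dim)."""
        # Class logits for each modality against the shared prototypes.
        logits_a = z_a @ prototypes.t() / temperature
        logits_b = z_b @ prototypes.t() / temperature
        # Pseudo-labels: soft class assignments predicted within each
        # modality, detached so each modality is supervised only by the
        # other modality's labels (the "swap").
        q_a = F.softmax(logits_a, dim=1).detach()
        q_b = F.softmax(logits_b, dim=1).detach()
        # Cross-entropy with swapped targets: modality A is trained toward
        # B's assignments and vice versa.
        loss_a = -(q_b * F.log_softmax(logits_a, dim=1)).sum(dim=1).mean()
        loss_b = -(q_a * F.log_softmax(logits_b, dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_a + loss_b)

    # Toy usage with random embeddings in place of real encoders.
    batch, dim, num_classes = 8, 64, 16
    z_text = F.normalize(torch.randn(batch, dim), dim=1)
    z_video = F.normalize(torch.randn(batch, dim), dim=1)
    prototypes = F.normalize(torch.randn(num_classes, dim), dim=1)
    print(swapped_assignment_loss(z_text, z_video, prototypes))

Because the pseudo-labels are produced by a class predictor shared across pairs, two instances from different pairs that land on the same class are pulled together, which is exactly the cross-pair alignment that a purely pairwise contrastive loss cannot express.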

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-kim23e,
  title     = {SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval},
  author    = {Kim, Minyoung},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {9167--9190},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/kim23e/kim23e.pdf},
  url       = {https://proceedings.mlr.press/v206/kim23e.html}
}