Learning Deep Semantic Embeddings for Cross-Modal Retrieval

Cuicui Kang, Shengcai Liao, Zhen Li, Zigang Cao, Gang Xiong
Proceedings of the Ninth Asian Conference on Machine Learning, PMLR 77:471-486, 2017.

Abstract

Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to result in large intra-class variances, which makes it not well suited to cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is trained in an end-to-end manner for image-text cross-modal retrieval. With images and texts mapped to a common feature embedding space, class labels are used to guide the embedding learning, so that the embedding space has a semantic meaning shared by both images and texts. In this way, the difference between modalities is eliminated. Under this framework, the center loss is introduced beyond the commonly used softmax cross-entropy loss to achieve both inter-class separation and intra-class compactness. In addition, a distance-based softmax cross-entropy loss is proposed to jointly consider the softmax cross-entropy and center losses in fully gradient-based learning. Experiments on three popular image-text cross-modal retrieval databases show that the proposed algorithms achieve the best overall performance.
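
The abstract describes two loss designs over a shared image/text embedding space: softmax cross-entropy combined with a center loss, and a distance-based softmax cross-entropy. The following PyTorch sketch (not the authors' released code) illustrates both ideas; the embedding dimension, the center-loss weight, and the exact form of the distance-based softmax (negative squared distances to class centers used as logits) are assumptions made here for illustration only.

# Minimal sketch, assuming a shared embedding space and one learnable
# center per semantic class (following the standard center loss of
# Wen et al., 2016). Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingLoss(nn.Module):
    def __init__(self, num_classes, embed_dim, center_weight=0.1):
        super().__init__()
        # One center per class, shared by both modalities.
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.classifier = nn.Linear(embed_dim, num_classes)
        self.center_weight = center_weight  # assumed trade-off weight

    def forward(self, embeddings, labels):
        # Softmax cross-entropy: encourages inter-class separation.
        ce = F.cross_entropy(self.classifier(embeddings), labels)
        # Center loss: pulls each embedding toward its class center,
        # encouraging intra-class compactness.
        center = ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean() / 2
        return ce + self.center_weight * center

def distance_softmax_loss(embeddings, centers, labels):
    # Assumed form of the distance-based softmax cross-entropy: treat
    # negative squared distances to the class centers as logits, so one
    # cross-entropy term both attracts samples to their own center and
    # repels them from the others.
    dists = torch.cdist(embeddings, centers) ** 2
    return F.cross_entropy(-dists, labels)

if __name__ == "__main__":
    torch.manual_seed(0)
    loss_fn = SharedEmbeddingLoss(num_classes=10, embed_dim=64)
    img = torch.randn(8, 64)   # image embeddings from an image branch
    txt = torch.randn(8, 64)   # text embeddings from a text branch
    y = torch.randint(0, 10, (8,))
    # Both modalities share labels, centers, and classifier, which ties
    # them to one semantic embedding space.
    total = loss_fn(img, y) + loss_fn(txt, y)
    total = total + distance_softmax_loss(img, loss_fn.centers, y)
    print(total.item())

Because the class centers and classifier are shared across modalities, gradients from both image and text batches shape the same semantic space, which is the mechanism the abstract relies on to eliminate the modality gap.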

Cite this Paper


BibTeX
@InProceedings{pmlr-v77-kang17a,
  title     = {Learning Deep Semantic Embeddings for Cross-Modal Retrieval},
  author    = {Kang, Cuicui and Liao, Shengcai and Li, Zhen and Cao, Zigang and Xiong, Gang},
  booktitle = {Proceedings of the Ninth Asian Conference on Machine Learning},
  pages     = {471--486},
  year      = {2017},
  editor    = {Zhang, Min-Ling and Noh, Yung-Kyun},
  volume    = {77},
  series    = {Proceedings of Machine Learning Research},
  address   = {Yonsei University, Seoul, Republic of Korea},
  month     = {15--17 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v77/kang17a/kang17a.pdf},
  url       = {https://proceedings.mlr.press/v77/kang17a.html},
  abstract  = {Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to result in large intra-class variances, which makes it not well suited to cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is trained in an end-to-end manner for image-text cross-modal retrieval. With images and texts mapped to a common feature embedding space, class labels are used to guide the embedding learning, so that the embedding space has a semantic meaning shared by both images and texts. In this way, the difference between modalities is eliminated. Under this framework, the center loss is introduced beyond the commonly used softmax cross-entropy loss to achieve both inter-class separation and intra-class compactness. In addition, a distance-based softmax cross-entropy loss is proposed to jointly consider the softmax cross-entropy and center losses in fully gradient-based learning. Experiments on three popular image-text cross-modal retrieval databases show that the proposed algorithms achieve the best overall performance.}
}
Endnote
%0 Conference Paper
%T Learning Deep Semantic Embeddings for Cross-Modal Retrieval
%A Cuicui Kang
%A Shengcai Liao
%A Zhen Li
%A Zigang Cao
%A Gang Xiong
%B Proceedings of the Ninth Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Min-Ling Zhang
%E Yung-Kyun Noh
%F pmlr-v77-kang17a
%I PMLR
%P 471--486
%U https://proceedings.mlr.press/v77/kang17a.html
%V 77
%X Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to result in large intra-class variances, which makes it not well suited to cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is trained in an end-to-end manner for image-text cross-modal retrieval. With images and texts mapped to a common feature embedding space, class labels are used to guide the embedding learning, so that the embedding space has a semantic meaning shared by both images and texts. In this way, the difference between modalities is eliminated. Under this framework, the center loss is introduced beyond the commonly used softmax cross-entropy loss to achieve both inter-class separation and intra-class compactness. In addition, a distance-based softmax cross-entropy loss is proposed to jointly consider the softmax cross-entropy and center losses in fully gradient-based learning. Experiments on three popular image-text cross-modal retrieval databases show that the proposed algorithms achieve the best overall performance.
APA
Kang, C., Liao, S., Li, Z., Cao, Z., & Xiong, G. (2017). Learning Deep Semantic Embeddings for Cross-Modal Retrieval. Proceedings of the Ninth Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 77:471-486. Available from https://proceedings.mlr.press/v77/kang17a.html.