An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies

Mohamed Elrfaey, Thomas Aaron Gulliver
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:848-855, 2026.

Abstract

We present an empirical study of attention-based cross-modal retrieval for movies. Our approach combines text overviews, poster images, and trailer thumbnails using a cross-attention fusion module to learn unified item representations. To support this study, we augment MovieLens 1M with metadata from The Movie Database (TMDB), including overview text, poster images, and static trailer thumbnails. We evaluate text-only, image-only, and fused representations on top-K retrieval metrics, and compare them with interaction-only baselines based on Bayesian Personalized Ranking (BPR) and LightGCN. The results show that image-only retrieval achieves the strongest Recall@K and NDCG@K performance, while the fused model produces qualitatively more semantically balanced recommendations but does not outperform the strongest unimodal baseline. These findings suggest that attention-based multimodal fusion can improve recommendation coherence and interpretability, while also highlighting the challenge of translating cross-modal signals into stronger ranking performance.
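
The abstract reports results in terms of Recall@K and NDCG@K. As a rough illustration of what those top-K metrics measure, here is a minimal sketch assuming binary relevance (an item is either in a user's held-out set or not). The function names, the toy movie IDs, and the binary-relevance choice are assumptions made for illustration; the paper's actual evaluation protocol is not described on this page.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if k <= 0 or not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    if k <= 0 or not relevant_items:
        return 0.0
    # Each hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Hypothetical example: a ranked list for one user and that user's held-out items.
ranked = ["m3", "m7", "m1", "m9", "m4"]
relevant = {"m7", "m4", "m8"}

print(recall_at_k(ranked, relevant, k=5))
print(ndcg_at_k(ranked, relevant, k=5))
```

Recall@K only counts how many relevant items land in the top K, while NDCG@K also rewards placing them nearer the top, so the two can rank models differently.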

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-elrfaey26a,
  title = {An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies},
  author = {Elrfaey, Mohamed and Gulliver, Thomas Aaron},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {848--855},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/elrfaey26a/elrfaey26a.pdf},
  url = {https://proceedings.mlr.press/v318/elrfaey26a.html},
  abstract = {We present an empirical study of attention-based cross-modal retrieval for movies. Our approach combines text overviews, poster images, and trailer thumbnails using a cross-attention fusion module to learn unified item representations. To support this study, we augment MovieLens 1M with metadata from The Movie Database (TMDB), including overview text, poster images, and static trailer thumbnails. We evaluate text-only, image-only, and fused representations on top-K retrieval metrics, and compare them with interaction-only baselines based on Bayesian Personalized Ranking (BPR) and LightGCN. The results show that image-only retrieval achieves the strongest Recall@K and NDCG@K performance, while the fused model produces qualitatively more semantically balanced recommendations but does not outperform the strongest unimodal baseline. These findings suggest that attention-based multimodal fusion can improve recommendation coherence and interpretability, while also highlighting the challenge of translating cross-modal signals into stronger ranking performance.}
}
Endnote
%0 Conference Paper
%T An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies
%A Mohamed Elrfaey
%A Thomas Aaron Gulliver
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-elrfaey26a
%I PMLR
%P 848--855
%U https://proceedings.mlr.press/v318/elrfaey26a.html
%V 318
%X We present an empirical study of attention-based cross-modal retrieval for movies. Our approach combines text overviews, poster images, and trailer thumbnails using a cross-attention fusion module to learn unified item representations. To support this study, we augment MovieLens 1M with metadata from The Movie Database (TMDB), including overview text, poster images, and static trailer thumbnails. We evaluate text-only, image-only, and fused representations on top-K retrieval metrics, and compare them with interaction-only baselines based on Bayesian Personalized Ranking (BPR) and LightGCN. The results show that image-only retrieval achieves the strongest Recall@K and NDCG@K performance, while the fused model produces qualitatively more semantically balanced recommendations but does not outperform the strongest unimodal baseline. These findings suggest that attention-based multimodal fusion can improve recommendation coherence and interpretability, while also highlighting the challenge of translating cross-modal signals into stronger ranking performance.
APA
Elrfaey, M. & Gulliver, T.A. (2026). An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:848-855. Available from https://proceedings.mlr.press/v318/elrfaey26a.html.
