An Empirical Study of Attention-Based Cross-Modal Retrieval for Movies
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:848-855, 2026.
Abstract
We present an empirical study of attention-based cross-modal retrieval for movies. Our approach combines text overviews, poster images, and trailer thumbnails using a cross-attention fusion module to learn unified item representations. To support this study, we augment MovieLens 1M with metadata from The Movie Database (TMDB), including overview text, poster images, and static trailer thumbnails. We evaluate text-only, image-only, and fused representations on top-K retrieval metrics, and compare them with interaction-only baselines based on Bayesian Personalized Ranking (BPR) and LightGCN. The results show that image-only retrieval achieves the strongest Recall@K and NDCG@K performance, while the fused model produces recommendations that are qualitatively more balanced semantically but does not outperform the strongest unimodal baseline. These findings suggest that attention-based multimodal fusion can improve recommendation coherence and interpretability, while also highlighting the challenge of translating cross-modal signals into stronger ranking performance.
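For readers unfamiliar with the top-K metrics named in the abstract, the following is a minimal sketch of Recall@K and NDCG@K for a single user, assuming binary relevance and the standard log2 discount. The function names and the toy item IDs are illustrative assumptions; the abstract does not specify the paper's exact evaluation protocol, which may differ (for example, in how held-out items are sampled or averaged across users).

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), etc.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


# Hypothetical example: item IDs are made up for illustration.
ranked_items = [10, 20, 30, 40]   # model's ranking, best first
held_out = {20, 40, 50}           # items the user actually interacted with
print(recall_at_k(ranked_items, held_out, k=3))
print(ndcg_at_k(ranked_items, held_out, k=3))
```

In this sketch, Recall@K rewards only whether relevant items appear in the top K, whereas NDCG@K also rewards placing them earlier in the list, which is why the two metrics can rank models differently.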