MTSTRec: Multimodal Time-Aligned Shared Token Recommender
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23640-23661, 2025.
Abstract
Sequential recommendation in e-commerce uses users’ anonymous browsing histories to personalize product suggestions without relying on private information. Existing item ID-based methods and multimodal models often overlook the temporal alignment of modalities such as textual descriptions, visual content, and prices within user browsing sequences. To address this limitation, this paper proposes the Multimodal Time-aligned Shared Token Recommender (MTSTRec), a transformer-based framework that introduces a single time-aligned shared token per product for efficient cross-modality fusion. MTSTRec preserves the distinct contributions of each modality while aligning them temporally to better capture user preferences. Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion strategies. Our code is available at https://github.com/idssplab/MTSTRec.
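To make the core idea concrete, the sketch below illustrates (in PyTorch) how a single learnable shared token could attend to modality embeddings of the same product at each time step, yielding one fused representation per position. This is a minimal conceptual illustration, not the authors' implementation; the module name `SharedTokenFusion`, the dimensions, and the use of `nn.MultiheadAttention` are assumptions for exposition only.

```python
import torch
import torch.nn as nn


class SharedTokenFusion(nn.Module):
    """Conceptual sketch: a single shared token per time step gathers
    information from time-aligned modality encodings (text, image, price).
    Hypothetical illustration; not the official MTSTRec code."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # one learnable shared token, reused at every position in the sequence
        self.shared_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_emb, image_emb, price_emb):
        # each input: (batch, seq_len, d_model), aligned on the same time axis
        B, T, D = text_emb.shape
        fused = []
        for t in range(T):
            # stack the three modality embeddings of the same product/time step
            modalities = torch.stack(
                [text_emb[:, t], image_emb[:, t], price_emb[:, t]], dim=1
            )  # (B, 3, D)
            query = self.shared_token.expand(B, 1, D)
            # the shared token attends over the modalities at this position
            token, _ = self.attn(query, modalities, modalities)
            fused.append(token)
        # (B, T, D): one fused token per product, preserving temporal order
        return torch.cat(fused, dim=1)


if __name__ == "__main__":
    B, T, D = 2, 5, 64
    fusion = SharedTokenFusion(d_model=D)
    out = fusion(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 5, 64])
```

The fused per-product tokens could then be fed to a standard sequential transformer backbone for next-item prediction; the actual fusion and alignment details are specified in the paper and repository linked above.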