PISDR: Page and Item Sequential Decision for Re-ranking Based on Offline Reinforcement Learning

Zheng Yuan, Qian Wan, Tao Zhang, Chengfu Huo
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:829-844, 2025.

Abstract

Re-ranking is the final stage of a multi-stage recommendation system: it reorders candidate lists based on historical user behavior to better align with user preferences. Offline Reinforcement Learning (RL) has been employed in both the prediction and ranking phases of recommendation systems to optimize long-term objectives, surpassing the efficacy of supervised learning. However, extrapolation error is a common problem in offline RL: the biased distribution of features can reduce recommendation accuracy. As users browse an e-commerce app, their preferences are influenced by previously recommended items and pages, so this browsing history can be used to correct the bias of offline RL. This paper models re-ranking with offline RL and presents a re-ranking algorithm named Page and Item Sequential Decision for Re-ranking (PISDR), which improves accuracy by correcting bias at two levels (pages and items). PISDR employs sequential RL, leveraging a session-level data structure that encapsulates global information at the page level as well as item-level interrelationships. Additionally, PISDR uses a multi-tower critic network to assess multiple feedback metrics, such as click-through rate and conversion rate, which guide the actor network toward long-term reward. Experimental results validate the effectiveness of PISDR: its generated re-ranking sequences improve Area Under the Curve (AUC), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) by about 1.4% over current state-of-the-art re-ranking algorithms. Finally, our method achieves a significant improvement of 2.59% in Click-Through Rate (CTR) over an industrial-level ranking model in online A/B tests.
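For illustration only, the Python (PyTorch) sketch below shows one plausible way to realize the multi-tower critic idea described in the abstract: a simple re-ranking actor paired with a critic that holds one tower per feedback signal (here CTR and CVR). All names, layer sizes, and the choice to feed the actor's item scores to the critic as the "action" are our own illustrative assumptions, not the authors' implementation.

# Hypothetical sketch (not the authors' code): a multi-tower critic scoring a
# re-ranked page against several feedback signals, plus a minimal actor that
# scores candidate items for re-ranking. Dimensions and heads are assumptions.
import torch
import torch.nn as nn

class MultiTowerCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64, heads=("ctr", "cvr")):
        super().__init__()
        # One tower per feedback metric; each maps (state, action) to a scalar.
        self.towers = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for name in heads
        })

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return {name: tower(x).squeeze(-1) for name, tower in self.towers.items()}

class Actor(nn.Module):
    def __init__(self, state_dim, item_dim, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(state_dim + item_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, items):
        # state: (B, state_dim) session/page context; items: (B, N, item_dim).
        expanded = state.unsqueeze(1).expand(-1, items.size(1), -1)
        return self.scorer(torch.cat([expanded, items], dim=-1)).squeeze(-1)

# Toy usage: re-rank 10 candidate items and read per-metric critic values.
state = torch.randn(2, 32)         # assumed 32-d session-level state
items = torch.randn(2, 10, 16)     # 10 candidates, assumed 16-d item features
actor, critic = Actor(32, 16), MultiTowerCritic(32, 10)
scores = actor(state, items)                       # higher score -> earlier slot
order = scores.argsort(dim=-1, descending=True)    # re-ranked page order
values = critic(state, scores)                     # {"ctr": ..., "cvr": ...}

In an actor-critic setup of this kind, the per-metric critic outputs would be combined into a long-term reward signal used to train the actor; how PISDR combines them is described in the paper itself.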

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-yuan25a, title = {{PISDR}: {P}age and Item Sequential Decision for Re-ranking Based on Offline Reinforcement Learning}, author = {Yuan, Zheng and Wan, Qian and Zhang, Tao and Huo, Chengfu}, booktitle = {Proceedings of the 16th Asian Conference on Machine Learning}, pages = {829--844}, year = {2025}, editor = {Nguyen, Vu and Lin, Hsuan-Tien}, volume = {260}, series = {Proceedings of Machine Learning Research}, month = {05--08 Dec}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/yuan25a/yuan25a.pdf}, url = {https://proceedings.mlr.press/v260/yuan25a.html}, abstract = {Re-ranking is the final stage of a multi-stage recommendation system: it reorders candidate lists based on historical user behavior to better align with user preferences. Offline Reinforcement Learning (RL) has been employed in both the prediction and ranking phases of recommendation systems to optimize long-term objectives, surpassing the efficacy of supervised learning. However, extrapolation error is a common problem in offline RL: the biased distribution of features can reduce recommendation accuracy. As users browse an e-commerce app, their preferences are influenced by previously recommended items and pages, so this browsing history can be used to correct the bias of offline RL. This paper models re-ranking with offline RL and presents a re-ranking algorithm named Page and Item Sequential Decision for Re-ranking (PISDR), which improves accuracy by correcting bias at two levels (pages and items). PISDR employs sequential RL, leveraging a session-level data structure that encapsulates global information at the page level as well as item-level interrelationships. Additionally, PISDR uses a multi-tower critic network to assess multiple feedback metrics, such as click-through rate and conversion rate, which guide the actor network toward long-term reward. Experimental results validate the effectiveness of PISDR: its generated re-ranking sequences improve Area Under the Curve (AUC), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) by about 1.4% over current state-of-the-art re-ranking algorithms. Finally, our method achieves a significant improvement of 2.59% in Click-Through Rate (CTR) over an industrial-level ranking model in online A/B tests.} }
Endnote
%0 Conference Paper %T PISDR: Page and Item Sequential Decision for Re-ranking Based on Offline Reinforcement Learning %A Zheng Yuan %A Qian Wan %A Tao Zhang %A Chengfu Huo %B Proceedings of the 16th Asian Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Vu Nguyen %E Hsuan-Tien Lin %F pmlr-v260-yuan25a %I PMLR %P 829--844 %U https://proceedings.mlr.press/v260/yuan25a.html %V 260 %X Re-ranking is the final stage of a multi-stage recommendation system: it reorders candidate lists based on historical user behavior to better align with user preferences. Offline Reinforcement Learning (RL) has been employed in both the prediction and ranking phases of recommendation systems to optimize long-term objectives, surpassing the efficacy of supervised learning. However, extrapolation error is a common problem in offline RL: the biased distribution of features can reduce recommendation accuracy. As users browse an e-commerce app, their preferences are influenced by previously recommended items and pages, so this browsing history can be used to correct the bias of offline RL. This paper models re-ranking with offline RL and presents a re-ranking algorithm named Page and Item Sequential Decision for Re-ranking (PISDR), which improves accuracy by correcting bias at two levels (pages and items). PISDR employs sequential RL, leveraging a session-level data structure that encapsulates global information at the page level as well as item-level interrelationships. Additionally, PISDR uses a multi-tower critic network to assess multiple feedback metrics, such as click-through rate and conversion rate, which guide the actor network toward long-term reward. Experimental results validate the effectiveness of PISDR: its generated re-ranking sequences improve Area Under the Curve (AUC), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG) by about 1.4% over current state-of-the-art re-ranking algorithms. Finally, our method achieves a significant improvement of 2.59% in Click-Through Rate (CTR) over an industrial-level ranking model in online A/B tests.
APA
Yuan, Z., Wan, Q., Zhang, T., & Huo, C. (2025). PISDR: Page and Item Sequential Decision for Re-ranking Based on Offline Reinforcement Learning. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:829-844. Available from https://proceedings.mlr.press/v260/yuan25a.html.
