Fusing Reward and Dueling Feedback in Stochastic Bandits

Xuchuang Wang, Qirun Zeng, Jinhang Zuo, Xutong Liu, Mohammad Hajiesmaili, John C.S. Lui, Adam Wierman
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:65346-65367, 2025.

Abstract

This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
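The abstract describes the two fusion strategies only at a high level. As a rough illustration of the shared-candidate-set idea behind the elimination fusion algorithm, the Python sketch below keeps a single candidate set, explores every surviving arm with both feedback types in each pass, and removes an arm once either its reward confidence interval or its duel-based (Borda-style) confidence interval certifies it suboptimal. Every concrete choice here (the Hoeffding-style radius, the Borda-style duel score, the function names, the safety fallback) is an illustrative assumption, not the paper's exact algorithm or constants.

import numpy as np

def elimination_fusion_sketch(reward_fn, duel_fn, n_arms, horizon, delta=0.05):
    """Shared-candidate-set elimination using both feedback types (illustrative only).

    reward_fn(i)  -> stochastic reward of arm i in [0, 1]
    duel_fn(i, j) -> 1.0 if arm i beats arm j in a duel, else 0.0
    """
    S = list(range(n_arms))              # shared candidate arm set
    reward_sum = np.zeros(n_arms)
    reward_cnt = np.zeros(n_arms)
    wins = np.zeros((n_arms, n_arms))    # wins[i, j]: times i beat j
    duels = np.zeros((n_arms, n_arms))   # duels[i, j]: times i dueled j
    rng = np.random.default_rng()

    t = 0
    while t < horizon and len(S) > 1:
        # One exploration pass: pull each surviving arm and duel it against
        # a random other candidate, gathering both feedback types per round.
        for i in S:
            j = int(rng.choice([a for a in S if a != i]))
            reward_sum[i] += reward_fn(i)
            reward_cnt[i] += 1
            w = duel_fn(i, j)
            wins[i, j] += w
            wins[j, i] += 1 - w
            duels[i, j] += 1
            duels[j, i] += 1
            t += 1

        def rad(n):
            # Hoeffding-style confidence radius; a common choice of
            # constant, not necessarily the paper's.
            return np.sqrt(np.log(2 * n_arms * horizon / delta) / (2 * max(n, 1)))

        mu = reward_sum / np.maximum(reward_cnt, 1)
        # Borda-style duel score of i: average estimated probability that i
        # beats another candidate (this aligns with the best-mean arm under
        # common utility-based duel assumptions).
        borda = {i: np.mean([wins[i, j] / max(duels[i, j], 1)
                             for j in S if j != i]) for i in S}
        n_duel = {i: duels[i, :].sum() for i in S}

        best_mu_lcb = max(mu[i] - rad(reward_cnt[i]) for i in S)
        best_bd_lcb = max(borda[i] - rad(n_duel[i]) for i in S)
        # Unify the collected information: an arm stays only while neither
        # feedback type can certify it suboptimal.
        new_S = [i for i in S
                 if mu[i] + rad(reward_cnt[i]) >= best_mu_lcb
                 and borda[i] + rad(n_duel[i]) >= best_bd_lcb]
        # Safety fallback for the low-probability event that the two
        # feedback types eliminate everything.
        S = new_S if new_S else [max(S, key=lambda i: mu[i])]
    return S

# Toy check under a hypothetical setup: Bernoulli rewards with means mus,
# and arm i beats arm j with probability 0.5 + (mus[i] - mus[j]) / 2.
mus = [0.3, 0.5, 0.7, 0.4]
rng = np.random.default_rng(0)
reward_fn = lambda i: float(rng.random() < mus[i])
duel_fn = lambda i, j: float(rng.random() < 0.5 + (mus[i] - mus[j]) / 2)
print(elimination_fusion_sketch(reward_fn, duel_fn, n_arms=4, horizon=20000))
# Should typically print [2], the arm with the highest mean.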

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-wang25eh,
  title     = {Fusing Reward and Dueling Feedback in Stochastic Bandits},
  author    = {Wang, Xuchuang and Zeng, Qirun and Zuo, Jinhang and Liu, Xutong and Hajiesmaili, Mohammad and Lui, John C.S. and Wierman, Adam},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {65346--65367},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25eh/wang25eh.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25eh.html},
  abstract  = {This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.}
}
Endnote
%0 Conference Paper
%T Fusing Reward and Dueling Feedback in Stochastic Bandits
%A Xuchuang Wang
%A Qirun Zeng
%A Jinhang Zuo
%A Xutong Liu
%A Mohammad Hajiesmaili
%A John C.S. Lui
%A Adam Wierman
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25eh
%I PMLR
%P 65346--65367
%U https://proceedings.mlr.press/v267/wang25eh.html
%V 267
%X This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion experiences a suboptimal multiplicative term of the number of arms in regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
APA
Wang, X., Zeng, Q., Zuo, J., Liu, X., Hajiesmaili, M., Lui, J.C.S. & Wierman, A. (2025). Fusing Reward and Dueling Feedback in Stochastic Bandits. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:65346-65367. Available from https://proceedings.mlr.press/v267/wang25eh.html.