Offline Learning for Combinatorial Multi-armed Bandits

Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John C.S. Lui, Wei Chen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:38251-38289, 2025.

Abstract

The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial cost of online interactions and the ready availability of offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
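
To make the pessimism-plus-solver idea from the abstract concrete, below is a minimal Python sketch for the simplest special case: linear rewards under a top-k cardinality constraint, where the combinatorial solver reduces to picking the k base arms with the largest pessimistic estimates. This is an illustrative assumption, not the paper's exact CLCB algorithm; the function name clcb_sketch and the Hoeffding-style confidence radius are ours.

import numpy as np

def clcb_sketch(counts, means, k, delta=0.05):
    """Illustrative pessimism-then-solve sketch (not the paper's exact CLCB).

    counts[i] -- number of offline observations of base arm i
    means[i]  -- empirical mean reward of base arm i in the offline data
    k         -- cardinality constraint on the combinatorial action
    """
    # Pessimistic (lower confidence bound) estimate per base arm.
    # A Hoeffding-style radius is assumed here purely for illustration.
    radius = np.sqrt(np.log(2 * len(means) / delta) / (2 * np.maximum(counts, 1)))
    lcb = means - radius

    # Combinatorial solver: for a linear reward with a top-k constraint,
    # this is just the k arms with the largest pessimistic estimates.
    return np.argsort(lcb)[-k:]

# Toy usage: six base arms observed unevenly in an offline dataset.
counts = np.array([500, 400, 30, 5, 300, 50])
means = np.array([0.60, 0.55, 0.90, 0.95, 0.50, 0.70])
print(clcb_sketch(counts, means, k=2))  # prints [0 2]

Note how pessimism handles poor coverage: arm 3 has the highest empirical mean (0.95) but only 5 observations, so its wide confidence radius pushes its lower bound below that of the well-covered arm 0, and the solver avoids it.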

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25i,
  title     = {Offline Learning for Combinatorial Multi-armed Bandits},
  author    = {Liu, Xutong and Dai, Xiangxiang and Zuo, Jinhang and Wang, Siwei and Joe-Wong, Carlee and Lui, John C.S. and Chen, Wei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {38251--38289},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25i/liu25i.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25i.html},
  abstract  = {The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial cost of online interactions and the ready availability of offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.}
}
Endnote
%0 Conference Paper
%T Offline Learning for Combinatorial Multi-armed Bandits
%A Xutong Liu
%A Xiangxiang Dai
%A Jinhang Zuo
%A Siwei Wang
%A Carlee Joe-Wong
%A John C.S. Lui
%A Wei Chen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25i
%I PMLR
%P 38251--38289
%U https://proceedings.mlr.press/v267/liu25i.html
%V 267
%X The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial cost of online interactions and the ready availability of offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
APA
Liu, X., Dai, X., Zuo, J., Wang, S., Joe-Wong, C., Lui, J.C.S., & Chen, W. (2025). Offline Learning for Combinatorial Multi-armed Bandits. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:38251-38289. Available from https://proceedings.mlr.press/v267/liu25i.html.