Contextual Linear Bandits with Delay as Payoff

Mengxiao Zhang, Yingfei Wang, Haipeng Luo
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76246-76272, 2025.

Abstract

A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2025). Contrary to standard linear bandit algorithms that construct a least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the volumetric spanners of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
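
To make the high-level idea concrete, the sketch below shows a minimal, generic phased-elimination loop for linear bandits in the spirit the abstract describes. It is not the authors' algorithm: greedy_spanner is only a crude D-optimal-style stand-in for a true volumetric spanner, the pull function and all constants are hypothetical, and the payoff-dependent delay mechanism itself is abstracted away (each observation is simply assumed to arrive before its phase ends).

import numpy as np

def greedy_spanner(X):
    """Crude stand-in for a volumetric/barycentric spanner: greedily pick
    up to d arms that most increase the spanned volume (D-optimal style)."""
    d = X.shape[1]
    V = np.eye(d) * 1e-9
    chosen = []
    for _ in range(min(d, len(X))):
        # leverage-style score x^T V^{-1} x for every arm
        scores = np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X)
        i = int(np.argmax(scores))
        chosen.append(i)
        V += np.outer(X[i], X[i])
    return X[chosen]

def phased_elimination(arms, pull, T):
    """arms: (K, d) array of feature vectors; pull(x) returns a payoff
    (in the paper's model it would be observed only after a payoff-dependent delay)."""
    K, d = arms.shape
    active = list(range(K))
    theta_hat = np.zeros(d)
    t, phase = 0, 1
    while t < T:
        eps = 2.0 ** (-phase)                    # target accuracy for this phase
        support = greedy_spanner(arms[active])   # explore only a small spanning subset
        n = int(np.ceil(1.0 / eps ** 2))         # pulls per support arm (log factors omitted)
        V, b = np.eye(d) * 1e-6, np.zeros(d)
        for x in support:
            for _ in range(n):
                if t >= T:
                    break
                y = pull(x)                      # delayed feedback is ignored in this sketch
                V += np.outer(x, x)
                b += y * x
                t += 1
        theta_hat = np.linalg.solve(V, b)        # least-squares estimate from this phase only
        vals = arms[active] @ theta_hat
        # keep arms whose estimated reward gap is within the phase accuracy
        active = [a for a, v in zip(active, vals) if vals.max() - v <= 2 * eps]
        phase += 1
    return arms[active[int(np.argmax(arms[active] @ theta_hat))]]

For example, pull could be a simulator that returns a noisy linear payoff and only reveals it after a proportional delay; one reason phased schemes are attractive under such delays is that each phase can wait for its own observations before any arm is eliminated.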

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25ca,
  title     = {Contextual Linear Bandits with Delay as Payoff},
  author    = {Zhang, Mengxiao and Wang, Yingfei and Luo, Haipeng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76246--76272},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25ca/zhang25ca.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25ca.html},
  abstract  = {A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2025). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the volumetric spanners of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.}
}
Endnote
%0 Conference Paper
%T Contextual Linear Bandits with Delay as Payoff
%A Mengxiao Zhang
%A Yingfei Wang
%A Haipeng Luo
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25ca
%I PMLR
%P 76246--76272
%U https://proceedings.mlr.press/v267/zhang25ca.html
%V 267
%X A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2025). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the volumetric spanners of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
APA
Zhang, M., Wang, Y. & Luo, H. (2025). Contextual Linear Bandits with Delay as Payoff. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76246-76272. Available from https://proceedings.mlr.press/v267/zhang25ca.html.
