Factored-Reward Bandits with Intermediate Observations

Marco Mussi, Simone Drago, Marcello Restelli, Alberto Maria Metelli
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36911-36952, 2024.

Abstract

In several real-world sequential decision problems, the learner must, at every step, select multiple actions; each action affects a specific part of the system and generates an observable intermediate effect. In this paper, we introduce Factored-Reward Bandits (FRBs), a novel setting that captures and exploits the structure of this class of scenarios, in which the reward is the product of the actions' intermediate observations. We characterize the statistical complexity of learning in FRBs by deriving worst-case and asymptotic instance-dependent regret lower bounds. We then devise and analyze two regret-minimization algorithms. The former, F-UCB, is an anytime optimistic approach that matches the worst-case lower bound (up to logarithmic factors) but fails to perform optimally from the instance-dependent perspective. The latter, F-Track, is a bound-tracking approach that enjoys optimal asymptotic instance-dependent regret guarantees.
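To make the setting concrete, here is a rough formalization under assumed notation (ours, not necessarily the paper's): at each round $t$ the learner picks one action $a_{i,t}$ per component $i \in \{1, \dots, d\}$ and observes an intermediate outcome $x_{i,t}$ with mean $\mu_i(a_{i,t})$. Assuming independent noise across components, the reward and its expectation factorize as

$$ r_t \,=\, \prod_{i=1}^{d} x_{i,t}, \qquad \mathbb{E}[r_t] \,=\, \prod_{i=1}^{d} \mu_i(a_{i,t}), $$

so the (expected) regret after $T$ rounds compares against the componentwise-optimal means $\mu_i^{*} = \max_{a} \mu_i(a)$:

$$ R_T \,=\, \sum_{t=1}^{T} \Big( \prod_{i=1}^{d} \mu_i^{*} \,-\, \prod_{i=1}^{d} \mu_i(a_{i,t}) \Big). $$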
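Since the abstract describes F-UCB only as an anytime optimistic approach, the following is a minimal illustrative sketch of per-component optimism, not the paper's actual algorithm: each (component, action) pair keeps its own UCB index, the learner plays the per-component argmax, and the reward is the product of the observed intermediate effects. The class name FactoredUCB and the exploration constant c are our own.

import numpy as np

# Hedged sketch of a per-component optimistic (UCB-style) selection rule
# for a factored-reward bandit; an illustrative instantiation, not F-UCB itself.
class FactoredUCB:
    def __init__(self, n_components, n_actions, c=2.0):
        self.counts = np.zeros((n_components, n_actions))  # pulls per (component, action)
        self.means = np.zeros((n_components, n_actions))   # empirical means of intermediate effects
        self.c = c                                         # exploration weight (our choice)
        self.t = 0                                         # round counter

    def select(self):
        self.t += 1
        bonus = np.sqrt(self.c * np.log(self.t) / np.maximum(self.counts, 1))
        ucb = self.means + bonus
        ucb[self.counts == 0] = np.inf   # force each pair to be tried once
        return ucb.argmax(axis=1)        # one action per component

    def update(self, actions, observations):
        # observations[i] is the intermediate effect of component i;
        # the scalar reward would be observations.prod().
        for i, (a, x) in enumerate(zip(actions, observations)):
            self.counts[i, a] += 1
            self.means[i, a] += (x - self.means[i, a]) / self.counts[i, a]

A usage loop would call select(), play the returned per-component actions, and feed the intermediate observations back via update(); crucially, updates exploit the per-component feedback rather than only the product reward.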

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-mussi24a,
  title     = {Factored-Reward Bandits with Intermediate Observations},
  author    = {Mussi, Marco and Drago, Simone and Restelli, Marcello and Metelli, Alberto Maria},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {36911--36952},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/mussi24a/mussi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/mussi24a.html},
  abstract  = {In several real-world sequential decision problems, the learner must, at every step, select multiple actions; each action affects a specific part of the system and generates an observable intermediate effect. In this paper, we introduce Factored-Reward Bandits (FRBs), a novel setting that captures and exploits the structure of this class of scenarios, in which the reward is the product of the actions' intermediate observations. We characterize the statistical complexity of learning in FRBs by deriving worst-case and asymptotic instance-dependent regret lower bounds. We then devise and analyze two regret-minimization algorithms. The former, F-UCB, is an anytime optimistic approach that matches the worst-case lower bound (up to logarithmic factors) but fails to perform optimally from the instance-dependent perspective. The latter, F-Track, is a bound-tracking approach that enjoys optimal asymptotic instance-dependent regret guarantees.}
}
Endnote
%0 Conference Paper
%T Factored-Reward Bandits with Intermediate Observations
%A Marco Mussi
%A Simone Drago
%A Marcello Restelli
%A Alberto Maria Metelli
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-mussi24a
%I PMLR
%P 36911--36952
%U https://proceedings.mlr.press/v235/mussi24a.html
%V 235
%X In several real-world sequential decision problems, the learner must, at every step, select multiple actions; each action affects a specific part of the system and generates an observable intermediate effect. In this paper, we introduce Factored-Reward Bandits (FRBs), a novel setting that captures and exploits the structure of this class of scenarios, in which the reward is the product of the actions' intermediate observations. We characterize the statistical complexity of learning in FRBs by deriving worst-case and asymptotic instance-dependent regret lower bounds. We then devise and analyze two regret-minimization algorithms. The former, F-UCB, is an anytime optimistic approach that matches the worst-case lower bound (up to logarithmic factors) but fails to perform optimally from the instance-dependent perspective. The latter, F-Track, is a bound-tracking approach that enjoys optimal asymptotic instance-dependent regret guarantees.
APA
Mussi, M., Drago, S., Restelli, M. & Metelli, A.M. (2024). Factored-Reward Bandits with Intermediate Observations. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:36911-36952. Available from https://proceedings.mlr.press/v235/mussi24a.html.