Sleeping Reinforcement Learning
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:14439-14498, 2025.
Abstract
In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent’s control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T 2^{A/2}})$ when the availability of actions is governed by a Markovian process, revealing a statistical barrier inherent to the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, showing that the problem shares the same complexity as standard RL.
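The paper's concrete algorithms are not reproduced here; as a rough illustration only, the sketch below shows one way a UCBVI-style optimistic backup could be restricted to the currently available ("awake") actions. The Hoeffding-style bonus, the clipping at $H$, and the masking convention are assumptions made for this sketch, not the authors' exact construction.

```python
import numpy as np

def optimistic_backup(P_hat, r_hat, counts, V_next, available, H, delta=0.05):
    """One stage of an optimistic (UCBVI-style) backup restricted to the
    awake actions of each state.

    P_hat:     (S, A, S) empirical transition probabilities
    r_hat:     (S, A)    empirical mean rewards
    counts:    (S, A)    visit counts n(s, a)
    V_next:    (S,)      optimistic value estimate for the next stage
    available: (S, A)    boolean mask of actions available in each state
    """
    S, A = r_hat.shape
    # Hoeffding-style exploration bonus (an assumed form); shrinks with visits.
    bonus = H * np.sqrt(np.log(S * A * H / delta) / np.maximum(counts, 1))
    Q = r_hat + P_hat @ V_next + bonus      # optimistic Q-values, shape (S, A)
    Q = np.minimum(Q, H)                    # values cannot exceed the horizon
    Q = np.where(available, Q, -np.inf)     # mask out sleeping actions
    V = Q.max(axis=1)                       # greedy over awake actions only
    greedy = Q.argmax(axis=1)               # assumes each state has an awake action
    return V, greedy
```

In the episodic-disclosure scenario the mask is known when planning for the whole episode, whereas in the harder setting of the abstract the awake set is observed only at each decision stage, so the same masking would have to be applied online when the action set is revealed.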