Exploration in Reward Machines with Low Regret

Hippolyte Bourel, Anders Jonsson, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:4114-4146, 2023.

Abstract

We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms, each of which exploits the structure of the reward machine, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, in either the episodic or average-reward setting. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
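For readers unfamiliar with the setting, the LaTeX sketch below spells out the two standard objects the abstract refers to: a reward machine (in the common formulation of Toro Icarte et al., 2018) and cumulative regret under the average-reward criterion. This notation is an assumption on our part and need not match the exact definitions used in the paper.

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% A reward machine over a finite event alphabet $\Sigma$ (a standard definition;
% the paper's own notation may differ):
A reward machine is a tuple $\mathcal{R} = \langle Q, q_0, \Sigma, \delta, \sigma \rangle$, where
$Q$ is a finite set of machine states, $q_0 \in Q$ is the initial state,
$\delta : Q \times \Sigma \to Q$ is a deterministic transition function, and
$\sigma : Q \times \Sigma \to [0,1]$ is the reward-output function.
% Cumulative regret after $T$ steps under the average-reward criterion, where
% $g^\star$ is the optimal gain (long-run average reward) and $r_t$ is the
% reward collected at step $t$:
\begin{equation*}
  \mathrm{Reg}(T) \;=\; T\, g^\star \;-\; \sum_{t=1}^{T} r_t .
\end{equation*}
\end{document}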

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-bourel23a, title = {Exploration in Reward Machines with Low Regret}, author = {Bourel, Hippolyte and Jonsson, Anders and Maillard, Odalric-Ambrym and Talebi, Mohammad Sadegh}, booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics}, pages = {4114--4146}, year = {2023}, editor = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem}, volume = {206}, series = {Proceedings of Machine Learning Research}, month = {25--27 Apr}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v206/bourel23a/bourel23a.pdf}, url = {https://proceedings.mlr.press/v206/bourel23a.html}, abstract = {We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.} }
Endnote
%0 Conference Paper %T Exploration in Reward Machines with Low Regret %A Hippolyte Bourel %A Anders Jonsson %A Odalric-Ambrym Maillard %A Mohammad Sadegh Talebi %B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2023 %E Francisco Ruiz %E Jennifer Dy %E Jan-Willem van de Meent %F pmlr-v206-bourel23a %I PMLR %P 4114--4146 %U https://proceedings.mlr.press/v206/bourel23a.html %V 206 %X We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
APA
Bourel, H., Jonsson, A., Maillard, O.-A., & Talebi, M.S. (2023). Exploration in Reward Machines with Low Regret. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:4114-4146. Available from https://proceedings.mlr.press/v206/bourel23a.html.
