Kernel-Based Reinforcement Learning: A Finite-Time Analysis

Omar Darwiche Domingues, Pierre Menard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2783-2792, 2021.

Abstract

We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and applies to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
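As an illustration of the kernel estimator described above, the sketch below shows how past transitions can be weighted by their metric distance to a queried state-action pair, yielding smoothed reward and next-state-value estimates together with an optimistic bonus that shrinks as the generalized visit count grows. This is a hypothetical sketch, not the authors' implementation: the Gaussian kernel, the bandwidth, the regularizer beta, and all function and variable names are illustrative assumptions.

import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Smoothing kernel: weight decays with the metric distance (illustrative choice).
    return np.exp(-(dist / bandwidth) ** 2)

def kernel_estimates(query_sa, past_sa, past_r, past_v_next, bandwidth=0.1, beta=0.01):
    """Smoothed reward / next-state-value estimates at a query (state, action).

    past_sa:     (n, d) array of past state-action feature vectors
    past_r:      (n,) observed rewards
    past_v_next: (n,) value function V_{h+1} evaluated at the observed next states
    """
    dists = np.linalg.norm(past_sa - query_sa, axis=1)  # metric distances to the query
    w = gaussian_kernel(dists, bandwidth)               # kernel weights of past samples
    count = w.sum() + beta                              # regularized generalized visit count
    r_hat = (w @ past_r) / count                        # weighted reward estimate
    pv_hat = (w @ past_v_next) / count                  # weighted next-state value estimate
    bonus = 1.0 / np.sqrt(count)                        # optimism: shrinks as count grows
    return r_hat, pv_hat, bonus

# Example query against synthetic data in [0, 1]^3:
rng = np.random.default_rng(0)
past_sa = rng.uniform(size=(500, 3))
r_hat, pv_hat, bonus = kernel_estimates(
    np.array([0.5, 0.5, 0.5]), past_sa, rng.uniform(size=500), rng.uniform(size=500)
)

The bandwidth governs the usual bias-variance trade-off of smoothing kernels, which is what gives rise to a rate of the form $K^{\frac{2d}{2d+1}}$ in the covering dimension $d$; the actual bonus used by Kernel-UCBVI also involves the horizon $H$ and a bias term depending on the bandwidth.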

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-domingues21a,
  title     = {Kernel-Based Reinforcement Learning: A Finite-Time Analysis},
  author    = {Domingues, Omar Darwiche and Menard, Pierre and Pirotta, Matteo and Kaufmann, Emilie and Valko, Michal},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {2783--2792},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/domingues21a/domingues21a.pdf},
  url       = {https://proceedings.mlr.press/v139/domingues21a.html},
  abstract  = {We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and applies to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.}
}
EndNote

%0 Conference Paper
%T Kernel-Based Reinforcement Learning: A Finite-Time Analysis
%A Omar Darwiche Domingues
%A Pierre Menard
%A Matteo Pirotta
%A Emilie Kaufmann
%A Michal Valko
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-domingues21a
%I PMLR
%P 2783--2792
%U https://proceedings.mlr.press/v139/domingues21a.html
%V 139
%X We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and applies to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
APA
Domingues, O.D., Menard, P., Pirotta, M., Kaufmann, E. & Valko, M. (2021). Kernel-Based Reinforcement Learning: A Finite-Time Analysis. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:2783-2792. Available from https://proceedings.mlr.press/v139/domingues21a.html.