First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Andrew J Wagenmaker, Yifang Chen, Max Simchowitz, Simon Du, Kevin Jamieson
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:22384-22429, 2022.

Abstract

Obtaining first-order regret bounds—regret bounds scaling not with the worst case but with some measure of the performance of the optimal policy on a given instance—is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K)$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $d$ is the feature dimension, $H$ the horizon, $V_1^\star$ the value of the optimal policy, and $K$ the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
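The abstract's key technical tool is a concentration bound built on the Catoni mean estimator rather than on least squares. As a rough illustration only, the following is a minimal sketch of the classical Catoni M-estimator of a mean (Catoni, 2012), not the paper's self-normalized variant; the function names, the variance-bound argument sigma2, and the particular choice of the scale alpha are assumptions made for this sketch rather than anything specified on this page.

import numpy as np
from scipy.optimize import brentq

def catoni_psi(x):
    # Catoni's influence function: grows only logarithmically, so a few
    # heavy-tailed outliers cannot drag the estimate arbitrarily far.
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(samples, sigma2, delta=0.05):
    # samples: 1-D array of observations
    # sigma2:  assumed upper bound on the variance of each observation
    # delta:   failure probability of the deviation guarantee
    x = np.asarray(samples, dtype=float)
    n = len(x)
    # Scale parameter from Catoni's analysis; requires n > 2*log(2/delta).
    log_term = 2.0 * np.log(2.0 / delta)
    alpha = np.sqrt(log_term / (n * sigma2 * (1.0 + log_term / (n - log_term))))
    # The estimate is the root in mu of the monotone equation
    #   sum_i psi(alpha * (x_i - mu)) = 0,
    # bracketed by the sample range so brentq sees a guaranteed sign change.
    f = lambda mu: np.sum(catoni_psi(alpha * (x - mu)))
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

# Heavy-tailed example: Student-t with 2.5 degrees of freedom (variance 5),
# where the empirical mean is noticeably more fragile.
rng = np.random.default_rng(0)
data = rng.standard_t(df=2.5, size=2000)
print(catoni_mean(data, sigma2=5.0), data.mean())

With probability at least $1-\delta$, the root of this monotone equation lies within roughly $\sqrt{\sigma^2 \log(1/\delta)/n}$ of the true mean under only a bounded-variance assumption, which is the kind of robustness the paper exploits in place of least-squares estimation.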

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-wagenmaker22a,
  title     = {First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach},
  author    = {Wagenmaker, Andrew J and Chen, Yifang and Simchowitz, Max and Du, Simon and Jamieson, Kevin},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {22384--22429},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/wagenmaker22a/wagenmaker22a.pdf},
  url       = {https://proceedings.mlr.press/v162/wagenmaker22a.html}
}
Endnote
%0 Conference Paper
%T First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach
%A Andrew J Wagenmaker
%A Yifang Chen
%A Max Simchowitz
%A Simon Du
%A Kevin Jamieson
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-wagenmaker22a
%I PMLR
%P 22384--22429
%U https://proceedings.mlr.press/v162/wagenmaker22a.html
%V 162
APA
Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., & Jamieson, K. (2022). First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:22384-22429. Available from https://proceedings.mlr.press/v162/wagenmaker22a.html.
