Optimistic PAC Reinforcement Learning: the Instance-Dependent View

Andrea Tirinzoni; Aymen Al-Marjani; Emilie Kaufmann

Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:1460-1480, 2023.

Abstract

Optimistic algorithms have been extensively studied for regret minimization in episodic tabular Markov Decision Processes (MDPs), both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near instance-optimal (up to a factor of the horizon). On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.

Cite this Paper

BibTeX


@InProceedings{pmlr-v201-tirinzoni23a,
  title = 	 {Optimistic PAC Reinforcement Learning: the Instance-Dependent View},
  author =       {Tirinzoni, Andrea and Al-Marjani, Aymen and Kaufmann, Emilie},
  booktitle = 	 {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages = 	 {1460--1480},
  year = 	 {2023},
  editor = 	 {Agrawal, Shipra and Orabona, Francesco},
  volume = 	 {201},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {20 Feb--23 Feb},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v201/tirinzoni23a/tirinzoni23a.pdf},
  url = 	 {https://proceedings.mlr.press/v201/tirinzoni23a.html},
  abstract = 	 {Optimistic algorithms have been extensively studied for regret minimization in episodic tabular Markov Decision Processes (MDPs), both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near instance-optimal (up to a factor of the horizon). On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.}
}

Endnote

%0 Conference Paper
%T Optimistic PAC Reinforcement Learning: the Instance-Dependent View
%A Andrea Tirinzoni
%A Aymen Al-Marjani
%A Emilie Kaufmann
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona
%F pmlr-v201-tirinzoni23a
%I PMLR
%P 1460--1480
%U https://proceedings.mlr.press/v201/tirinzoni23a.html
%V 201
%X Optimistic algorithms have been extensively studied for regret minimization in episodic tabular Markov Decision Processes (MDPs), both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near instance-optimal (up to a factor of the horizon). On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.

APA


Tirinzoni, A., Al-Marjani, A. & Kaufmann, E. (2023). Optimistic PAC Reinforcement Learning: the Instance-Dependent View. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:1460-1480. Available from https://proceedings.mlr.press/v201/tirinzoni23a.html.
