Best of Both Worlds Policy Optimization

Christoph Dann, Chen-Yu Wei, Julian Zimmert
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:6968-7008, 2023.

Abstract

Policy optimization methods are popular reinforcement learning algorithms in practice, and recent works have built theoretical foundations for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that by carefully designing the regularizer, bonus terms, and learning rates, one can achieve a more favorable $\text{polylog}(T)$ regret bound when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. Specifically, we show the first best of both worlds guarantee for policy optimization in tabular MDPs by leveraging either a Tsallis entropy or a Shannon entropy regularizer. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log barrier regularizer.
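For intuition only, the sketch below shows the kind of update a Tsallis-entropy regularizer induces: an FTRL step over an action distribution (e.g., a policy at a single state), where the closed-form solution is normalized by a multiplier found via binary search. This is a minimal, hypothetical illustration with the 1/2-Tsallis regularizer; the function name and parameters are invented for this example, and it omits the paper's bonus terms, transition estimation, and learning-rate schedule.

```python
import numpy as np

def tsallis_ftrl_step(cum_loss, eta, tol=1e-10):
    """One FTRL step with the 1/2-Tsallis entropy regularizer (illustrative sketch).

    Solves  p = argmin_{p in simplex}  <p, cum_loss> - (2/eta) * sum_a sqrt(p_a).
    The KKT conditions give  p_a = 1 / (eta * (cum_loss_a + lam))^2  for the
    Lagrange multiplier lam that makes the probabilities sum to one.
    """
    cum_loss = np.asarray(cum_loss, dtype=float)

    # lam must exceed -min(cum_loss) so every p_a is positive and finite;
    # at that boundary the sum exceeds 1, and it decreases monotonically in lam.
    lo = -cum_loss.min() + 1e-12
    step = 1.0
    while np.sum(1.0 / (eta * (cum_loss + lo + step)) ** 2) > 1.0:
        step *= 2.0          # grow the bracket until the sum drops below 1
    hi = lo + step

    # Binary search for the normalizing multiplier lam.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (eta * (cum_loss + mid)) ** 2) > 1.0:
            lo = mid
        else:
            hi = mid

    p = 1.0 / (eta * (cum_loss + hi)) ** 2
    return p / p.sum()       # renormalize away residual numerical error


# Example: three actions with cumulative (estimated) losses after some rounds.
# The resulting distribution puts most probability on the lowest-loss action.
print(tsallis_ftrl_step(cum_loss=[3.0, 5.0, 10.0], eta=0.5))
```

A larger learning rate eta concentrates the distribution faster on low-loss actions; the square-root regularizer is what yields the polylog(T) stochastic rate while retaining the $\sqrt{T}$ adversarial guarantee in the bandit analyses this paper builds on.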

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-dann23b,
  title     = {Best of Both Worlds Policy Optimization},
  author    = {Dann, Christoph and Wei, Chen-Yu and Zimmert, Julian},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {6968--7008},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/dann23b/dann23b.pdf},
  url       = {https://proceedings.mlr.press/v202/dann23b.html},
  abstract  = {Policy optimization methods are popular reinforcement learning algorithms in practice, and recent works have built theoretical foundations for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that by carefully designing the regularizer, bonus terms, and learning rates, one can achieve a more favorable $\text{polylog}(T)$ regret bound when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. Specifically, we show the first best of both worlds guarantee for policy optimization in tabular MDPs by leveraging either a Tsallis entropy or a Shannon entropy regularizer. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log barrier regularizer.}
}
Endnote
%0 Conference Paper
%T Best of Both Worlds Policy Optimization
%A Christoph Dann
%A Chen-Yu Wei
%A Julian Zimmert
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-dann23b
%I PMLR
%P 6968--7008
%U https://proceedings.mlr.press/v202/dann23b.html
%V 202
%X Policy optimization methods are popular reinforcement learning algorithms in practice, and recent works have built theoretical foundations for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that by carefully designing the regularizer, bonus terms, and learning rates, one can achieve a more favorable $\text{polylog}(T)$ regret bound when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. Specifically, we show the first best of both worlds guarantee for policy optimization in tabular MDPs by leveraging either a Tsallis entropy or a Shannon entropy regularizer. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log barrier regularizer.
APA
Dann, C., Wei, C.-Y. & Zimmert, J. (2023). Best of Both Worlds Policy Optimization. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:6968-7008. Available from https://proceedings.mlr.press/v202/dann23b.html.