Reinforcement Learning for Cost-Aware Markov Decision Processes

Wesley Suttle, Kaiqing Zhang, Zhuoran Yang, Ji Liu, David Kraemer
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9989-9999, 2021.

Abstract

Ratio maximization has applications in areas as diverse as finance, reward shaping for reinforcement learning (RL), and the development of safe artificial intelligence, yet there has been very little exploration of RL algorithms for ratio maximization. This paper addresses this deficiency by introducing two new, model-free RL algorithms for solving cost-aware Markov decision processes, where the goal is to maximize the ratio of long-run average reward to long-run average cost. The first algorithm is a two-timescale scheme based on relative value iteration (RVI) Q-learning and the second is an actor-critic scheme. The paper proves almost sure convergence of the former to the globally optimal solution in the tabular case and almost sure convergence of the latter under linear function approximation for the critic. Unlike previous methods, the two algorithms provably converge for general reward and cost functions under suitable conditions. The paper also provides empirical results demonstrating promising performance and lending strong support to the theoretical results.
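To make the objective concrete: a cost-aware MDP asks for a policy that maximizes the ratio of the long-run average reward to the long-run average cost. The Python sketch below illustrates the flavor of a tabular two-timescale scheme in this spirit; it is not the paper's algorithm verbatim. The environment interface (reset/step returning both a reward and a cost), the step-size schedules, the reference state-action pair used for the RVI offset, and the slow-timescale ratio update are all assumptions made purely for illustration.

# Illustrative sketch (not the paper's exact algorithm): a tabular two-timescale
# scheme for maximizing long-run average reward / long-run average cost.
# Fast timescale: RVI-style Q-learning on the surrogate reward r - rho * c.
# Slow timescale: rho tracks the ratio of accumulated reward to accumulated cost.
import numpy as np

def ratio_rvi_q_learning(env, num_steps=100_000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    rho = 0.0                              # current estimate of the reward/cost ratio
    total_reward, total_cost = 0.0, 1e-8   # accumulators for the slow update
    s = env.reset()
    for t in range(1, num_steps + 1):
        # epsilon-greedy behavior policy on the current Q table
        if rng.random() < epsilon:
            a = int(rng.integers(env.num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, c = env.step(a)       # assumed env API: returns (state, reward, cost)
        alpha = 1.0 / (1.0 + t) ** 0.6   # fast step size
        beta = 1.0 / (1.0 + t)           # slow step size, beta = o(alpha)
        # RVI-style Q-learning update on the surrogate reward r - rho * c,
        # using Q at a fixed reference state-action pair as the offset.
        offset = Q[0, 0]
        target = (r - rho * c) - offset + np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        # Slow-timescale update of the ratio estimate toward observed reward/cost.
        total_reward += r
        total_cost += c
        rho += beta * (total_reward / total_cost - rho)
        s = s_next
    return Q, rho

Because beta decays faster than alpha, the Q-table effectively sees a quasi-static rho while rho drifts toward the reward-to-cost ratio induced by the improving policy, which is the two-timescale structure the abstract describes.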

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-suttle21a,
  title     = {Reinforcement Learning for Cost-Aware Markov Decision Processes},
  author    = {Suttle, Wesley and Zhang, Kaiqing and Yang, Zhuoran and Liu, Ji and Kraemer, David},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {9989--9999},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/suttle21a/suttle21a.pdf},
  url       = {https://proceedings.mlr.press/v139/suttle21a.html}
}
APA
Suttle, W., Zhang, K., Yang, Z., Liu, J. & Kraemer, D. (2021). Reinforcement Learning for Cost-Aware Markov Decision Processes. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:9989-9999. Available from https://proceedings.mlr.press/v139/suttle21a.html.
