Safe Policy Iteration

Matteo Pirotta; Marcello Restelli; Alessio Pecorino; Daniele Calandriello

Safe Policy Iteration

Matteo Pirotta, Marcello Restelli, Alessio Pecorino, Daniele Calandriello

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):307-315, 2013.

Abstract

This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., at each iteration we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy. When no improving policy can be found the algorithm stops. We propose two safe policy-iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state-of-the-art approaches on some chain-walk domains and on the Blackjack card game.

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-pirotta13,
  title = 	 {Safe Policy Iteration},
  author = 	 {Pirotta, Matteo and Restelli, Marcello and Pecorino, Alessio and Calandriello, Daniele},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {307--315},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {3},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/pirotta13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/pirotta13.html},
  abstract = 	 {This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms.  When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur.  To address this issue, we consider safe policy improvements, i.e., at each iteration we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy. When no improving policy can be found the algorithm stops.  We propose two safe policy-iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy.  Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state-of-the-art approaches on some chain-walk domains and on the Blackjack card game.}
}

Endnote

%0 Conference Paper
%T Safe Policy Iteration
%A Matteo Pirotta
%A Marcello Restelli
%A Alessio Pecorino
%A Daniele Calandriello
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester	
%F pmlr-v28-pirotta13
%I PMLR
%P 307--315
%U https://proceedings.mlr.press/v28/pirotta13.html
%V 28
%N 3
%X This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms.  When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur.  To address this issue, we consider safe policy improvements, i.e., at each iteration we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy. When no improving policy can be found the algorithm stops.  We propose two safe policy-iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy.  Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state-of-the-art approaches on some chain-walk domains and on the Blackjack card game.

RIS


TY  - CPAPER
TI  - Safe Policy Iteration
AU  - Matteo Pirotta
AU  - Marcello Restelli
AU  - Alessio Pecorino
AU  - Daniele Calandriello
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/26
ED  - Sanjoy Dasgupta
ED  - David McAllester	
ID  - pmlr-v28-pirotta13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 3
SP  - 307
EP  - 315
L1  - http://proceedings.mlr.press/v28/pirotta13.pdf
UR  - https://proceedings.mlr.press/v28/pirotta13.html
AB  - This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms.  When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur.  To address this issue, we consider safe policy improvements, i.e., at each iteration we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy. When no improving policy can be found the algorithm stops.  We propose two safe policy-iteration algorithms that differ in the way the next policy is chosen w.r.t. the estimated greedy policy.  Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared with state-of-the-art approaches on some chain-walk domains and on the Blackjack card game.
ER  -

APA


Pirotta, M., Restelli, M., Pecorino, A. & Calandriello, D.. (2013). Safe Policy Iteration. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(3):307-315 Available from https://proceedings.mlr.press/v28/pirotta13.html.

Safe Policy Iteration

Abstract

Cite this Paper

Related Material