Policy Iteration for Two-Player General-Sum Stochastic Stackelberg Games

Mikoto Kudo, Youhei Akimoto
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:718-733, 2025.

Abstract

We address two-player general-sum stochastic Stackelberg games (SSGs), where the leader’s policy is optimized considering the best-response follower whose policy is optimal for its reward under the leader. Existing policy gradient and value iteration approaches for SSGs do not guarantee monotone improvement in the leader’s policy under the best-response follower. Consequently, their performance is not guaranteed when their limits are not stationary Stackelberg equilibria (SSEs), which do not necessarily exist. In this paper, we derive a policy improvement theorem for SSGs under the best-response follower and propose a novel policy iteration algorithm that guarantees monotone improvement in the leader’s performance. Additionally, we introduce Pareto-optimality as an extended optimality of the SSE and prove that our method converges to the Pareto front when the leader is myopic.
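For context, the paper builds on the classical policy iteration scheme, whose single-agent (MDP) form does guarantee monotone improvement. The sketch below is that textbook algorithm for a small tabular MDP, not the paper's SSG method; the transition tensor `P[s, a, s']` and reward matrix `R[s, a]` are illustrative assumptions.

```python
import numpy as np

# Classical policy iteration for a single-agent tabular MDP (background
# only; NOT the paper's SSG algorithm). P has shape (S, A, S'), R has
# shape (S, A), gamma is the discount factor.
def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(n_states), policy]  # (S, S') under current policy
        r_pi = R[np.arange(n_states), policy]  # (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the Q-values.
        q = R + gamma * P @ v                  # (S, A)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v                   # stable policy => optimal
        policy = new_policy
```

Each improvement step weakly increases the value function in every state, which is exactly the monotonicity property the paper extends to the leader's policy in SSGs under a best-response follower.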

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-kudo25a,
  title     = {Policy Iteration for Two-Player General-Sum Stochastic Stackelberg Games},
  author    = {Kudo, Mikoto and Akimoto, Youhei},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {718--733},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/kudo25a/kudo25a.pdf},
  url       = {https://proceedings.mlr.press/v304/kudo25a.html},
  abstract  = {We address two-player general-sum stochastic Stackelberg games (SSGs), where the leader’s policy is optimized considering the best-response follower whose policy is optimal for its reward under the leader. Existing policy gradient and value iteration approaches for SSGs do not guarantee monotone improvement in the leader’s policy under the best-response follower. Consequently, their performance is not guaranteed when their limits are not stationary Stackelberg equilibria (SSEs), which do not necessarily exist. In this paper, we derive a policy improvement theorem for SSGs under the best-response follower and propose a novel policy iteration algorithm that guarantees monotone improvement in the leader’s performance. Additionally, we introduce Pareto-optimality as an extended optimality of the SSE and prove that our method converges to the Pareto front when the leader is myopic.}
}
Endnote
%0 Conference Paper
%T Policy Iteration for Two-Player General-Sum Stochastic Stackelberg Games
%A Mikoto Kudo
%A Youhei Akimoto
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-kudo25a
%I PMLR
%P 718--733
%U https://proceedings.mlr.press/v304/kudo25a.html
%V 304
%X We address two-player general-sum stochastic Stackelberg games (SSGs), where the leader’s policy is optimized considering the best-response follower whose policy is optimal for its reward under the leader. Existing policy gradient and value iteration approaches for SSGs do not guarantee monotone improvement in the leader’s policy under the best-response follower. Consequently, their performance is not guaranteed when their limits are not stationary Stackelberg equilibria (SSEs), which do not necessarily exist. In this paper, we derive a policy improvement theorem for SSGs under the best-response follower and propose a novel policy iteration algorithm that guarantees monotone improvement in the leader’s performance. Additionally, we introduce Pareto-optimality as an extended optimality of the SSE and prove that our method converges to the Pareto front when the leader is myopic.
APA
Kudo, M. & Akimoto, Y. (2025). Policy Iteration for Two-Player General-Sum Stochastic Stackelberg Games. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:718-733. Available from https://proceedings.mlr.press/v304/kudo25a.html.