Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

Canzhe Zhao, Yutian Cheng, Jing Dong, Baoxiang Wang, Shuai Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:77405-77440, 2025.

Abstract

We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE with a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback in each episode $k$. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.
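To make the algorithmic idea in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm: it shows an entropy-regularized multiplicative-weights (mirror-descent) update on a single information set together with an importance-weighted loss estimator for bandit feedback. The function names, step size `eta`, and regularization weight `tau` are assumptions for illustration; the paper's actual update uses a negentropy regularizer weighted by a "virtual transition" over the entire information set-action space, which this per-information-set sketch does not reproduce.

```python
import numpy as np

def omd_entropy_step(policy, loss_estimate, eta=0.1, tau=0.01):
    """One entropy-regularized mirror-descent step on a single information set.

    Illustrative sketch only (NOT the paper's algorithm). The extra entropy
    term, controlled by `tau`, shrinks the update toward the uniform policy,
    which is the standard mechanism behind last-iterate (rather than
    average-iterate) convergence guarantees.
    """
    # Exponentiated-gradient step on the loss estimate, with the log-policy
    # damped by (1 - eta * tau) to implement the entropy regularization.
    logits = (1.0 - eta * tau) * np.log(policy + 1e-12) - eta * loss_estimate
    new_policy = np.exp(logits - logits.max())  # subtract max for stability
    return new_policy / new_policy.sum()

def bandit_loss_estimate(num_actions, played_action, observed_loss, play_prob):
    """Importance-weighted loss estimator under bandit feedback: only the
    played action's loss is observed, so it is reweighted by 1 / probability."""
    estimate = np.zeros(num_actions)
    estimate[played_action] = observed_loss / max(play_prob, 1e-12)
    return estimate

# Toy usage on a 3-action information set.
policy = np.ones(3) / 3
a = np.random.choice(3, p=policy)
loss_hat = bandit_loss_estimate(3, a, observed_loss=0.7, play_prob=policy[a])
policy = omd_entropy_step(policy, loss_hat, eta=0.1, tau=0.01)
print(policy)
```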

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhao25h,
  title     = {Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback},
  author    = {Zhao, Canzhe and Cheng, Yutian and Dong, Jing and Wang, Baoxiang and Li, Shuai},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {77405--77440},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhao25h/zhao25h.pdf},
  url       = {https://proceedings.mlr.press/v267/zhao25h.html},
  abstract  = {We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE with a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback in each episode $k$. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.}
}
Endnote
%0 Conference Paper
%T Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback
%A Canzhe Zhao
%A Yutian Cheng
%A Jing Dong
%A Baoxiang Wang
%A Shuai Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhao25h
%I PMLR
%P 77405--77440
%U https://proceedings.mlr.press/v267/zhao25h.html
%V 267
%X We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE with a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback in each episode $k$. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.
APA
Zhao, C., Cheng, Y., Dong, J., Wang, B. & Li, S. (2025). Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:77405-77440. Available from https://proceedings.mlr.press/v267/zhao25h.html.