Efficient Learning for AlphaZero via Path Consistency

Dengwei Zhao, Shikui Tu, Lei Xu
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:26971-26981, 2022.

Abstract

In recent years, deep reinforcement learning has made great breakthroughs on board games. Still, most of this work requires huge computational resources for large-scale environmental interaction or self-play. This paper aims at building powerful models under a limited amount of self-play, an amount that a human could make use of over a lifetime. We propose a learning algorithm built on AlphaZero, with its path searching regularised by a path consistency (PC) optimality condition, i.e., values on one optimal search path should be identical. Thus, the algorithm is named PCZero for short. In implementation, historical trajectories and search paths scouted by MCTS strike a good balance between exploration and exploitation, which effectively enhances the generalization ability. PCZero obtains a $94.1\%$ winning rate against the champion of the Hex Computer Olympiad in 2015 on $13\times 13$ Hex, much higher than the $84.3\%$ achieved by AlphaZero. The models consume only $900K$ self-play games, about the amount humans can study in a lifetime. The improvements by PCZero have also been generalized to Othello and Gomoku. Experiments also demonstrate the efficiency of PCZero under the offline learning setting.
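
As a rough illustration of the path consistency idea stated in the abstract (values on one optimal search path should be identical), the sketch below scores a path by how far its values deviate from their average. The function name `path_consistency_penalty` and the choice of a mean squared deviation are assumptions made for illustration only; the abstract does not specify the loss actually used in PCZero.

```python
from statistics import fmean
from typing import Sequence


def path_consistency_penalty(path_values: Sequence[float]) -> float:
    """Mean squared deviation of the values on one path from their average.

    Hypothetical helper, not taken from the paper: it returns 0.0 when
    every value on the path is identical and grows as the values diverge.
    """
    if not path_values:
        raise ValueError("path_values must contain at least one value")
    # Use the path average as the shared target that all values should match.
    target = fmean(path_values)
    return fmean([(v - target) ** 2 for v in path_values])


if __name__ == "__main__":
    consistent = [0.5, 0.5, 0.5]   # identical values along the path
    inconsistent = [1.0, -1.0]     # values that disagree along the path
    print(path_consistency_penalty(consistent))
    print(path_consistency_penalty(inconsistent))
```

A penalty of this kind could be added, with some weight, to a value network's usual training loss so that estimates along a searched path are pulled toward one another; how PCZero weights or windows such a term is not stated in the abstract.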

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-zhao22h,
  title = {Efficient Learning for {A}lpha{Z}ero via Path Consistency},
  author = {Zhao, Dengwei and Tu, Shikui and Xu, Lei},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages = {26971--26981},
  year = {2022},
  editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = {162},
  series = {Proceedings of Machine Learning Research},
  month = {17--23 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v162/zhao22h/zhao22h.pdf},
  url = {https://proceedings.mlr.press/v162/zhao22h.html},
  abstract = {In recent years, deep reinforcement learning has made great breakthroughs on board games. Still, most of this work requires huge computational resources for large-scale environmental interaction or self-play. This paper aims at building powerful models under a limited amount of self-play, an amount that a human could make use of over a lifetime. We propose a learning algorithm built on AlphaZero, with its path searching regularised by a path consistency (PC) optimality condition, i.e., values on one optimal search path should be identical. Thus, the algorithm is named PCZero for short. In implementation, historical trajectories and search paths scouted by MCTS strike a good balance between exploration and exploitation, which effectively enhances the generalization ability. PCZero obtains a $94.1\%$ winning rate against the champion of the Hex Computer Olympiad in 2015 on $13\times 13$ Hex, much higher than the $84.3\%$ achieved by AlphaZero. The models consume only $900K$ self-play games, about the amount humans can study in a lifetime. The improvements by PCZero have also been generalized to Othello and Gomoku. Experiments also demonstrate the efficiency of PCZero under the offline learning setting.}
}
Endnote
%0 Conference Paper
%T Efficient Learning for AlphaZero via Path Consistency
%A Dengwei Zhao
%A Shikui Tu
%A Lei Xu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-zhao22h
%I PMLR
%P 26971--26981
%U https://proceedings.mlr.press/v162/zhao22h.html
%V 162
%X In recent years, deep reinforcement learning has made great breakthroughs on board games. Still, most of this work requires huge computational resources for large-scale environmental interaction or self-play. This paper aims at building powerful models under a limited amount of self-play, an amount that a human could make use of over a lifetime. We propose a learning algorithm built on AlphaZero, with its path searching regularised by a path consistency (PC) optimality condition, i.e., values on one optimal search path should be identical. Thus, the algorithm is named PCZero for short. In implementation, historical trajectories and search paths scouted by MCTS strike a good balance between exploration and exploitation, which effectively enhances the generalization ability. PCZero obtains a $94.1\%$ winning rate against the champion of the Hex Computer Olympiad in 2015 on $13\times 13$ Hex, much higher than the $84.3\%$ achieved by AlphaZero. The models consume only $900K$ self-play games, about the amount humans can study in a lifetime. The improvements by PCZero have also been generalized to Othello and Gomoku. Experiments also demonstrate the efficiency of PCZero under the offline learning setting.
APA
Zhao, D., Tu, S., & Xu, L. (2022). Efficient Learning for AlphaZero via Path Consistency. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:26971-26981. Available from https://proceedings.mlr.press/v162/zhao22h.html.
