Modeling Strong and Human-Like Gameplay with KL-Regularized Search

Athul Paul Jacob, David J Wu, Gabriele Farina, Adam Lerer, Hengyuan Hu, Anton Bakhtin, Jacob Andreas, Noam Brown
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9695-9728, 2022.

Abstract

We consider the task of accurately modeling strong human policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans (e.g., by sometimes committing blunders), while self-play learning and search techniques such as AlphaZero lead to strong performance but may produce policies that differ markedly from human behavior. In chess and Go, we show that regularized search algorithms that penalize KL divergence from an imitation-learned policy yield higher prediction accuracy of strong humans and better performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.
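The abstract describes search procedures that trade off expected value against KL divergence from an imitation-learned anchor policy. As a rough illustration of that trade-off only (not the paper's algorithm; its Diplomacy variant is a regret-minimization procedure), the sketch below computes the closed-form maximizer of E_{a~pi}[Q(a)] - lam * KL(pi || tau) for a single decision point. The arrays q and tau and the parameter lam are hypothetical toy values, not taken from the paper.

```python
import numpy as np

def kl_regularized_policy(q_values, anchor_policy, lam):
    """Illustrative sketch (not the paper's algorithm): the closed-form
    maximizer of  E_{a~pi}[Q(a)] - lam * KL(pi || anchor)  over policies pi,
    which is pi(a) proportional to anchor(a) * exp(Q(a) / lam).

    q_values      : value estimates from search, one per legal action
    anchor_policy : imitation-learned (human-like) probabilities, same shape
    lam           : regularization strength; large lam stays close to the
                    anchor, small lam follows the value estimates
    """
    logits = np.log(anchor_policy + 1e-12) + q_values / lam
    logits -= logits.max()          # numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical toy example with three legal moves.
q = np.array([0.9, 0.5, -0.2])      # search value estimates
tau = np.array([0.2, 0.7, 0.1])     # imitation-learned anchor policy
print(kl_regularized_policy(q, tau, lam=1.0))    # blends strength with human-likeness
print(kl_regularized_policy(q, tau, lam=10.0))   # nearly reproduces the anchor
```

Varying lam interpolates between pure value maximization and pure imitation, which is the qualitative behavior the abstract attributes to KL-regularized search.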

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-jacob22a,
  title     = {Modeling Strong and Human-Like Gameplay with {KL}-Regularized Search},
  author    = {Jacob, Athul Paul and Wu, David J and Farina, Gabriele and Lerer, Adam and Hu, Hengyuan and Bakhtin, Anton and Andreas, Jacob and Brown, Noam},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {9695--9728},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/jacob22a/jacob22a.pdf},
  url       = {https://proceedings.mlr.press/v162/jacob22a.html},
  abstract  = {We consider the task of accurately modeling strong human policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans (e.g., by sometimes committing blunders), while self-play learning and search techniques such as AlphaZero lead to strong performance but may produce policies that differ markedly from human behavior. In chess and Go, we show that regularized search algorithms that penalize KL divergence from an imitation-learned policy yield higher prediction accuracy of strong humans and better performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.}
}
Endnote
%0 Conference Paper
%T Modeling Strong and Human-Like Gameplay with KL-Regularized Search
%A Athul Paul Jacob
%A David J Wu
%A Gabriele Farina
%A Adam Lerer
%A Hengyuan Hu
%A Anton Bakhtin
%A Jacob Andreas
%A Noam Brown
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-jacob22a
%I PMLR
%P 9695--9728
%U https://proceedings.mlr.press/v162/jacob22a.html
%V 162
%X We consider the task of accurately modeling strong human policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans (e.g., by sometimes committing blunders), while self-play learning and search techniques such as AlphaZero lead to strong performance but may produce policies that differ markedly from human behavior. In chess and Go, we show that regularized search algorithms that penalize KL divergence from an imitation-learned policy yield higher prediction accuracy of strong humans and better performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.
APA
Jacob, A.P., Wu, D.J., Farina, G., Lerer, A., Hu, H., Bakhtin, A., Andreas, J. & Brown, N. (2022). Modeling Strong and Human-Like Gameplay with KL-Regularized Search. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:9695-9728. Available from https://proceedings.mlr.press/v162/jacob22a.html.