Off-Belief Learning

Hengyuan Hu; Adam Lerer; Brandon Cui; Luis Pineda; Noam Brown; Jakob Foerster

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4369-4379, 2021.

Abstract

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents’ actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy ($\pi_0$), but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents’ behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy-setting and the benchmark human-AI & ZSC problem Hanabi.
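The role of a uniform random $\pi_0$ can be illustrated with a toy Bayes update. The sketch below is not the paper's implementation: the two hidden states, the two actions, the prior, and the `posterior` helper are all hypothetical, chosen only to show that a partner action is informative under a convention-based policy but leaves the belief unchanged when past actions are assumed to come from a uniform random policy.

```python
from fractions import Fraction


def posterior(prior, policy, action):
    """Bayes update of a belief over hidden states after seeing a partner action.

    prior:  dict mapping state -> prior probability
    policy: dict mapping state -> {action: probability}; plays the role of the
            assumed past policy pi_0 (hypothetical toy representation)
    """
    unnormalized = {
        s: p * policy[s].get(action, Fraction(0)) for s, p in prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("action has zero probability under the assumed policy")
    return {s: w / total for s, w in unnormalized.items()}


# Hypothetical toy problem: the partner sees a hidden state and picks an action.
states = ["red", "blue"]
actions = ["hint_a", "hint_b"]
prior = {"red": Fraction(1, 3), "blue": Fraction(2, 3)}

# A convention-following partner: the action fully reveals the hidden state.
convention = {
    "red": {"hint_a": Fraction(1)},
    "blue": {"hint_b": Fraction(1)},
}

# Uniform random pi_0: every action is equally likely in every state.
uniform = {s: {a: Fraction(1, len(actions)) for a in actions} for s in states}

belief_convention = posterior(prior, convention, "hint_a")
belief_uniform = posterior(prior, uniform, "hint_a")

print("belief under convention:", belief_convention)
print("belief under uniform pi_0:", belief_uniform)
```

Under the uniform assumption the posterior equals the prior, so any policy optimized against that belief cannot depend on conventions carried by the partner's past actions. This is only meant to convey the intuition behind the "grounded" policy described in the abstract.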

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-hu21c,
  title = 	 {Off-Belief Learning},
  author =       {Hu, Hengyuan and Lerer, Adam and Cui, Brandon and Pineda, Luis and Brown, Noam and Foerster, Jakob},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {4369--4379},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/hu21c/hu21c.pdf},
  url = 	 {https://proceedings.mlr.press/v139/hu21c.html},
  abstract = 	 {The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents’ actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy ($\pi_0$), but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents’ behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy-setting and the benchmark human-AI & ZSC problem Hanabi.}
}

Endnote

%0 Conference Paper
%T Off-Belief Learning
%A Hengyuan Hu
%A Adam Lerer
%A Brandon Cui
%A Luis Pineda
%A Noam Brown
%A Jakob Foerster
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hu21c
%I PMLR
%P 4369--4379
%U https://proceedings.mlr.press/v139/hu21c.html
%V 139
%X The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents’ actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy ($\pi_0$), but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents’ behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy-setting and the benchmark human-AI & ZSC problem Hanabi.

APA


Hu, H., Lerer, A., Cui, B., Pineda, L., Brown, N. & Foerster, J. (2021). Off-Belief Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4369-4379. Available from https://proceedings.mlr.press/v139/hu21c.html.
