Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods

Chris Nota; Philip Thomas; Bruno C. Da Silva

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8238-8247, 2021.

Abstract

Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-nota21a,
  title = 	 {Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods},
  author =       {Nota, Chris and Thomas, Philip and Da Silva, Bruno C.},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {8238--8247},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/nota21a/nota21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/nota21a.html},
  abstract = 	 {Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.}
}

Endnote

%0 Conference Paper
%T Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods
%A Chris Nota
%A Philip Thomas
%A Bruno C. Da Silva
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-nota21a
%I PMLR
%P 8238--8247
%U https://proceedings.mlr.press/v139/nota21a.html
%V 139
%X Hindsight allows reinforcement learning agents to leverage new observations to make inferences about earlier states and transitions. In this paper, we exploit the idea of hindsight and introduce posterior value functions. Posterior value functions are computed by inferring the posterior distribution over hidden components of the state in previous timesteps and can be used to construct novel unbiased baselines for policy gradient methods. Importantly, we prove that these baselines reduce (and never increase) the variance of policy gradient estimators compared to traditional state value functions. While the posterior value function is motivated by partial observability, we extend these results to arbitrary stochastic MDPs by showing that hindsight-capable agents can model stochasticity in the environment as a special case of partial observability. Finally, we introduce a pair of methods for learning posterior value functions and prove their convergence.

APA

Nota, C., Thomas, P. & Da Silva, B. C. (2021). Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8238-8247. Available from https://proceedings.mlr.press/v139/nota21a.html.
