Learning in POMDPs is Sample-Efficient with Hindsight Observability

Jonathan Lee, Alekh Agarwal, Christoph Dann, Tong Zhang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:18733-18773, 2023.

Abstract

POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-lee23a,
  title     = {Learning in {POMDP}s is Sample-Efficient with Hindsight Observability},
  author    = {Lee, Jonathan and Agarwal, Alekh and Dann, Christoph and Zhang, Tong},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {18733--18773},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/lee23a/lee23a.pdf},
  url       = {https://proceedings.mlr.press/v202/lee23a.html},
  abstract  = {POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.}
}
Endnote
%0 Conference Paper
%T Learning in POMDPs is Sample-Efficient with Hindsight Observability
%A Jonathan Lee
%A Alekh Agarwal
%A Christoph Dann
%A Tong Zhang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-lee23a
%I PMLR
%P 18733--18773
%U https://proceedings.mlr.press/v202/lee23a.html
%V 202
%X POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
APA
Lee, J., Agarwal, A., Dann, C. & Zhang, T. (2023). Learning in POMDPs is Sample-Efficient with Hindsight Observability. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:18733-18773. Available from https://proceedings.mlr.press/v202/lee23a.html.