Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models

Alessio Russo, Alberto Maria Metelli, Marcello Restelli
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:4168-4176, 2025.

Abstract

We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of the employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques, providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods, which are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed $\textit{Action-wise OAS-UCRL}$ algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state-of-the-art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
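
To make the setting concrete, here is a minimal, illustrative sketch (not the paper's Action-wise OAS-UCRL algorithm) of the belief-filtering step that underlies deterministic belief-based policies when the observation model is known and the per-action transition matrices are estimated. All names and array shapes below are assumptions introduced for illustration.

import numpy as np

def belief_update(belief, action, obs, T_hat, O):
    # belief : (S,) current distribution over hidden states
    # action : index of the action just taken
    # obs    : index of the observation just received
    # T_hat  : (A, S, S) estimated transition matrices, T_hat[a][s, s'] = P(s' | s, a)
    # O      : (S, num_obs) known observation model, O[s', o] = P(o | s')
    predicted = belief @ T_hat[action]         # propagate the belief through the estimated dynamics
    unnormalized = predicted * O[:, obs]       # reweight by the known observation likelihood
    return unnormalized / unnormalized.sum()   # renormalize to obtain the posterior belief

A deterministic belief-based policy then maps the resulting belief to an action. The sketch covers only the filtering step; the paper's contribution lies in how the per-action matrices are estimated from data gathered under different policies and in how optimism over these estimates yields the stated regret guarantee.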

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-russo25b,
  title     = {Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models},
  author    = {Russo, Alessio and Metelli, Alberto Maria and Restelli, Marcello},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {4168--4176},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/russo25b/russo25b.pdf},
  url       = {https://proceedings.mlr.press/v258/russo25b.html},
  abstract  = {We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods that are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed $\textit{Action-wise OAS-UCRL}$ algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state of the art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.}
}
Endnote
%0 Conference Paper
%T Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models
%A Alessio Russo
%A Alberto Maria Metelli
%A Marcello Restelli
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-russo25b
%I PMLR
%P 4168--4176
%U https://proceedings.mlr.press/v258/russo25b.html
%V 258
%X We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods that are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed $\textit{Action-wise OAS-UCRL}$ algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state of the art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
APA
Russo, A., Metelli, A.M. & Restelli, M. (2025). Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:4168-4176. Available from https://proceedings.mlr.press/v258/russo25b.html.
