On Constant Regret for Low-Rank MDPs
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:4044-4079, 2025.
Abstract
Although instance-dependent regret bounds exist for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which uses a constrained optimization objective to learn features with good spectral properties. Furthermore, we show that our algorithm enjoys constant regret when the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.