On Constant Regret for Low-Rank MDPs

Alexander Sturm, Sebastian Tschiatschek
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:4044-4079, 2025.

Abstract

Although there exist instance-dependent regret bounds for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.
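
The abstract's two key ingredients, an optimism-based (UCB-style) bonus over learned features and a spectral condition on those features, can be illustrated with a small numerical sketch. The snippet below is purely illustrative and assumes nothing about UniSREP-UCB's actual objective or bonus, which are defined in the paper: the helper names ucb_bonus and min_eig are hypothetical, and random vectors stand in for features the algorithm would learn. It only shows two quantities such methods commonly monitor: an elliptical bonus of the form beta * sqrt(phi^T Sigma^{-1} phi), and the smallest eigenvalue of the regularized empirical feature covariance, one common way to operationalize "features with good spectral properties".

import numpy as np

def ucb_bonus(phi, cov_inv, beta=1.0):
    """Elliptical confidence bonus: beta * sqrt(phi^T Sigma^{-1} phi)."""
    return beta * np.sqrt(phi @ cov_inv @ phi)

def min_eig(features, lam=1e-3):
    """Smallest eigenvalue of the regularized empirical feature covariance.

    A large value means the collected features excite every direction of the
    feature space; this is one (hypothetical) proxy for 'good spectral
    properties', not the paper's actual constraint.
    """
    d = features.shape[1]
    cov = features.T @ features / len(features) + lam * np.eye(d)
    return np.linalg.eigvalsh(cov)[0]  # eigvalsh returns ascending eigenvalues

# Toy usage: random features standing in for learned ones.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))          # n = 500 samples, d = 4 features
cov = feats.T @ feats + 1e-3 * np.eye(4)   # regularized Gram matrix
cov_inv = np.linalg.inv(cov)
print("bonus for first sample:", ucb_bonus(feats[0], cov_inv))
print("min covariance eigenvalue:", min_eig(feats))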

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-sturm25a,
  title     = {On Constant Regret for Low-Rank MDPs},
  author    = {Sturm, Alexander and Tschiatschek, Sebastian},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {4044--4079},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/sturm25a/sturm25a.pdf},
  url       = {https://proceedings.mlr.press/v286/sturm25a.html},
  abstract  = {Although there exist instance-dependent regret bounds for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.}
}
Endnote
%0 Conference Paper
%T On Constant Regret for Low-Rank MDPs
%A Alexander Sturm
%A Sebastian Tschiatschek
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-sturm25a
%I PMLR
%P 4044--4079
%U https://proceedings.mlr.press/v286/sturm25a.html
%V 286
%X Although there exist instance-dependent regret bounds for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.
APA
Sturm, A. & Tschiatschek, S. (2025). On Constant Regret for Low-Rank MDPs. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:4044-4079. Available from https://proceedings.mlr.press/v286/sturm25a.html.
