Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback

Zihan Zhang, Yuxin Chen, Jason D. Lee, Simon Shaolei Du, Ruosong Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74692-74713, 2025.

Abstract

In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback the agent only observes the cumulative reward along the trajectory, which makes this model particularly suitable for scenarios where querying the reward at each individual step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by combining the RL with trajectory feedback setting with techniques from linear bandits, and (ii) constructing a reference transition model to better guide the exploration process.
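
To make the setting concrete, the following sketch gives a standard formulation of trajectory feedback and episodic regret consistent with the abstract; the precise observation model (for example, whether the trajectory return is corrupted by noise) is an assumption here and is specified in the paper itself. In episode $k$, the agent executes a policy $\pi_k$, generates a trajectory $(s_1^k, a_1^k, \dots, s_H^k, a_H^k)$, and observes only the trajectory return
$$R_k = \sum_{h=1}^{H} r(s_h^k, a_h^k) \quad \text{(possibly with additive noise)},$$
rather than the individual per-step rewards $r(s_h^k, a_h^k)$. The regret after $K$ episodes is measured against an optimal policy $\pi^\star$,
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V_1^{\pi^\star}(s_1^k) - V_1^{\pi_k}(s_1^k) \right),$$
and the bound above states that this quantity is $\tilde{O}\left(\sqrt{SAH^3K}\right)$ even with only trajectory-level reward information.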

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25l,
  title     = {Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback},
  author    = {Zhang, Zihan and Chen, Yuxin and Lee, Jason D. and Du, Simon Shaolei and Wang, Ruosong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {74692--74713},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25l/zhang25l.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25l.html},
  abstract  = {In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.}
}
Endnote
%0 Conference Paper
%T Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback
%A Zihan Zhang
%A Yuxin Chen
%A Jason D. Lee
%A Simon Shaolei Du
%A Ruosong Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25l
%I PMLR
%P 74692--74713
%U https://proceedings.mlr.press/v267/zhang25l.html
%V 267
%X In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.
APA
Zhang, Z., Chen, Y., Lee, J.D., Du, S.S. & Wang, R. (2025). Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:74692-74713. Available from https://proceedings.mlr.press/v267/zhang25l.html.