Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3061-3069, 2024.

Abstract

We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\tilde{\mathcal{O}}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves upon the previous best-known $\tilde{\mathcal{O}}(dS^2\sqrt{K} + \sqrt{HSAK})$ bound of Zhao et al. (2023a), since $H \leq S$ holds under the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality, tailored to handle non-independent noises, that was originally proposed in the dynamic assortment literature and is applied here for the first time in reinforcement learning to handle correlations between different states.
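As a sanity check on the claimed improvement (immediate from the abstract, though not spelled out there): the layered structure gives $H \leq S$, so the leading term of the new bound is never worse than that of Zhao et al. (2023a),

$$d\sqrt{HS^3K} \;\leq\; d\sqrt{S \cdot S^3 \cdot K} \;=\; dS^2\sqrt{K},$$

and whenever $H \ll S$ the gain is a factor of $\sqrt{S/H}$.

To make the first ingredient concrete, below is a minimal, hypothetical sketch (all names are ours, not the paper's) of the high-level idea behind a least-squares estimator for the transition parameter of a linear mixture MDP, where $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$ for a known feature mapping $\phi$: regress next-state indicators on the features, with one regression target per candidate next state rather than only the realized one. The paper's actual estimator and its confidence set differ in the details.

import numpy as np

def estimate_theta(features, indicators, lam=1.0):
    # Ridge-regression sketch of a transition-parameter estimate
    # (illustrative only; names and regularization are assumptions).
    # features:   (n, d) array; row i is phi(s' | s, a) for one
    #             (visited state-action, candidate next-state) pair.
    # indicators: (n,) array; entry i is 1 if that candidate s' was
    #             the observed next state, else 0.
    d = features.shape[1]
    gram = lam * np.eye(d) + features.T @ features   # regularized Gram matrix
    moment = features.T @ indicators
    return np.linalg.solve(gram, moment)             # estimated theta

Stacking one row per candidate next state is what lets the estimator use the visit information of all states, which the abstract identifies as the key difference from prior work.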

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-li24n,
  title     = {Improved Algorithm for Adversarial Linear Mixture {MDPs} with Bandit Feedback and Unknown Transition},
  author    = {Li, Long-Fei and Zhao, Peng and Zhou, Zhi-Hua},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {3061--3069},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/li24n/li24n.pdf},
  url       = {https://proceedings.mlr.press/v238/li24n.html}
}
Endnote
%0 Conference Paper
%T Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
%A Long-Fei Li
%A Peng Zhao
%A Zhi-Hua Zhou
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-li24n
%I PMLR
%P 3061--3069
%U https://proceedings.mlr.press/v238/li24n.html
%V 238
APA
Li, L., Zhao, P. & Zhou, Z. (2024). Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3061-3069. Available from https://proceedings.mlr.press/v238/li24n.html.