Infinite-Horizon Reinforcement Learning with Multinomial Logit Function Approximation

Jaehyun Park, Junyeop Kwon, Dabeen Lee
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:361-369, 2025.

Abstract

We study model-based reinforcement learning with non-linear function approximation, where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logit (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both the infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret, where $\gamma$ is the discount factor. We then complement these upper bounds with several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes; this improves upon the best-known lower bound for the finite-horizon setting.
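
For context, the MNL transition model referenced in the abstract is commonly parameterized as follows; the notation below ($\varphi$, $\theta^*$, $\mathcal{S}_{s,a}$) is illustrative and not taken from this page, so it should be read as a sketch of the standard formulation rather than the paper's exact definitions:

$$
P_{\theta^*}(s' \mid s, a) \;=\; \frac{\exp\big(\varphi(s,a,s')^\top \theta^*\big)}{\sum_{\tilde{s} \in \mathcal{S}_{s,a}} \exp\big(\varphi(s,a,\tilde{s})^\top \theta^*\big)},
$$

where $\varphi(s,a,s') \in \mathbb{R}^d$ is a known feature mapping, $\theta^* \in \mathbb{R}^d$ is an unknown parameter to be learned, and $\mathcal{S}_{s,a}$ is the set of states reachable from the state-action pair $(s,a)$. The dimension $d$ appearing in the regret bounds above is the dimension of this feature mapping.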

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-park25a,
  title     = {Infinite-Horizon Reinforcement Learning with Multinomial Logit Function Approximation},
  author    = {Park, Jaehyun and Kwon, Junyeop and Lee, Dabeen},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {361--369},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/park25a/park25a.pdf},
  url       = {https://proceedings.mlr.press/v258/park25a.html},
  abstract  = {We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logit (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret where $\gamma$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.}
}
Endnote
%0 Conference Paper
%T Infinite-Horizon Reinforcement Learning with Multinomial Logit Function Approximation
%A Jaehyun Park
%A Junyeop Kwon
%A Dabeen Lee
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-park25a
%I PMLR
%P 361--369
%U https://proceedings.mlr.press/v258/park25a.html
%V 258
%X We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logit (MNL) model. We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings. For average-reward communicating MDPs, the algorithm guarantees a regret upper bound of $\tilde{\mathcal{O}}(dD\sqrt{T})$ where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. For discounted-reward MDPs, our algorithm achieves $\tilde{\mathcal{O}}(d(1-\gamma)^{-2}\sqrt{T})$ regret where $\gamma$ is the discount factor. Then we complement these upper bounds by providing several regret lower bounds. We prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs of diameter $D$ and a lower bound of $\Omega(d(1-\gamma)^{-3/2}\sqrt{T})$ for learning discounted-reward MDPs with discount factor $\gamma$. Lastly, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
APA
Park, J., Kwon, J., & Lee, D. (2025). Infinite-Horizon Reinforcement Learning with Multinomial Logit Function Approximation. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:361-369. Available from https://proceedings.mlr.press/v258/park25a.html.