Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

Chi Jin; Tiancheng Jin; Haipeng Luo; Suvrit Sra; Tiancheng Yu

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4860-4869, 2020.

Abstract

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret as Rosenberg & Mansour (2019a), who consider the easier full-information setting. Our key contributions are twofold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an "upper occupancy bound".
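As a rough illustration of the second contribution, the sketch below shows one way to read "inversely weighted by an upper occupancy bound": the loss observed at a visited state-action pair is divided by an upper bound on the probability of visiting that pair, rather than by the (unknown) true probability. The function names, the `gamma` smoothing term, and the numbers are assumptions made for this example; the abstract does not give the exact form of the estimator.

```python
def optimistic_loss_estimate(observed_loss, visited, upper_occupancy, gamma=0.0):
    """Importance-weighted loss estimate for one state-action pair.

    observed_loss   -- loss revealed by bandit feedback (only meaningful if visited)
    visited         -- True if the pair was visited in this episode
    upper_occupancy -- assumed upper bound on the probability of visiting the pair
    gamma           -- hypothetical extra smoothing term (an assumption of this sketch)
    """
    if not visited:
        # Bandit feedback: nothing is observed for unvisited pairs.
        return 0.0
    denominator = upper_occupancy + gamma
    if denominator <= 0.0:
        raise ValueError("upper_occupancy + gamma must be positive")
    return observed_loss / denominator


def expected_estimate(true_loss, true_occupancy, upper_occupancy, gamma=0.0):
    """Expectation of the estimate when the pair is visited with probability true_occupancy."""
    return true_occupancy * optimistic_loss_estimate(
        true_loss, True, upper_occupancy, gamma
    )


if __name__ == "__main__":
    # With upper_occupancy >= true_occupancy, the estimate never overestimates
    # the true loss in expectation, which is the sense in which it is optimistic.
    print(expected_estimate(true_loss=0.8, true_occupancy=0.2, upper_occupancy=0.3, gamma=0.05))
```

Dividing by an upper bound instead of the true visitation probability biases the estimate downward, so under these assumptions the learner errs on the side of underestimating losses for rarely visited pairs.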

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-jin20c,
  title = 	 {Learning Adversarial {M}arkov Decision Processes with Bandit Feedback and Unknown Transition},
  author =       {Jin, Chi and Jin, Tiancheng and Luo, Haipeng and Sra, Suvrit and Yu, Tiancheng},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {4860--4869},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/jin20c/jin20c.pdf},
  url = 	 {https://proceedings.mlr.press/v119/jin20c.html},
  abstract = 	 {We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret as Rosenberg \& Mansour (2019a), who consider the easier full-information setting. Our key contributions are twofold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an ``upper occupancy bound''.}
}

Endnote

%0 Conference Paper
%T Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
%A Chi Jin
%A Tiancheng Jin
%A Haipeng Luo
%A Suvrit Sra
%A Tiancheng Yu
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-jin20c
%I PMLR
%P 4860--4869
%U https://proceedings.mlr.press/v119/jin20c.html
%V 119
%X We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret as Rosenberg & Mansour (2019a), who consider the easier full-information setting. Our key contributions are twofold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an "upper occupancy bound".

APA

Jin, C., Jin, T., Luo, H., Sra, S., & Yu, T. (2020). Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:4860-4869. Available from https://proceedings.mlr.press/v119/jin20c.html.
