Adversarial Online Multi-Task Reinforcement Learning

Quan Nguyen, Nishant Mehta
Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:1124-1165, 2023.

Abstract

We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner’s objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ on the sample complexity of a class of \emph{uniformly good} cluster-then-learn algorithms. We use a novel construction called the \emph{2-JAO MDP} to prove the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains a $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and a $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependence on $K$ and $\frac{1}{\lambda^2}$ is tight.
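To make the cluster-then-learn template concrete, below is a minimal, hypothetical Python sketch. It is not the paper's algorithm: one-state MDPs (i.e., multi-armed bandits) stand in for finite-horizon MDPs, the probe budget n_probe and the lam / 2 threshold test are placeholder stand-ins for the paper's clustering procedure, and a greedy policy replaces a proper learning-phase RL algorithm. All names (play, run, models) are illustrative.

import numpy as np

rng = np.random.default_rng(0)

# Toy setting: each task is a one-state MDP (a bandit) with A actions and
# Bernoulli rewards; the two models below are lambda-separated with lam = 0.8.
A, H, lam = 4, 5, 0.8
models = [np.array([0.9, 0.1, 0.1, 0.1]),   # task 0
          np.array([0.1, 0.9, 0.1, 0.1])]   # task 1

def play(task, action):
    # One sample from the given task: Bernoulli reward with the task's mean.
    return rng.random() < models[task][action]

def run(task_sequence, n_probe):
    # Schematic cluster-then-learn loop (not the paper's algorithm).
    clusters = []                    # [reward sums, pull counts] per cluster
    total = 0.0
    for task in task_sequence:
        # Clustering phase: probe every action n_probe times.
        sums, counts = np.zeros(A), np.zeros(A)
        for a in range(A):
            for _ in range(n_probe):
                sums[a] += play(task, a)
                counts[a] += 1
        means = sums / counts
        # Match the episode to an existing cluster if all empirical means
        # agree within lam / 2; otherwise open a new cluster.
        c = next((i for i, (s, n) in enumerate(clusters)
                  if np.max(np.abs(s / n - means)) < lam / 2), None)
        if c is None:
            clusters.append([sums, counts])
            c = len(clusters) - 1
        else:
            clusters[c][0] += sums
            clusters[c][1] += counts
        # Learning phase: act greedily under the cluster's pooled estimate
        # (a stand-in for an optimistic RL algorithm run with cluster data).
        best = int(np.argmax(clusters[c][0] / clusters[c][1]))
        total += sum(play(task, best) for _ in range(H))
    return total

tasks = [int(rng.integers(2)) for _ in range(20)]  # adversary: any task order
print(run(tasks, n_probe=16))

The structure mirrors the abstract's two phases: each episode first spends samples deciding which $\lambda$-separated cluster the current task belongs to, then exploits the cluster's pooled statistics. Taking a probe budget of order $\frac{1}{\lambda^2}$ per episode is consistent with the abstract's $\tilde{O}(\frac{K}{\lambda^2})$ clustering cost, while the learning phase pools data within each of at most $M$ clusters.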

Cite this Paper


BibTeX
@InProceedings{pmlr-v201-nguyen23a,
  title = {Adversarial Online Multi-Task Reinforcement Learning},
  author = {Nguyen, Quan and Mehta, Nishant},
  booktitle = {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages = {1124--1165},
  year = {2023},
  editor = {Agrawal, Shipra and Orabona, Francesco},
  volume = {201},
  series = {Proceedings of Machine Learning Research},
  month = {20 Feb--23 Feb},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v201/nguyen23a/nguyen23a.pdf},
  url = {https://proceedings.mlr.press/v201/nguyen23a.html},
  abstract = {We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner’s objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ on the sample complexity of a class of \emph{uniformly good} cluster-then-learn algorithms. We use a novel construction called the \emph{2-JAO MDP} to prove the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains a $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and a $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependence on $K$ and $\frac{1}{\lambda^2}$ is tight.}
}
Endnote
%0 Conference Paper
%T Adversarial Online Multi-Task Reinforcement Learning
%A Quan Nguyen
%A Nishant Mehta
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona
%F pmlr-v201-nguyen23a
%I PMLR
%P 1124--1165
%U https://proceedings.mlr.press/v201/nguyen23a.html
%V 201
%X We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner’s objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ on the sample complexity of a class of \emph{uniformly good} cluster-then-learn algorithms. We use a novel construction called the \emph{2-JAO MDP} to prove the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains a $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and a $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependence on $K$ and $\frac{1}{\lambda^2}$ is tight.
APA
Nguyen, Q. & Mehta, N. (2023). Adversarial Online Multi-Task Reinforcement Learning. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:1124-1165. Available from https://proceedings.mlr.press/v201/nguyen23a.html.

Related Material

Download PDF: https://proceedings.mlr.press/v201/nguyen23a/nguyen23a.pdf