Dueling RL: Reinforcement Learning with Trajectory Preferences

Aadirupa Saha, Aldo Pacchiano, Jonathan Lee
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6263-6289, 2023.

Abstract

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only as 1-bit (0/1) preferences over pairs of trajectories rather than absolute rewards for individual trajectories. The success of the traditional reward-based RL framework depends crucially on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\widetilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}\big((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH}\big)$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.
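
To make the feedback model concrete, the following minimal Python sketch simulates the kind of 1-bit trajectory-pair preference described above: each trajectory is mapped to a $d$-dimensional feature vector, and a preference is drawn from a generalized linear model (here with a logistic link) on the feature difference. The function names (traj_features, sample_preference), the toy feature map, and the choice of logistic link are illustrative assumptions, not the paper's construction or algorithm.

# Minimal sketch of GLM-based trajectory-preference feedback (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

d = 5                                    # dimension of the preference model
theta_star = rng.normal(size=d)          # unknown GLM parameter the learner must estimate
theta_star /= np.linalg.norm(theta_star)

def sigmoid(z: float) -> float:
    """Logistic link; any monotone link would also give a GLM."""
    return 1.0 / (1.0 + np.exp(-z))

def traj_features(trajectory: list) -> np.ndarray:
    """Map a trajectory (list of (state, action) pairs) to a d-dimensional feature vector.
    Toy choice: average of fixed random per-pair embeddings, seeded by hashing the pair."""
    phi = np.zeros(d)
    for s, a in trajectory:
        phi += np.random.default_rng(hash((s, a)) % (2**32)).normal(size=d)
    return phi / max(len(trajectory), 1)

def sample_preference(tau_1, tau_2) -> int:
    """Return 1 if tau_1 is preferred over tau_2, else 0 (1-bit feedback).
    P(tau_1 preferred) = sigmoid(<theta_star, phi(tau_1) - phi(tau_2)>)."""
    diff = traj_features(tau_1) - traj_features(tau_2)
    return int(rng.random() < sigmoid(theta_star @ diff))

# Example: compare two short trajectories of (state, action) pairs.
tau_a = [(0, 1), (2, 0), (3, 1)]
tau_b = [(0, 0), (1, 1), (3, 0)]
print(sample_preference(tau_a, tau_b))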

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-saha23a,
  title     = {Dueling RL: Reinforcement Learning with Trajectory Preferences},
  author    = {Saha, Aadirupa and Pacchiano, Aldo and Lee, Jonathan},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {6263--6289},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/saha23a/saha23a.pdf},
  url       = {https://proceedings.mlr.press/v206/saha23a.html},
  abstract  = {We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only as 1-bit (0/1) preferences over pairs of trajectories rather than absolute rewards for individual trajectories. The success of the traditional reward-based RL framework depends crucially on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\widetilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}\big((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH}\big)$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.}
}
Endnote
%0 Conference Paper
%T Dueling RL: Reinforcement Learning with Trajectory Preferences
%A Aadirupa Saha
%A Aldo Pacchiano
%A Jonathan Lee
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-saha23a
%I PMLR
%P 6263--6289
%U https://proceedings.mlr.press/v206/saha23a.html
%V 206
%X We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning (RL), an agent receives feedback only as 1-bit (0/1) preferences over pairs of trajectories rather than absolute rewards for individual trajectories. The success of the traditional reward-based RL framework depends crucially on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with a regret guarantee of $\widetilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with regret $\widetilde{\mathcal{O}}\big((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH}\big)$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.
APA
Saha, A., Pacchiano, A., & Lee, J. (2023). Dueling RL: Reinforcement Learning with Trajectory Preferences. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:6263-6289. Available from https://proceedings.mlr.press/v206/saha23a.html.