One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits

Etash Guha, Jim James, Krishna Acharya, Vidya Muthukumar, Ashwin Pananjady
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:1491-1512, 2024.

Abstract

The paradigm of inverse reinforcement learning (IRL) is used to specify the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a single demonstration. In particular, we show that our inverse learner approximates the true reward parameter within an error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega }})$ (where $T$ is the length of the demonstrator’s trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset.
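
For intuition on how the guarantee scales, plugging representative (purely illustrative) values of the geometric constant $\omega$ into the stated rate $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega}})$ gives
\[
\omega = 2:\ \mathcal{O}\big(T^{-1/4}\big), \qquad \omega = 4:\ \mathcal{O}\big(T^{-3/8}\big), \qquad \omega \to \infty:\ \mathcal{O}\big(T^{-1/2}\big),
\]
so the estimation error decays toward the parametric $T^{-1/2}$ rate as $\omega$ grows. The specific values of $\omega$ above are examples chosen for illustration, not constants computed in the paper.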

Cite this Paper


BibTeX
@InProceedings{pmlr-v244-guha24a,
  title     = {One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits},
  author    = {Guha, Etash and James, Jim and Acharya, Krishna and Muthukumar, Vidya and Pananjady, Ashwin},
  booktitle = {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence},
  pages     = {1491--1512},
  year      = {2024},
  editor    = {Kiyavash, Negar and Mooij, Joris M.},
  volume    = {244},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v244/main/assets/guha24a/guha24a.pdf},
  url       = {https://proceedings.mlr.press/v244/guha24a.html},
  abstract  = {The paradigm of inverse reinforcement learning (IRL) is used to specify the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a single demonstration. In particular, we show that our inverse learner approximates the true reward parameter within an error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega }})$ (where $T$ is the length of the demonstrator’s trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset.}
}
Endnote
%0 Conference Paper
%T One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits
%A Etash Guha
%A Jim James
%A Krishna Acharya
%A Vidya Muthukumar
%A Ashwin Pananjady
%B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2024
%E Negar Kiyavash
%E Joris M. Mooij
%F pmlr-v244-guha24a
%I PMLR
%P 1491--1512
%U https://proceedings.mlr.press/v244/guha24a.html
%V 244
%X The paradigm of inverse reinforcement learning (IRL) is used to specify the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a single demonstration. In particular, we show that our inverse learner approximates the true reward parameter within an error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega }})$ (where $T$ is the length of the demonstrator’s trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset.
APA
Guha, E., James, J., Acharya, K., Muthukumar, V. & Pananjady, A. (2024). One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:1491-1512. Available from https://proceedings.mlr.press/v244/guha24a.html.
