One Shot Inverse Reinforcement Learning for Stochastic Linear Bandits
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:1491-1512, 2024.
Abstract
The paradigm of inverse reinforcement learning (IRL) is used to specify the reward function of an agent purely from its actions and is critical for value alignment and AI safety. While IRL is successful in practice, theoretical guarantees remain nascent. Motivated by the need for IRL in large action spaces with limited data, we consider as a first step the problem of learning from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. When the demonstrator employs the Phased Elimination algorithm, we develop a simple inverse learning procedure that estimates the linear reward function consistently in the time horizon with just a single demonstration. In particular, we show that our inverse learner approximates the true reward parameter to within an error of $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega}})$ (where $T$ is the length of the demonstrator's trajectory and $\omega$ is a constant that depends on the geometry of the action set). We complement this result with an information-theoretic lower bound for any inverse learning procedure. We corroborate our theoretical results with simulations on synthetic data and a demonstration constructed from the MovieLens dataset.
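To make the setting concrete, the sketch below illustrates the general idea of inverse learning from a single bandit demonstration: an elimination-style demonstrator concentrates its plays on near-optimal actions as the horizon grows, so the actions observed late in the trajectory carry information about the direction of the unknown reward parameter. This is a minimal illustrative sketch only; the simulated demonstrator, the `estimate_reward_direction` heuristic, and all parameter choices are assumptions made here for exposition and are not the Phased Elimination procedure or the inverse learner analyzed in the paper.

```python
import numpy as np

def estimate_reward_direction(trajectory, tail_fraction=0.2):
    """Estimate the direction of the unknown reward parameter theta*
    from a demonstration (a sequence of action vectors) of a linear
    bandit algorithm.

    Heuristic sketch: elimination-style demonstrators play mostly
    near-optimal actions late in the trajectory, so the average of the
    tail actions points roughly along theta*.  (Illustrative only; not
    the estimator analyzed in the paper.)
    """
    actions = np.asarray(trajectory)                      # shape (T, d)
    tail = actions[int((1 - tail_fraction) * len(actions)):]
    direction = tail.mean(axis=0)
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else direction

# --- toy usage on synthetic data (hypothetical setup) ---
rng = np.random.default_rng(0)
d, K, T = 5, 50, 2000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
arms = rng.normal(size=(K, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

# Crude stand-in for an elimination-based demonstrator: repeatedly play
# the surviving arms, then drop arms whose value falls outside a
# shrinking tolerance of the best surviving arm.  theta_star is used
# only to simulate the demonstrator; the inverse learner never sees it.
surviving = np.arange(K)
demo = []
for phase in range(1, 8):
    plays_per_arm = 2 ** phase
    for idx in surviving:
        demo.extend([arms[idx]] * plays_per_arm)
    vals = arms[surviving] @ theta_star
    surviving = surviving[vals >= vals.max() - 2.0 ** (-phase)]
    if len(demo) >= T:
        break

theta_hat = estimate_reward_direction(demo)
print("cosine similarity with theta*:", float(theta_hat @ theta_star))
```

As the trajectory length grows, the surviving (and hence played) actions align increasingly with the optimal one, so the cosine similarity printed above approaches 1; the paper quantifies the analogous convergence for its inverse learner at rate $\mathcal{O}(T^{-\frac{\omega - 1}{2\omega}})$.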