Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

Joar Max Viktor Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:32033-32058, 2023.

Abstract

It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability on several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.
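As a concrete illustration of the partial identifiability described in the abstract (a minimal sketch, not taken from the paper): potential shaping (Ng et al., 1999) is a classic transformation of a reward function that leaves the optimal policy unchanged, so data consisting only of optimal behaviour cannot distinguish a reward function from its potential-shaped variants. The Python snippet below builds a small random MDP, shapes its reward with an arbitrary potential function, and checks that value iteration yields the same greedy policy for both rewards; the MDP, the potential, and all constants are illustrative assumptions, not artefacts from the paper.

import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Transition probabilities P[s, a] over next states, drawn at random.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# A base reward R1(s, a) and a potential-shaped reward
# R2(s, a) = R1(s, a) + gamma * E_{s'}[phi(s')] - phi(s).
R1 = rng.normal(size=(n_states, n_actions))
phi = rng.normal(size=n_states)          # arbitrary potential function
R2 = R1 + gamma * (P @ phi) - phi[:, None]

def optimal_policy(R, tol=1e-10):
    # Greedy policy obtained by value iteration on reward R(s, a).
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

pi1, pi2 = optimal_policy(R1), optimal_policy(R2)
print(pi1, pi2, np.array_equal(pi1, pi2))   # the two greedy policies coincide

Running this prints two identical policies, illustrating that optimal behaviour, as a data source, is invariant to potential shaping of the reward.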

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-skalse23a,
  title     = {Invariance in Policy Optimisation and Partial Identifiability in Reward Learning},
  author    = {Skalse, Joar Max Viktor and Farrugia-Roberts, Matthew and Russell, Stuart and Abate, Alessandro and Gleave, Adam},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {32033--32058},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/skalse23a/skalse23a.pdf},
  url       = {https://proceedings.mlr.press/v202/skalse23a.html},
  abstract  = {It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.}
}
Endnote
%0 Conference Paper
%T Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
%A Joar Max Viktor Skalse
%A Matthew Farrugia-Roberts
%A Stuart Russell
%A Alessandro Abate
%A Adam Gleave
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-skalse23a
%I PMLR
%P 32033--32058
%U https://proceedings.mlr.press/v202/skalse23a.html
%V 202
%X It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.
APA
Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A. & Gleave, A. (2023). Invariance in Policy Optimisation and Partial Identifiability in Reward Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:32033-32058. Available from https://proceedings.mlr.press/v202/skalse23a.html.
