The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Max Viktor Skalse
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:17306-17377, 2025.

Abstract

In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
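
To make the abstract's central notions concrete, here is a minimal sketch of the quantities involved; the notation is ours and need not match the paper's formal definitions. Given a true reward function $R$, a learned reward model $\hat{R}$, and a data distribution $D$ over state-action pairs, the expected test error and the (normalized) regret of a policy $\hat{\pi}$ that is optimal for $\hat{R}$ can be written as

$$
\mathrm{err}_D(\hat{R}) \;=\; \mathbb{E}_{(s,a)\sim D}\big[\,\lvert \hat{R}(s,a) - R(s,a) \rvert\,\big],
\qquad
\mathrm{Reg}_R(\hat{\pi}) \;=\; \frac{J_R(\pi^{*}) - J_R(\hat{\pi})}{J_R(\pi^{*}) - \min_{\pi} J_R(\pi)},
$$

where $J_R(\pi)$ is the expected return of policy $\pi$ under $R$ and $\pi^{*}$ maximizes it. An error-regret mismatch occurs when $\mathrm{err}_D(\hat{R})$ is small while $\mathrm{Reg}_R(\hat{\pi})$ is large. A toy bandit shows how distributional shift makes this possible: with two actions, true rewards $R(a_1)=1$, $R(a_2)=0$, and a data distribution $D(a_1)=0.99$, $D(a_2)=0.01$, a reward model with $\hat{R}(a_1)=1$, $\hat{R}(a_2)=2$ has expected error only $0.01 \cdot 2 = 0.02$, yet the $\hat{R}$-optimal policy always plays $a_2$ and attains the worst possible regret of $1$.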

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-fluri25a,
  title     = {The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret},
  author    = {Fluri, Lukas and Lang, Leon and Abate, Alessandro and Forr\'{e}, Patrick and Krueger, David and Skalse, Joar Max Viktor},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {17306--17377},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/fluri25a/fluri25a.pdf},
  url       = {https://proceedings.mlr.press/v267/fluri25a.html}
}
Endnote
%0 Conference Paper
%T The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
%A Lukas Fluri
%A Leon Lang
%A Alessandro Abate
%A Patrick Forré
%A David Krueger
%A Joar Max Viktor Skalse
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-fluri25a
%I PMLR
%P 17306--17377
%U https://proceedings.mlr.press/v267/fluri25a.html
%V 267
APA
Fluri, L., Lang, L., Abate, A., Forré, P., Krueger, D. & Skalse, J. M. V. (2025). The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:17306-17377. Available from https://proceedings.mlr.press/v267/fluri25a.html.