MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16237-16272, 2025.

Abstract

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings that model different misalignment failure modes: 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
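
To make the "short-sighted optimization with far-sighted reward" idea concrete, below is a minimal sketch (not the authors' implementation) contrasting the per-step training targets of ordinary RL with a MONA-style setup. The function names, the toy reward values, and the `approvals` signal (an overseer's far-sighted judgment of each individual action) are illustrative assumptions, not details taken from the paper.

# Minimal illustrative sketch: ordinary RL credits each step with all future
# reward, while a MONA-style target credits each step only with its immediate
# reward plus a far-sighted overseer approval of that step.

from typing import List


def ordinary_rl_targets(env_rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Standard discounted return-to-go: each step is credited with all future
    reward, so a multi-step reward hack that pays off later is still reinforced."""
    targets = [0.0] * len(env_rewards)
    future = 0.0
    for t in reversed(range(len(env_rewards))):
        future = env_rewards[t] + gamma * future
        targets[t] = future
    return targets


def mona_targets(env_rewards: List[float], approvals: List[float]) -> List[float]:
    """Myopic optimization with non-myopic approval (illustrative): each step is
    credited only with its immediate reward plus the overseer's approval of that
    step. Future reward is never propagated backward, so the optimizer gains no
    incentive to set up multi-step plans the overseer would not endorse."""
    return [r + a for r, a in zip(env_rewards, approvals)]


if __name__ == "__main__":
    env_rewards = [0.0, 0.0, 1.0]   # toy episode: payoff arrives only at the last step
    approvals = [0.5, 0.4, 0.6]     # hypothetical overseer scores for each action
    print("ordinary RL targets:", ordinary_rl_targets(env_rewards))
    print("MONA-style targets: ", mona_targets(env_rewards, approvals))

In this toy example the ordinary RL targets propagate the final reward back to the earlier steps, whereas the MONA-style targets evaluate each step on its own merits plus the overseer's foresighted approval; any particular policy-gradient or value-learning machinery built on top of these targets is left out of the sketch.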

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-farquhar25a,
  title     = {{MONA}: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking},
  author    = {Farquhar, Sebastian and Varma, Vikrant and Lindner, David and Elson, David and Biddulph, Caleb and Goodfellow, Ian and Shah, Rohin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {16237--16272},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/farquhar25a/farquhar25a.pdf},
  url       = {https://proceedings.mlr.press/v267/farquhar25a.html},
  abstract  = {Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.}
}
Endnote
%0 Conference Paper
%T MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
%A Sebastian Farquhar
%A Vikrant Varma
%A David Lindner
%A David Elson
%A Caleb Biddulph
%A Ian Goodfellow
%A Rohin Shah
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-farquhar25a
%I PMLR
%P 16237--16272
%U https://proceedings.mlr.press/v267/farquhar25a.html
%V 267
%X Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.
APA
Farquhar, S., Varma, V., Lindner, D., Elson, D., Biddulph, C., Goodfellow, I. & Shah, R. (2025). MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16237-16272. Available from https://proceedings.mlr.press/v267/farquhar25a.html.