Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation

Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45624-45662, 2025.

Abstract

Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL’s inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent’s cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
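For context, the worst-case objective that alternating agent–adversary training typically targets, and a penalized relaxation of the kind the abstract alludes to, can be sketched as follows. This is an illustrative sketch using the standard state-adversarial formulation; the perturbation set B_ε, distance d, and penalty weight β are assumptions for exposition, not the paper's exact definitions.

\[
\max_{\pi}\;\min_{\nu:\,\nu(s_t)\in B_{\epsilon}(s_t)}\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right],
\qquad a_t\sim\pi\!\left(\cdot\mid\nu(s_t)\right),
\]
\[
\text{soft-constrained relaxation:}\quad
\max_{\pi}\;\min_{\nu}\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right]
\;+\;\beta\,\mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,d\!\left(\nu(s_t),s_t\right)\right].
\]

In the hard-constrained form, the inner minimization over the adversary ν depends on the current policy π, which is why naive alternating training requires fresh environment interactions; replacing the constraint with a penalty term is what allows the inner problem to be handled off-policy, as the paper proposes.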

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-nakanishi25a,
  title     = {Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation},
  author    = {Nakanishi, Kosuke and Kubo, Akihiro and Yasui, Yuji and Ishii, Shin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45624--45662},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nakanishi25a/nakanishi25a.pdf},
  url       = {https://proceedings.mlr.press/v267/nakanishi25a.html},
  abstract  = {Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL’s inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent’s cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.}
}
Endnote
%0 Conference Paper
%T Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation
%A Kosuke Nakanishi
%A Akihiro Kubo
%A Yuji Yasui
%A Shin Ishii
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-nakanishi25a
%I PMLR
%P 45624--45662
%U https://proceedings.mlr.press/v267/nakanishi25a.html
%V 267
%X Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL’s inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent’s cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
APA
Nakanishi, K., Kubo, A., Yasui, Y. & Ishii, S. (2025). Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45624-45662. Available from https://proceedings.mlr.press/v267/nakanishi25a.html.
