Adversarial Inception Backdoor Attacks against Reinforcement Learning

Ethan Rathbun, Alina Oprea, Christopher Amato
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:51273-51296, 2025.

Abstract

Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment, while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent’s rewards, inducing values far outside the environment’s natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These “inception” attacks manipulate the agent’s training data, inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework we devise an online inception attack which achieves a 100% attack success rate in multiple environments under constrained rewards while minimally impacting the agent’s task performance.
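
To make the mechanism described above concrete, here is a minimal, hypothetical Python sketch of how a single replay-buffer transition might be poisoned in this style: the trigger is stamped into the prior observation, the stored action is relabeled with the adversary's target action, and the reward stays inside the environment's natural bounds. The function names (apply_trigger, poison_transition), the patch trigger, and the reward range are illustrative assumptions, not the authors' released implementation.

import numpy as np

def apply_trigger(obs: np.ndarray, patch_value: float = 1.0, size: int = 3) -> np.ndarray:
    """Stamp a fixed square patch (the trigger) into the top-left corner of an observation."""
    poisoned = obs.copy()
    poisoned[:size, :size] = patch_value
    return poisoned

def poison_transition(obs, action, reward, next_obs, done,
                      target_action, reward_bounds=(-1.0, 1.0)):
    """Poison one stored transition: insert the trigger into the prior observation and
    replace the agent's (high-return) action with the adversary's target action, while
    keeping the reward within the environment's natural constraints."""
    lo, hi = reward_bounds
    return (
        apply_trigger(obs),              # trigger inserted into the prior observation
        target_action,                   # high-return action swapped for the target behavior
        float(np.clip(reward, lo, hi)),  # reward never leaves the environment's range
        next_obs,
        done,
    )

# Example: poison a single transition from an image-observation environment.
obs = np.random.rand(84, 84)
poisoned = poison_transition(obs, action=2, reward=0.5, next_obs=obs, done=False,
                             target_action=0)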

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-rathbun25a,
  title     = {Adversarial Inception Backdoor Attacks against Reinforcement Learning},
  author    = {Rathbun, Ethan and Oprea, Alina and Amato, Christopher},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {51273--51296},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/rathbun25a/rathbun25a.pdf},
  url       = {https://proceedings.mlr.press/v267/rathbun25a.html},
  abstract  = {Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment, while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent’s rewards, inducing values far outside the environment’s natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These “inception” attacks manipulate the agent’s training data, inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework we devise an online inception attack which achieves a 100% attack success rate in multiple environments under constrained rewards while minimally impacting the agent’s task performance.}
}
Endnote
%0 Conference Paper
%T Adversarial Inception Backdoor Attacks against Reinforcement Learning
%A Ethan Rathbun
%A Alina Oprea
%A Christopher Amato
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-rathbun25a
%I PMLR
%P 51273--51296
%U https://proceedings.mlr.press/v267/rathbun25a.html
%V 267
%X Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment, while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent’s rewards, inducing values far outside the environment’s natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These “inception” attacks manipulate the agent’s training data, inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework we devise an online inception attack which achieves a 100% attack success rate in multiple environments under constrained rewards while minimally impacting the agent’s task performance.
APA
Rathbun, E., Oprea, A. & Amato, C. (2025). Adversarial Inception Backdoor Attacks against Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:51273-51296. Available from https://proceedings.mlr.press/v267/rathbun25a.html.
