Learning to Reuse Policies in State Evolvable Environments

Ziqian Zhang, Bohan Yang, Lihe Li, Yuqi Bian, Ruiqi Xue, Feng Chen, Yi-Chen Li, Lei Yuan, Yang Yu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76451-76476, 2025.

Abstract

A policy trained via reinforcement learning (RL) makes decisions based on sensor-derived state features. State features commonly evolve, for example through periodic sensor maintenance or the addition of new sensors for performance improvement. The deployed policy then fails in the new state space, whose state features were unseen during training. Previous work tackles this challenge either by training a sensor-invariant policy or by generating multiple policies and selecting the appropriate one from limited samples. However, both directions struggle to guarantee performance under unpredictable evolutions. In this paper, we formalize this problem as state evolvable reinforcement learning (SERL), where the agent must mitigate policy degradation after state evolutions without costly exploration. We propose Lapse, which reuses policies learned in the old state space in two distinct ways. On one hand, Lapse directly reuses the robust old policy by composing it with a learned state reconstruction model to handle vanishing sensors. On the other hand, Lapse reuses the behavioral experience of the old policy to train a new adaptive policy through offline learning, better utilizing the new sensors. To leverage the advantages of both policies in different scenarios, we further propose automatic ensemble weight adjustment to aggregate them effectively. Theoretically, we justify that robust policy reuse helps mitigate uncertainty and error from both evolution and reconstruction. Empirically, Lapse achieves a significant performance improvement, outperforming the strongest baseline by about $2\times$ in benchmark environments.
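
As a rough illustration of the two reuse paths described in the abstract, the minimal sketch below queries the old policy through a state reconstruction model, queries a new policy learned offline on the evolved state space, and aggregates the two actions with an ensemble weight. All names (old_policy, new_policy, reconstruct_old_state, ensemble_weight) and the linear reconstruction model are hypothetical placeholders under assumed dimensions, not the authors' implementation.

# Hypothetical sketch of the policy-reuse idea described in the abstract;
# names and the linear reconstruction model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
OLD_DIM, NEW_DIM, ACTION_DIM = 4, 5, 2

# Stand-ins for the two reused components: the robust policy trained on the
# old state space, and an adaptive policy trained offline on the new space.
def old_policy(old_state: np.ndarray) -> np.ndarray:
    return np.tanh(old_state[:ACTION_DIM])

def new_policy(new_state: np.ndarray) -> np.ndarray:
    return np.tanh(new_state[:ACTION_DIM])

# Learned state reconstruction model: maps the evolved (new) state back to
# the old feature layout so the old policy can still be queried.
# A fixed linear map stands in for the learned model here.
W = rng.normal(scale=0.1, size=(OLD_DIM, NEW_DIM))

def reconstruct_old_state(new_state: np.ndarray) -> np.ndarray:
    return W @ new_state

def lapse_action(new_state: np.ndarray, ensemble_weight: float) -> np.ndarray:
    """Aggregate both reused policies; ensemble_weight in [0, 1] is assumed to
    be adjusted automatically, with higher values trusting the new policy."""
    a_old = old_policy(reconstruct_old_state(new_state))
    a_new = new_policy(new_state)
    return (1.0 - ensemble_weight) * a_old + ensemble_weight * a_new

# Example: act on a state drawn from the evolved sensor layout.
state = rng.normal(size=NEW_DIM)
print(lapse_action(state, ensemble_weight=0.3))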

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25ck,
  title     = {Learning to Reuse Policies in State Evolvable Environments},
  author    = {Zhang, Ziqian and Yang, Bohan and Li, Lihe and Bian, Yuqi and Xue, Ruiqi and Chen, Feng and Li, Yi-Chen and Yuan, Lei and Yu, Yang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76451--76476},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25ck/zhang25ck.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25ck.html}
}
Endnote
%0 Conference Paper
%T Learning to Reuse Policies in State Evolvable Environments
%A Ziqian Zhang
%A Bohan Yang
%A Lihe Li
%A Yuqi Bian
%A Ruiqi Xue
%A Feng Chen
%A Yi-Chen Li
%A Lei Yuan
%A Yang Yu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25ck
%I PMLR
%P 76451--76476
%U https://proceedings.mlr.press/v267/zhang25ck.html
%V 267
APA
Zhang, Z., Yang, B., Li, L., Bian, Y., Xue, R., Chen, F., Li, Y., Yuan, L. & Yu, Y. (2025). Learning to Reuse Policies in State Evolvable Environments. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76451-76476. Available from https://proceedings.mlr.press/v267/zhang25ck.html.