Improved Off-policy Reinforcement Learning in Biological Sequence Design

Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, Jinkyoo Park
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:30290-30315, 2025.

Abstract

Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.
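The noise-and-denoise step described in the abstract can be sketched compactly. Below is a minimal, illustrative Python sketch of one $\delta$-Conservative Search step under stated assumptions: `policy_fill` is a hypothetical stand-in for the trained policy that proposes a token at a masked position, and the uniform "toy policy" in the usage example is for demonstration only; this is not the authors' implementation.

```python
import random

MASK = "?"  # placeholder symbol for a masked token

def conservative_search(sequence, delta, policy_fill, rng=random):
    """One delta-Conservative Search step: mask tokens with probability delta,
    then let the policy denoise only the masked positions."""
    # 1) Noise injection: independently mask each token with probability delta.
    #    Smaller delta (e.g. under high proxy uncertainty) keeps the sample
    #    closer to the trusted high-score offline sequence.
    masked = [MASK if rng.random() < delta else tok for tok in sequence]
    # 2) Denoising: the policy proposes tokens at masked positions,
    #    while unmasked tokens are copied from the offline sequence.
    return [policy_fill(masked, i) if tok == MASK else tok
            for i, tok in enumerate(masked)]

# Toy usage with a uniform random "policy" over DNA nucleotides (illustrative only).
if __name__ == "__main__":
    offline_seq = list("ACGTACGTAC")
    toy_policy = lambda seq, i: random.choice("ACGT")
    print("".join(conservative_search(offline_seq, delta=0.2, policy_fill=toy_policy)))
```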

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-kim25q,
  title     = {Improved Off-policy Reinforcement Learning in Biological Sequence Design},
  author    = {Kim, Hyeonah and Kim, Minsu and Yun, Taeyoung and Choi, Sanghyeok and Bengio, Emmanuel and Hern\'{a}ndez-Garc\'{\i}a, Alex and Park, Jinkyoo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {30290--30315},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/kim25q/kim25q.pdf},
  url       = {https://proceedings.mlr.press/v267/kim25q.html},
  abstract  = {Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances the off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.}
}
Endnote
%0 Conference Paper
%T Improved Off-policy Reinforcement Learning in Biological Sequence Design
%A Hyeonah Kim
%A Minsu Kim
%A Taeyoung Yun
%A Sanghyeok Choi
%A Emmanuel Bengio
%A Alex Hernández-García
%A Jinkyoo Park
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-kim25q
%I PMLR
%P 30290--30315
%U https://proceedings.mlr.press/v267/kim25q.html
%V 267
%X Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances the off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.
APA
Kim, H., Kim, M., Yun, T., Choi, S., Bengio, E., Hernández-García, A. & Park, J. (2025). Improved Off-policy Reinforcement Learning in Biological Sequence Design. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:30290-30315. Available from https://proceedings.mlr.press/v267/kim25q.html.