Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration

Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim, Yu Jin Kim, Honglak Lee, Moontae Lee
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:21266-21288, 2024.

Abstract

As the pre-training objectives (e.g., next token prediction) of language models (LMs) are inherently not aligned with task scores, optimizing LMs to achieve higher downstream task scores is essential. One of the promising approaches is to fine-tune LMs through reinforcement learning (RL). However, conventional RL methods based on PPO and a penalty of KL divergence are vulnerable to text degeneration where LMs do not generate natural texts anymore after RL fine-tuning. To address this problem, we provide Degeneration-free Policy Optimization (DfPO) that can fine-tune LMs to generate texts that achieve improved downstream task scores, while preserving the ability to generate natural texts. To achieve this, we introduce KL-masking which masks out the actions that potentially cause deviation from the reference policy when its likelihood is increased or decreased. Then, we devise truncated advantage functions for separately performing likelihood maximization and minimization to improve the task performance. In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-jang24b, title = {Degeneration-free Policy Optimization: {RL} Fine-Tuning for Language Models without Degeneration}, author = {Jang, Youngsoo and Kim, Geon-Hyeong and Kim, Byoungjip and Kim, Yu Jin and Lee, Honglak and Lee, Moontae}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {21266--21288}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/jang24b/jang24b.pdf}, url = {https://proceedings.mlr.press/v235/jang24b.html}, abstract = {As the pre-training objectives (e.g., next token prediction) of language models (LMs) are inherently not aligned with task scores, optimizing LMs to achieve higher downstream task scores is essential. One of the promising approaches is to fine-tune LMs through reinforcement learning (RL). However, conventional RL methods based on PPO and a penalty of KL divergence are vulnerable to text degeneration where LMs do not generate natural texts anymore after RL fine-tuning. To address this problem, we provide Degeneration-free Policy Optimization (DfPO) that can fine-tune LMs to generate texts that achieve improved downstream task scores, while preserving the ability to generate natural texts. To achieve this, we introduce KL-masking which masks out the actions that potentially cause deviation from the reference policy when its likelihood is increased or decreased. Then, we devise truncated advantage functions for separately performing likelihood maximization and minimization to improve the task performance. In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search.} }
Endnote
%0 Conference Paper %T Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration %A Youngsoo Jang %A Geon-Hyeong Kim %A Byoungjip Kim %A Yu Jin Kim %A Honglak Lee %A Moontae Lee %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-jang24b %I PMLR %P 21266--21288 %U https://proceedings.mlr.press/v235/jang24b.html %V 235 %X As the pre-training objectives (e.g., next token prediction) of language models (LMs) are inherently not aligned with task scores, optimizing LMs to achieve higher downstream task scores is essential. One of the promising approaches is to fine-tune LMs through reinforcement learning (RL). However, conventional RL methods based on PPO and a penalty of KL divergence are vulnerable to text degeneration where LMs do not generate natural texts anymore after RL fine-tuning. To address this problem, we provide Degeneration-free Policy Optimization (DfPO) that can fine-tune LMs to generate texts that achieve improved downstream task scores, while preserving the ability to generate natural texts. To achieve this, we introduce KL-masking which masks out the actions that potentially cause deviation from the reference policy when its likelihood is increased or decreased. Then, we devise truncated advantage functions for separately performing likelihood maximization and minimization to improve the task performance. In the experiments, we provide the results of DfPO and baseline algorithms on various generative NLP tasks including text continuation, text detoxification, and commonsense generation. Our experiments demonstrate that DfPO successfully improves the downstream task scores while preserving the ability to generate natural texts, without requiring additional hyperparameter search.
APA
Jang, Y., Kim, G., Kim, B., Kim, Y.J., Lee, H. & Lee, M.. (2024). Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:21266-21288 Available from https://proceedings.mlr.press/v235/jang24b.html.

Related Material