BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Dinesh Raghu, Sachindra Joshi, Asim Munawar, Ramón Fernandez Astudillo
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:39400-39415, 2024.

Abstract

Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Anthropic HH tasks.
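As a rough sketch of the construction the abstract describes (an illustration only; the particular reward likelihood p(G | x, y) and the variance-reduced gradient estimator are specified in the paper itself), the reward-conditioned posterior follows from applying Bayes' rule to a reference policy and a reward-derived likelihood:

% Bayes' rule: condition the reference policy p_ref on the event G that
% the response y to prompt x is judged "good" by the reward model.
p_{\mathrm{post}}(y \mid x) \;=\; p_{\mathrm{ref}}(y \mid x, G)
    \;=\; \frac{p(G \mid x, y)\, p_{\mathrm{ref}}(y \mid x)}
               {\sum_{y'} p(G \mid x, y')\, p_{\mathrm{ref}}(y' \mid x)}

% Distribution-matching methods such as DPG and GDC fit the policy
% \pi_\theta to a target of this form, e.g. via the forward KL divergence:
\min_{\theta} \; \mathrm{KL}\!\big( p_{\mathrm{post}}(\cdot \mid x) \,\|\, \pi_{\theta}(\cdot \mid x) \big)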

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-pandey24a,
  title     = {{BRAI}n: {B}ayesian Reward-conditioned Amortized Inference for natural language generation from feedback},
  author    = {Pandey, Gaurav and Nandwani, Yatin and Naseem, Tahira and Mishra, Mayank and Xu, Guangxuan and Raghu, Dinesh and Joshi, Sachindra and Munawar, Asim and Astudillo, Ram\'{o}n Fernandez},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {39400--39415},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/pandey24a/pandey24a.pdf},
  url       = {https://proceedings.mlr.press/v235/pandey24a.html},
  abstract  = {Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Anthropic HH tasks.}
}
Endnote
%0 Conference Paper
%T BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
%A Gaurav Pandey
%A Yatin Nandwani
%A Tahira Naseem
%A Mayank Mishra
%A Guangxuan Xu
%A Dinesh Raghu
%A Sachindra Joshi
%A Asim Munawar
%A Ramón Fernandez Astudillo
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-pandey24a
%I PMLR
%P 39400--39415
%U https://proceedings.mlr.press/v235/pandey24a.html
%V 235
%X Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Anthropic HH tasks.
APA
Pandey, G., Nandwani, Y., Naseem, T., Mishra, M., Xu, G., Raghu, D., Joshi, S., Munawar, A. & Astudillo, R.F. (2024). BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:39400-39415. Available from https://proceedings.mlr.press/v235/pandey24a.html.