RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu, Tusher Chakraborty, Emre Kiciman, Bibek Aryal, Srinagesh Sharma, Songwu Lu, Ranveer Chandra
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:69096-69115, 2025.

Abstract

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM’s correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF’s curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
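To make the selection step concrete, the following is a minimal Python sketch of the kind of targeted loop the abstract describes. It assumes a trained reward model and an LLM-labeled preference dataset; every name in it (rlthf_round, reward_model, human_annotate, the margin-based cutoff, the 7% budget) is an illustrative assumption, not the authors' released implementation.

import numpy as np

def rlthf_round(samples, llm_labels, reward_model, human_annotate, budget_frac=0.07):
    """One illustrative iteration: flag low-reward-margin pairs for human correction."""
    # Score each LLM-labeled (chosen, rejected) pair under the current reward model.
    margins = np.array([
        reward_model(s["prompt"], lab["chosen"]) - reward_model(s["prompt"], lab["rejected"])
        for s, lab in zip(samples, llm_labels)
    ])

    # Pairs in the low tail of the reward-margin distribution are the ones the
    # initial LLM most plausibly mislabeled; only those go to human annotators.
    k = int(budget_frac * len(samples))
    hard_idx = np.argsort(margins)[:k]

    curated = list(llm_labels)
    for i in hard_idx:
        curated[i] = human_annotate(samples[i])  # targeted human correction

    # Retrain the reward model on `curated` (human-corrected hard samples plus
    # the LLM's remaining labels) and repeat.
    return curated

The budget_frac default mirrors the 6-7% human-annotation effort reported in the abstract; in the paper's iterative setting, the cutoff would instead be driven by the reward model's reward distribution, with the curated set fed back into reward-model training on each round.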

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-xu25c,
  title     = {{RLTHF}: Targeted Human Feedback for {LLM} Alignment},
  author    = {Xu, Yifei and Chakraborty, Tusher and Kiciman, Emre and Aryal, Bibek and Sharma, Srinagesh and Lu, Songwu and Chandra, Ranveer},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {69096--69115},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xu25c/xu25c.pdf},
  url       = {https://proceedings.mlr.press/v267/xu25c.html},
  abstract  = {Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM’s correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF’s curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.}
}
Endnote
%0 Conference Paper
%T RLTHF: Targeted Human Feedback for LLM Alignment
%A Yifei Xu
%A Tusher Chakraborty
%A Emre Kiciman
%A Bibek Aryal
%A Srinagesh Sharma
%A Songwu Lu
%A Ranveer Chandra
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xu25c
%I PMLR
%P 69096--69115
%U https://proceedings.mlr.press/v267/xu25c.html
%V 267
%X Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM’s correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF’s curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.
APA
Xu, Y., Chakraborty, T., Kiciman, E., Aryal, B., Sharma, S., Lu, S. & Chandra, R. (2025). RLTHF: Targeted Human Feedback for LLM Alignment. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:69096-69115. Available from https://proceedings.mlr.press/v267/xu25c.html.