MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Yifan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Tingting Gao, Zhang Zhang, Fan Yang, Di Zhang, Liang Wang, Rong Jin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76625-76654, 2025.

Abstract

Existing efforts to align multimodal large language models (MLLMs) with human preferences have achieved progress only in narrow areas, such as hallucination reduction, and remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1).
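To make the Dynamic Reward Scaling idea concrete: the abstract describes re-weighting each comparison pair's contribution to the training loss according to a reward signal. The sketch below is a minimal illustration of that idea in PyTorch, not the authors' implementation; it weights a DPO-style preference loss by a clamped reward-model margin, and all names and hyperparameters (dynamic_reward_scaled_loss, beta, w_min, w_max) are illustrative assumptions.

import torch

def dynamic_reward_scaled_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               reward_margin, beta=0.1,
                               w_min=0.5, w_max=2.0):
    # Illustrative sketch only: per-sample weighting of a DPO-style loss
    # by a reward signal; hyperparameter names are assumptions.
    # logp_*        : policy log-probabilities of chosen / rejected responses
    # ref_logp_*    : reference-model log-probabilities of the same responses
    # reward_margin : reward-model score gap (chosen minus rejected) per pair

    # Standard DPO-style logits from policy-vs-reference log-prob ratios.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    per_pair_loss = -torch.nn.functional.logsigmoid(logits)

    # Per-sample weight: pairs with a larger reward margin (higher-confidence
    # preferences) contribute more, clamped for stability; the weight is
    # detached so it scales, but does not receive, gradients.
    weights = torch.clamp(1.0 + reward_margin, w_min, w_max).detach()
    return (weights * per_pair_loss).mean()

# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    n = 4
    loss = dynamic_reward_scaled_loss(
        logp_chosen=torch.randn(n), logp_rejected=torch.randn(n),
        ref_logp_chosen=torch.randn(n), ref_logp_rejected=torch.randn(n),
        reward_margin=torch.rand(n))
    print(loss.item())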

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25cs,
  title     = {{MM}-{RLHF}: The Next Step Forward in Multimodal {LLM} Alignment},
  author    = {Zhang, Yifan and Yu, Tao and Tian, Haochen and Fu, Chaoyou and Li, Peiyan and Zeng, Jianshu and Xie, Wulin and Shi, Yang and Zhang, Huanyu and Wu, Junkang and Wang, Xue and Hu, Yibo and Wen, Bin and Gao, Tingting and Zhang, Zhang and Yang, Fan and Zhang, Di and Wang, Liang and Jin, Rong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76625--76654},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25cs/zhang25cs.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25cs.html},
  abstract  = {Existing efforts to align multimodal large language models (MLLMs) with human preferences have only achieved progress in narrow areas, such as hallucination reduction, but remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure.1).}
}
Endnote
%0 Conference Paper
%T MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
%A Yifan Zhang
%A Tao Yu
%A Haochen Tian
%A Chaoyou Fu
%A Peiyan Li
%A Jianshu Zeng
%A Wulin Xie
%A Yang Shi
%A Huanyu Zhang
%A Junkang Wu
%A Xue Wang
%A Yibo Hu
%A Bin Wen
%A Tingting Gao
%A Zhang Zhang
%A Fan Yang
%A Di Zhang
%A Liang Wang
%A Rong Jin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25cs
%I PMLR
%P 76625--76654
%U https://proceedings.mlr.press/v267/zhang25cs.html
%V 267
%X Existing efforts to align multimodal large language models (MLLMs) with human preferences have only achieved progress in narrow areas, such as hallucination reduction, but remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure.1).
APA
Zhang, Y., Yu, T., Tian, H., Fu, C., Li, P., Zeng, J., Xie, W., Shi, Y., Zhang, H., Wu, J., Wang, X., Hu, Y., Wen, B., Gao, T., Zhang, Z., Yang, F., Zhang, D., Wang, L., & Jin, R. (2025). MM-RLHF: The Next Step Forward in Multimodal LLM Alignment. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76625-76654. Available from https://proceedings.mlr.press/v267/zhang25cs.html.