Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models

Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10428-10448, 2025.

Abstract

While foundation models have been exploited for various expert tasks via their fine-tuned parameters, any foundation model will eventually become outdated due to its old knowledge or limited capability, and thus should be replaced by a new foundation model. Subsequently, to benefit from its latest knowledge or improved capability, the new foundation model should be fine-tuned on each task again, which incurs not only the additional training cost but also the maintenance cost of the task-specific data. Existing work addresses this problem by inference-time tuning, i.e., modifying the output probabilities of the new foundation model using the outputs of the old foundation model and its fine-tuned model, which incurs additional inference cost from the latter two models. In this paper, we explore a new fine-tuning principle, Portable Reward Tuning (PRT), that inherently reduces this inference cost, based on a reformulation of fine-tuning as reward maximization with Kullback-Leibler regularization. Specifically, instead of fine-tuning the parameters of a foundation model, PRT explicitly trains a reward model through the same loss as in fine-tuning. During inference, the reward model can be combined with any foundation model (sharing the same vocabulary or label set) through the reward-maximization formulation. Experimental results on both vision and language models show that a PRT-trained model achieves accuracy comparable to existing inference-time tuning methods at a lower inference cost.
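
The inference-time use described above follows from the standard closed form of KL-regularized reward maximization: the tuned distribution is proportional to the base model's probabilities reweighted by the exponentiated reward, i.e. pi(y|x) proportional to p_base(y|x) * exp(r(x, y) / beta). The sketch below is a minimal, hypothetical illustration of that combination in Python/NumPy; the names (combine_with_reward, base_log_probs, reward_scores) and the temperature beta are assumptions for exposition, not the paper's implementation.

import numpy as np

def combine_with_reward(base_log_probs, reward_scores, beta=1.0):
    # Hypothetical sketch: reweight a base model's distribution over a shared
    # vocabulary (or label set) by an exponentiated reward, i.e. the closed-form
    # solution of KL-regularized reward maximization:
    #     pi(y|x) proportional to p_base(y|x) * exp(r(x, y) / beta)
    # Both inputs are arrays of shape (vocab_size,).
    logits = base_log_probs + reward_scores / beta
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Illustrative usage with random numbers standing in for real model outputs.
rng = np.random.default_rng(0)
vocab_size = 8
base_log_probs = np.log(rng.dirichlet(np.ones(vocab_size)))  # base model's p(y|x)
reward_scores = rng.normal(size=vocab_size)                  # reward model's r(x, y)
print(combine_with_reward(base_log_probs, reward_scores, beta=0.5))

Under this formulation, only the new foundation model and the reward model need to be evaluated at inference, avoiding the extra forward passes through the old foundation model and its fine-tuned counterpart that inference-time tuning requires.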

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chijiwa25a,
  title     = {Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models},
  author    = {Chijiwa, Daiki and Hasegawa, Taku and Nishida, Kyosuke and Saito, Kuniko and Takeuchi, Susumu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10428--10448},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chijiwa25a/chijiwa25a.pdf},
  url       = {https://proceedings.mlr.press/v267/chijiwa25a.html}
}
Endnote
%0 Conference Paper
%T Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models
%A Daiki Chijiwa
%A Taku Hasegawa
%A Kyosuke Nishida
%A Kuniko Saito
%A Susumu Takeuchi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chijiwa25a
%I PMLR
%P 10428--10448
%U https://proceedings.mlr.press/v267/chijiwa25a.html
%V 267
APA
Chijiwa, D., Hasegawa, T., Nishida, K., Saito, K. & Takeuchi, S. (2025). Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10428-10448. Available from https://proceedings.mlr.press/v267/chijiwa25a.html.