Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:50931-50943, 2024.

Abstract

With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model’s parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.
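
As a rough illustration of two ideas named in the abstract, the sketch below shows what "residual-style" prediction refinement and a prediction-consistency term could look like. It is not the authors' implementation (see the linked repository for that). The linear correction `weights`, the scale `alpha`, and the symmetric-KL form of the consistency term are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_residual(black_box_logits, weights, alpha=0.5):
    """Residual-style refinement (illustrative assumption).

    The black-box logits are kept as-is and a scaled, learned correction
    is added on top. `weights` is a hypothetical C x C linear map and
    `alpha` a hypothetical residual scale.
    """
    correction = [
        sum(w * x for w, x in zip(row, black_box_logits)) for row in weights
    ]
    return [x + alpha * c for x, c in zip(black_box_logits, correction)]


def consistency_penalty(p, q, eps=1e-12):
    """Symmetric KL divergence between two prediction distributions.

    One possible form of a prediction-consistency term (assumption):
    it is zero when both modules predict the same distribution.
    """
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Logits as they might be returned by a black-box model for 3 classes.
logits = [2.0, 0.5, -1.0]

# With an all-zero correction, the refined logits equal the originals.
zero_weights = [[0.0] * 3 for _ in range(3)]
refined = refine_residual(logits, zero_weights)

# A non-zero (hypothetical) correction shifts the prediction.
shift_weights = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
shifted = refine_residual(logits, shift_weights)

penalty_same = consistency_penalty(softmax(logits), softmax(refined))
penalty_diff = consistency_penalty(softmax(logits), softmax(shifted))
```

The residual form means an untrained (zero) correction leaves the black-box predictions unchanged, so refinement starts from the model's own output rather than from scratch.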

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-wang24ao,
  title = {Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models},
  author = {Wang, Zhengbo and Liang, Jian and He, Ran and Wang, Zilei and Tan, Tieniu},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {50931--50943},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wang24ao/wang24ao.pdf},
  url = {https://proceedings.mlr.press/v235/wang24ao.html},
  abstract = {With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model’s parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12\% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62\% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.}
}
Endnote
%0 Conference Paper
%T Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models
%A Zhengbo Wang
%A Jian Liang
%A Ran He
%A Zilei Wang
%A Tieniu Tan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wang24ao
%I PMLR
%P 50931--50943
%U https://proceedings.mlr.press/v235/wang24ao.html
%V 235
%X With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model’s parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.
APA
Wang, Z., Liang, J., He, R., Wang, Z., & Tan, T. (2024). Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:50931-50943. Available from https://proceedings.mlr.press/v235/wang24ao.html.

Related Material