LLaMA32-Med: Parameter-Efficient Adaptation of Multimodal LLMs for Medical Visual Question Answering

Wanqi Dong, Jingze Ge, Wanyue Dong, Mehul Motani
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4042-4056, 2026.

Abstract

Artificial intelligence has shown great promise in healthcare, particularly in diagnostic support. While healthcare data is inherently multimodal, existing models struggle to fully leverage diverse clinical data, e.g., images and text. Although recent Multimodal Large Language Models (MLLMs) exhibit strong potential, their performance in medical scenarios is constrained by training on general-domain data and the high computational cost of full-parameter adaptation. In this work, we present a two-stage lightweight adaptation framework for fine-tuning general-purpose MLLMs on medical multimodal tasks. Building on the LLaMA 3.2 Vision-Instruct model, we adopt parameter-efficient fine-tuning techniques that update less than 2% of the model parameters. This enables the injection of domain-specific medical knowledge while requiring approximately 20 GB of GPU memory. Furthermore, we design task-specific and role-based prompting strategies to better guide medical visual understanding tasks. Experimental results show that our approach achieves performance comparable to or surpassing state-of-the-art methods while significantly outperforming the original general-domain model. Comparative evaluations with recent MLLMs highlight the strong adaptability of the LLaMA 3.2 Vision-Instruct backbone, validating its effectiveness as a foundation for practical multimodal medical AI systems.

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-dong26a,
  title     = {LLaMA32-Med: Parameter-Efficient Adaptation of Multimodal LLMs for Medical Visual Question Answering},
  author    = {Dong, Wanqi and Ge, Jingze and Dong, Wanyue and Motani, Mehul},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {4042--4056},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/dong26a/dong26a.pdf},
  url       = {https://proceedings.mlr.press/v315/dong26a.html},
  abstract  = {Artificial intelligence has shown great promise in healthcare, particularly in diagnostic support. While healthcare data is inherently multimodal, existing models struggle to fully leverage diverse clinical data, e.g., images and text. Although recent Multimodal Large Language Models (MLLMs) exhibit strong potential, their performance in medical scenarios is constrained by training on general-domain data and the high computational cost of full-parameter adaptation. In this work, we present a two-stage lightweight adaptation framework for fine-tuning general-purpose MLLMs on medical multimodal tasks. Building on the LLaMA 3.2 Vision-Instruct model, we adopt parameter-efficient fine-tuning techniques that update less than 2% of the model parameters. This enables the injection of domain-specific medical knowledge while requiring approximately 20 GB of GPU memory. Furthermore, we design task-specific and role-based prompting strategies to better guide medical visual understanding tasks. Experimental results show that our approach achieves performance comparable to or surpassing state-of-the-art methods while significantly outperforming the original general-domain model. Comparative evaluations with recent MLLMs highlight the strong adaptability of the LLaMA 3.2 Vision-Instruct backbone, validating its effectiveness as a foundation for practical multimodal medical AI systems.}
}
Endnote
%0 Conference Paper
%T LLaMA32-Med: Parameter-Efficient Adaptation of Multimodal LLMs for Medical Visual Question Answering
%A Wanqi Dong
%A Jingze Ge
%A Wanyue Dong
%A Mehul Motani
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-dong26a
%I PMLR
%P 4042--4056
%U https://proceedings.mlr.press/v315/dong26a.html
%V 315
%X Artificial intelligence has shown great promise in healthcare, particularly in diagnostic support. While healthcare data is inherently multimodal, existing models struggle to fully leverage diverse clinical data, e.g., images and text. Although recent Multimodal Large Language Models (MLLMs) exhibit strong potential, their performance in medical scenarios is constrained by training on general-domain data and the high computational cost of full-parameter adaptation. In this work, we present a two-stage lightweight adaptation framework for fine-tuning general-purpose MLLMs on medical multimodal tasks. Building on the LLaMA 3.2 Vision-Instruct model, we adopt parameter-efficient fine-tuning techniques that update less than 2% of the model parameters. This enables the injection of domain-specific medical knowledge while requiring approximately 20 GB of GPU memory. Furthermore, we design task-specific and role-based prompting strategies to better guide medical visual understanding tasks. Experimental results show that our approach achieves performance comparable to or surpassing state-of-the-art methods while significantly outperforming the original general-domain model. Comparative evaluations with recent MLLMs highlight the strong adaptability of the LLaMA 3.2 Vision-Instruct backbone, validating its effectiveness as a foundation for practical multimodal medical AI systems.
APA
Dong, W., Ge, J., Dong, W., & Motani, M. (2026). LLaMA32-Med: Parameter-Efficient Adaptation of Multimodal LLMs for Medical Visual Question Answering. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4042-4056. Available from https://proceedings.mlr.press/v315/dong26a.html.
