Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection

Peipeng Yu; Jianwei Fei; Hui Gao; Xuan Feng; Zhihua Xia; Chip Hong Chang

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:72925-72943, 2025.

Abstract

Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs’ potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
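
The correlation step attributed to the KFD can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity as the correlation measure, a two-way softmax with a temperature of 0.07, and mean pooling for the image-level score, none of which are stated in the abstract. The function names (cosine, forgery_map, image_score) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def forgery_map(patch_feats, pristine_emb, deepfake_emb, temperature=0.07):
    """Per-patch probability of 'deepfake' (assumed form of localization).

    Each patch feature is compared with the pristine and deepfake
    description embeddings; a two-way softmax over the two scaled
    similarities gives the probability that the patch is forged.
    """
    probs = []
    for feat in patch_feats:
        s_real = cosine(feat, pristine_emb) / temperature
        s_fake = cosine(feat, deepfake_emb) / temperature
        # A two-way softmax reduces to a logistic of the score difference.
        probs.append(1.0 / (1.0 + math.exp(s_real - s_fake)))
    return probs


def image_score(probs):
    """Image-level forgery score; mean pooling is an assumption."""
    return sum(probs) / len(probs)


# Toy example: two 2-D patch features, one aligned with each description.
patches = [[1.0, 0.0], [0.0, 1.0]]
pristine = [1.0, 0.0]
deepfake = [0.0, 1.0]

probs = forgery_map(patches, pristine, deepfake)
score = image_score(probs)
```

In this toy case the first patch aligns with the pristine embedding and the second with the deepfake embedding, so the map marks only the second patch as forged. In the framework described above, such outputs would then be passed to the Forgery Prompt Learner rather than used directly.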

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-yu25d,
  title = 	 {Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection},
  author =       {Yu, Peipeng and Fei, Jianwei and Gao, Hui and Feng, Xuan and Xia, Zhihua and Chang, Chip Hong},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {72925--72943},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yu25d/yu25d.pdf},
  url = 	 {https://proceedings.mlr.press/v267/yu25d.html},
  abstract = 	 {Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs’ potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.}
}

Endnote

%0 Conference Paper
%T Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection
%A Peipeng Yu
%A Jianwei Fei
%A Hui Gao
%A Xuan Feng
%A Zhihua Xia
%A Chip Hong Chang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yu25d
%I PMLR
%P 72925--72943
%U https://proceedings.mlr.press/v267/yu25d.html
%V 267
%X Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs’ potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.

APA

Yu, P., Fei, J., Gao, H., Feng, X., Xia, Z. & Chang, C.H. (2025). Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:72925-72943. Available from https://proceedings.mlr.press/v267/yu25d.html.
