Gemini Pro Defeated by GPT-4V: Evidence from Education

Gyeonggeon Lee, Lehong Shi, Ehsan Latif, Xiaoming Zhai
Proceedings of Large Foundation Models for Educational Assessment, PMLR 264:33-60, 2025.

Abstract

This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Employing visual question-answering (VQA) techniques, the study examined both models’ ability to read text-based rubrics and automatically score student-drawn models in science education. We employed quantitative and qualitative analyses using a dataset derived from student-drawn scientific models and NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting methods. The findings reveal that GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and quadratic weighted kappa. The qualitative analysis shows that the differences may be due to the models’ ability to process fine-grained text in images and overall image classification performance. Even after adapting the NERIF approach by further downsizing the input images, Gemini Pro seems unable to perform as well as GPT-4V. The findings suggest GPT-4V’s superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V’s higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.
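Quadratic weighted kappa (QWK), one of the agreement metrics reported in the abstract, penalizes disagreements between human and model scores by the squared distance between rubric levels. A minimal sketch of how QWK can be computed with scikit-learn's `cohen_kappa_score` is shown below; the score values are illustrative, not data from the paper.

```python
# Hypothetical example: quadratic weighted kappa (QWK) between
# human-assigned and model-assigned rubric scores. These scores are
# made up for illustration and are NOT taken from the study's dataset.
from sklearn.metrics import cohen_kappa_score

human_scores = [0, 1, 2, 2, 1, 0, 2, 1]  # e.g., rubric levels 0-2
model_scores = [0, 1, 2, 1, 1, 0, 2, 2]  # automatic scores from a model

# weights="quadratic" applies the (i - j)^2 disagreement penalty
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(round(qwk, 3))
```

Because the penalty grows quadratically, confusing adjacent rubric levels (e.g., 1 vs. 2) costs far less than confusing the extremes (0 vs. 2), which is why QWK is a common choice for ordinal scoring tasks like rubric-based assessment.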

Cite this Paper


BibTeX
@InProceedings{pmlr-v264-lee25a,
  title     = {Gemini Pro Defeated by GPT-4V: Evidence from Education},
  author    = {Lee, Gyeonggeon and Shi, Lehong and Latif, Ehsan and Zhai, Xiaoming},
  booktitle = {Proceedings of Large Foundation Models for Educational Assessment},
  pages     = {33--60},
  year      = {2025},
  editor    = {Li, Sheng and Cui, Zhongmin and Lu, Jiasen and Harris, Deborah and Jing, Shumin},
  volume    = {264},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v264/main/assets/lee25a/lee25a.pdf},
  url       = {https://proceedings.mlr.press/v264/lee25a.html},
  abstract  = {This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Employing visual question-answering (VQA) techniques, the study examined both models’ ability to read text-based rubrics and automatically score student-drawn models in science education. We employed quantitative and qualitative analyses using a dataset derived from student-drawn scientific models and NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting methods. The findings reveal that GPT-4V significantly outperforms Gemini Pro regarding scoring accuracy and quadratic weighted kappa. The qualitative analysis shows that the differences may be due to the models’ ability to process fine-grained texts in images and overall image classification performance. Even adapting the NERIF approach by further de-sizing the input images, Gemini Pro seems unable to perform as well as GPT-4V. The findings suggest GPT-4V’s superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V’s higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.}
}
Endnote
%0 Conference Paper
%T Gemini Pro Defeated by GPT-4V: Evidence from Education
%A Gyeonggeon Lee
%A Lehong Shi
%A Ehsan Latif
%A Xiaoming Zhai
%B Proceedings of Large Foundation Models for Educational Assessment
%C Proceedings of Machine Learning Research
%D 2025
%E Sheng Li
%E Zhongmin Cui
%E Jiasen Lu
%E Deborah Harris
%E Shumin Jing
%F pmlr-v264-lee25a
%I PMLR
%P 33--60
%U https://proceedings.mlr.press/v264/lee25a.html
%V 264
%X This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Employing visual question-answering (VQA) techniques, the study examined both models’ ability to read text-based rubrics and automatically score student-drawn models in science education. We employed quantitative and qualitative analyses using a dataset derived from student-drawn scientific models and NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting methods. The findings reveal that GPT-4V significantly outperforms Gemini Pro regarding scoring accuracy and quadratic weighted kappa. The qualitative analysis shows that the differences may be due to the models’ ability to process fine-grained texts in images and overall image classification performance. Even adapting the NERIF approach by further de-sizing the input images, Gemini Pro seems unable to perform as well as GPT-4V. The findings suggest GPT-4V’s superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V’s higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.
APA
Lee, G., Shi, L., Latif, E. & Zhai, X. (2025). Gemini Pro Defeated by GPT-4V: Evidence from Education. Proceedings of Large Foundation Models for Educational Assessment, in Proceedings of Machine Learning Research 264:33-60. Available from https://proceedings.mlr.press/v264/lee25a.html.