Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering

Rujun Gao, Xiaosu Guo, Xiaodi Li, Arun Balajiee Lekshmi Narayanan, Naveen Thomas, Arun R. Srinivasa
Proceedings of Large Foundation Models for Educational Assessment, PMLR 264:186-206, 2025.

Abstract

This study explores the feasibility of using large language models (LLMs), specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in an undergraduate Mechanical Engineering course. We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University, each answered by approximately 225 students. Both the LLM and TAs followed the same instructor-provided rubric to ensure grading consistency. We evaluated performance using Spearman’s rank correlation coefficient and Root Mean Square Error (RMSE) to assess the alignment between rankings and the accuracy of scores assigned by GPT-4o and TAs under zero- and few-shot grading settings. In the zero-shot setting, GPT-4o demonstrated a strong correlation with TA grading, with Spearman’s rank correlation coefficient exceeding 0.6 in seven out of ten datasets and reaching a high of 0.9387. Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers, particularly those involving synonyms not present in the rubric. The model also tends to grade more stringently in ambiguous cases compared to human TAs. Overall, ChatGPT shows promise as a tool for grading conceptual questions, offering scalability and consistency.
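As a minimal sketch of the evaluation pipeline the abstract describes (not the authors' released code), the snippet below grades student answers with GPT-4o in a zero-shot setting against an instructor rubric and compares the resulting scores to TA scores using Spearman's rank correlation and RMSE. The prompt wording, the grade_response helper, and the 0–10 score scale are illustrative assumptions.

```python
# Illustrative sketch of the paper's evaluation loop (not the authors' code).
# Assumptions: OpenAI Python SDK v1.x, a 0-10 score scale, and prompt wording.
import re
import numpy as np
from openai import OpenAI          # pip install openai
from scipy.stats import spearmanr  # pip install scipy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_response(question: str, rubric: str, answer: str) -> float:
    """Zero-shot grading: GPT-4o scores one student answer against the rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep grading as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a grader for an undergraduate Mechanical "
                        "Engineering course. Apply the rubric strictly and "
                        "reply with a single numeric score from 0 to 10."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nRubric:\n{rubric}\n\n"
                        f"Student answer:\n{answer}\n\nScore:"},
        ],
    )
    # Pull the first number out of the model's reply.
    match = re.search(r"\d+(?:\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else 0.0

def compare_to_tas(llm_scores, ta_scores):
    """Alignment metrics used in the paper: Spearman's rho and RMSE."""
    llm = np.asarray(llm_scores, dtype=float)
    ta = np.asarray(ta_scores, dtype=float)
    rho, _ = spearmanr(llm, ta)                      # rank agreement
    rmse = float(np.sqrt(np.mean((llm - ta) ** 2)))  # score accuracy
    return rho, rmse
```

A few-shot variant would simply prepend a handful of TA-graded example answers to the user message; per the abstract, rank agreement above 0.6 on seven of the ten problems was reached even without them.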

Cite this Paper


BibTeX
@InProceedings{pmlr-v264-gao25a,
  title = {Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering},
  author = {Gao, Rujun and Guo, Xiaosu and Li, Xiaodi and Narayanan, Arun Balajiee Lekshmi and Thomas, Naveen and Srinivasa, Arun R.},
  booktitle = {Proceedings of Large Foundation Models for Educational Assessment},
  pages = {186--206},
  year = {2025},
  editor = {Li, Sheng and Cui, Zhongmin and Lu, Jiasen and Harris, Deborah and Jing, Shumin},
  volume = {264},
  series = {Proceedings of Machine Learning Research},
  month = {15--16 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v264/main/assets/gao25a/gao25a.pdf},
  url = {https://proceedings.mlr.press/v264/gao25a.html},
  abstract = {This study explores the feasibility of using large language models (LLMs), specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in an undergraduate Mechanical Engineering course. We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University, each answered by approximately 225 students. Both the LLM and TAs followed the same instructor-provided rubric to ensure grading consistency. We evaluated performance using Spearman’s rank correlation coefficient and Root Mean Square Error (RMSE) to assess the alignment between rankings and the accuracy of scores assigned by GPT-4o and TAs under zero- and few-shot grading settings. In the zero-shot setting, GPT-4o demonstrated a strong correlation with TA grading, with Spearman’s rank correlation coefficient exceeding 0.6 in seven out of ten datasets and reaching a high of 0.9387. Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers, particularly those involving synonyms not present in the rubric. The model also tends to grade more stringently in ambiguous cases compared to human TAs. Overall, ChatGPT shows promise as a tool for grading conceptual questions, offering scalability and consistency.}
}
Endnote
%0 Conference Paper
%T Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering
%A Rujun Gao
%A Xiaosu Guo
%A Xiaodi Li
%A Arun Balajiee Lekshmi Narayanan
%A Naveen Thomas
%A Arun R. Srinivasa
%B Proceedings of Large Foundation Models for Educational Assessment
%C Proceedings of Machine Learning Research
%D 2025
%E Sheng Li
%E Zhongmin Cui
%E Jiasen Lu
%E Deborah Harris
%E Shumin Jing
%F pmlr-v264-gao25a
%I PMLR
%P 186--206
%U https://proceedings.mlr.press/v264/gao25a.html
%V 264
%X This study explores the feasibility of using large language models (LLMs), specifically GPT-4o (ChatGPT), for automated grading of conceptual questions in an undergraduate Mechanical Engineering course. We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University, each answered by approximately 225 students. Both the LLM and TAs followed the same instructor-provided rubric to ensure grading consistency. We evaluated performance using Spearman’s rank correlation coefficient and Root Mean Square Error (RMSE) to assess the alignment between rankings and the accuracy of scores assigned by GPT-4o and TAs under zero- and few-shot grading settings. In the zero-shot setting, GPT-4o demonstrated a strong correlation with TA grading, with Spearman’s rank correlation coefficient exceeding 0.6 in seven out of ten datasets and reaching a high of 0.9387. Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers, particularly those involving synonyms not present in the rubric. The model also tends to grade more stringently in ambiguous cases compared to human TAs. Overall, ChatGPT shows promise as a tool for grading conceptual questions, offering scalability and consistency.
APA
Gao, R., Guo, X., Li, X., Narayanan, A.B.L., Thomas, N. & Srinivasa, A.R. (2025). Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering. Proceedings of Large Foundation Models for Educational Assessment, in Proceedings of Machine Learning Research 264:186-206. Available from https://proceedings.mlr.press/v264/gao25a.html.