Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving

Priscylla Silva, Evandro Costa
Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, PMLR 273:116-124, 2025.

Abstract

Providing effective feedback is important for student learning in programming problem-solving. Large Language Models (LLMs) have emerged as potential tools for automating feedback generation, but their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models' capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight both the potential and the limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.
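The paper does not reproduce its prompts, so the sketch below is only an illustration of how feedback of this kind might be requested programmatically. The prompt wording, the example problem, the student code, and the use of the OpenAI Python SDK are assumptions; GPT-4o stands in for any of the four evaluated models.

```python
# Minimal sketch (not the authors' pipeline): asking an LLM for a feedback hint
# on a student's solution. Requires the `openai` package and an OPENAI_API_KEY
# set in the environment. All prompt text below is hypothetical.
from openai import OpenAI

client = OpenAI()

problem_statement = "Read n integers and print their average."  # hypothetical task
student_code = """\
n = int(input())
total = 0
for i in range(n):
    total = int(input())   # reasoning error: overwrites instead of accumulating
print(total / n)
"""

response = client.chat.completions.create(
    model="gpt-4o",  # one of the four models evaluated in the paper
    messages=[
        {
            "role": "system",
            "content": (
                "You are a programming tutor. Point out reasoning errors in the "
                "student's code, citing line numbers, without revealing the full solution."
            ),
        },
        {
            "role": "user",
            "content": f"Problem:\n{problem_statement}\n\nStudent code:\n{student_code}",
        },
    ],
)

# The returned text is the feedback hint that would be shown to the student.
print(response.choices[0].message.content)
```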

Cite this Paper


BibTeX
@InProceedings{pmlr-v273-silva25a,
  title     = {Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving},
  author    = {Silva, Priscylla and Costa, Evandro},
  booktitle = {Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop},
  pages     = {116--124},
  year      = {2025},
  editor    = {Wang, Zichao and Woodhead, Simon and Ananda, Muktha and Mallick, Debshila Basu and Sharpnack, James and Burstein, Jill},
  volume    = {273},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v273/main/assets/silva25a/silva25a.pdf},
  url       = {https://proceedings.mlr.press/v273/silva25a.html}
}
Endnote
%0 Conference Paper
%T Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
%A Priscylla Silva
%A Evandro Costa
%B Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop
%C Proceedings of Machine Learning Research
%D 2025
%E Zichao Wang
%E Simon Woodhead
%E Muktha Ananda
%E Debshila Basu Mallick
%E James Sharpnack
%E Jill Burstein
%F pmlr-v273-silva25a
%I PMLR
%P 116--124
%U https://proceedings.mlr.press/v273/silva25a.html
%V 273
APA
Silva, P. & Costa, E. (2025). Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving. Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, in Proceedings of Machine Learning Research 273:116-124. Available from https://proceedings.mlr.press/v273/silva25a.html.