Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs

Elisabetta Mazzullo, Okan Bulut
Proceedings of Large Foundation Models for Educational Assessment, PMLR 264:103-120, 2025.

Abstract

Timely, personalized, and actionable feedback is essential for effective learning but challenging to deliver at scale. Automated feedback generation (AFG) using large language models (LLMs) can be a promising solution to address this challenge. While existing studies using out-of-the-box LLMs and prompting strategies have shown promise, there is room for improvement. This study investigates the fine-tuning of OpenAI’s GPT-3.5-turbo for AFG. We developed feedback for open-ended situational judgment questions, and this small set of hand-crafted feedback examples was used to fine-tune the pre-trained LLM using specific prompting strategies. Our evaluation, conducted by independent judges and test experts, found that the feedback generated by our fine-tuned GPT-3.5-turbo model achieved high user satisfaction (84.8%) and met key structural quality criteria (72.9%). Also, the model generalized effectively across different items, providing feedback consistent with instructions, regardless of the respondent’s performance level, English proficiency, or student status. However, some feedback statements still contained linguistic errors, lacked focused suggestions, or seemed generic. We discuss potential solutions to these issues, along with implications for developing LLM-supported AFG systems and their adoption in high-stakes settings.

Cite this Paper


BibTeX
@InProceedings{pmlr-v264-mazzullo25a,
  title = {Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs},
  author = {Mazzullo, Elisabetta and Bulut, Okan},
  booktitle = {Proceedings of Large Foundation Models for Educational Assessment},
  pages = {103--120},
  year = {2025},
  editor = {Li, Sheng and Cui, Zhongmin and Lu, Jiasen and Harris, Deborah and Jing, Shumin},
  volume = {264},
  series = {Proceedings of Machine Learning Research},
  month = {15--16 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v264/main/assets/mazzullo25a/mazzullo25a.pdf},
  url = {https://proceedings.mlr.press/v264/mazzullo25a.html},
  abstract = {Timely, personalized, and actionable feedback is essential for effective learning but challenging to deliver at scale. Automated feedback generation (AFG) using large language models (LLMs) can be a promising solution to address this challenge. While existing studies using out-of-the-box LLMs and prompting strategies have shown promise, there is room for improvement. This study investigates the fine-tuning of OpenAI’s GPT-3.5-turbo for AFG. We developed feedback for open-ended situational judgment questions, and this small set of hand-crafted feedback examples was used to fine-tune the pre-trained LLM using specific prompting strategies. Our evaluation, conducted by independent judges and test experts, found that the feedback generated by our fine-tuned GPT-3.5-turbo model achieved high user satisfaction (84.8%) and met key structural quality criteria (72.9%). Also, the model generalized effectively across different items, providing feedback consistent with instructions, regardless of the respondent’s performance level, English proficiency, or student status. However, some feedback statements still contained linguistic errors, lacked focused suggestions, or seemed generic. We discuss potential solutions to these issues, along with implications for developing LLM-supported AFG systems and their adoption in high-stakes settings.}
}
Endnote
%0 Conference Paper
%T Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs
%A Elisabetta Mazzullo
%A Okan Bulut
%B Proceedings of Large Foundation Models for Educational Assessment
%C Proceedings of Machine Learning Research
%D 2025
%E Sheng Li
%E Zhongmin Cui
%E Jiasen Lu
%E Deborah Harris
%E Shumin Jing
%F pmlr-v264-mazzullo25a
%I PMLR
%P 103--120
%U https://proceedings.mlr.press/v264/mazzullo25a.html
%V 264
%X Timely, personalized, and actionable feedback is essential for effective learning but challenging to deliver at scale. Automated feedback generation (AFG) using large language models (LLMs) can be a promising solution to address this challenge. While existing studies using out-of-the-box LLMs and prompting strategies have shown promise, there is room for improvement. This study investigates the fine-tuning of OpenAI’s GPT-3.5-turbo for AFG. We developed feedback for open-ended situational judgment questions, and this small set of hand-crafted feedback examples was used to fine-tune the pre-trained LLM using specific prompting strategies. Our evaluation, conducted by independent judges and test experts, found that the feedback generated by our fine-tuned GPT-3.5-turbo model achieved high user satisfaction (84.8%) and met key structural quality criteria (72.9%). Also, the model generalized effectively across different items, providing feedback consistent with instructions, regardless of the respondent’s performance level, English proficiency, or student status. However, some feedback statements still contained linguistic errors, lacked focused suggestions, or seemed generic. We discuss potential solutions to these issues, along with implications for developing LLM-supported AFG systems and their adoption in high-stakes settings.
APA
Mazzullo, E. & Bulut, O. (2025). Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs. Proceedings of Large Foundation Models for Educational Assessment, in Proceedings of Machine Learning Research 264:103-120. Available from https://proceedings.mlr.press/v264/mazzullo25a.html.