Classroom Observation Evaluation with Large Language Models

Jiani Wang, Kamil Hankour, Yuqi Zhang, Jennifer LoCasale-Crouch, Jacob Whitehill
Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, PMLR 273:83-93, 2025.

Abstract

To improve the efficiency of evaluating classroom Instructional Support (IS) and enhance the reliability of IS score evaluation, we proposed a novel annotation protocol based on classroom discourse types and a framework that employs large language models (LLMs) as its core component to estimate IS scores automatically. We constructed the SentTag dataset, annotated according to the proposed protocol; the Fleiss’ Kappa among all annotators was 0.7120. Additionally, Llama 3.1 models were fine-tuned on this dataset, achieving an accuracy of 0.7864 in classifying discourse types. While these features were not able to predict IS scores accurately (RMSE = 2.6584 and PCC = 0.1197), they could potentially serve as useful qualitative feedback to teachers on their classroom discourse. Future research will explore the integration of multimodal features, local session characteristics, and the generalization of the framework to diverse classroom settings.
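The abstract reports three standard metrics: Fleiss’ Kappa (inter-annotator agreement), RMSE, and the Pearson correlation coefficient (PCC) for score prediction. As a point of reference for readers, below is a minimal Python sketch of these textbook formulas; it is illustrative only and is not the authors’ code, and the toy inputs are invented.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between true and predicted scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pcc(x, y):
    """Pearson correlation coefficient between two score sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fleiss_kappa(counts):
    """Fleiss' Kappa for a subjects-by-categories matrix of rating counts,
    where each row sums to the (fixed) number of raters n."""
    N = len(counts)                       # number of subjects
    n = sum(counts[0])                    # raters per subject
    total = N * n
    k = len(counts[0])                    # number of categories
    # Category proportions across all ratings
    p_j = [sum(row[j] for row in counts) / total for j in range(k)]
    # Per-subject agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

For example, `fleiss_kappa([[3, 0], [0, 3]])` (two items, three raters, perfect agreement) returns 1.0, while chance-level agreement drives the statistic toward 0.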

Cite this Paper


BibTeX
@InProceedings{pmlr-v273-wang25a,
  title     = {Classroom Observation Evaluation with Large Language Models},
  author    = {Wang, Jiani and Hankour, Kamil and Zhang, Yuqi and LoCasale-Crouch, Jennifer and Whitehill, Jacob},
  booktitle = {Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop},
  pages     = {83--93},
  year      = {2025},
  editor    = {Wang, Zichao and Woodhead, Simon and Ananda, Muktha and Mallick, Debshila Basu and Sharpnack, James and Burstein, Jill},
  volume    = {273},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v273/main/assets/wang25a/wang25a.pdf},
  url       = {https://proceedings.mlr.press/v273/wang25a.html},
  abstract  = {To improve efficiency of evaluating classroom Instructional Support (IS) and enhance the reliability of the IS score evaluation system, we proposed a novel annotation protocol based on classroom discourse types and a framework which employed large language models (LLM) as the core component to estimate IS score automatically. We constructed the SentTag dataset which was annotated by the proposed annotation protocol. The Fleiss’ Kappa among all annotators was 0.7120. Additionally Llama 3.1 models were fine-tuned on this dataset, achieving an accuracy of 0.7864 in classifying discourse types. While these features were not able to predict IS scores accurately (RMSE = 2.6584 and PCC = 0.1197), they could potentially serve as useful qualitative feedback to teachers on their classroom discourse. Future research will explore the integration of multimodal features, local session characteristics, and the generalization of the framework to diverse classroom settings.}
}
Endnote
%0 Conference Paper
%T Classroom Observation Evaluation with Large Language Models
%A Jiani Wang
%A Kamil Hankour
%A Yuqi Zhang
%A Jennifer LoCasale-Crouch
%A Jacob Whitehill
%B Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop
%C Proceedings of Machine Learning Research
%D 2025
%E Zichao Wang
%E Simon Woodhead
%E Muktha Ananda
%E Debshila Basu Mallick
%E James Sharpnack
%E Jill Burstein
%F pmlr-v273-wang25a
%I PMLR
%P 83--93
%U https://proceedings.mlr.press/v273/wang25a.html
%V 273
%X To improve efficiency of evaluating classroom Instructional Support (IS) and enhance the reliability of the IS score evaluation system, we proposed a novel annotation protocol based on classroom discourse types and a framework which employed large language models (LLM) as the core component to estimate IS score automatically. We constructed the SentTag dataset which was annotated by the proposed annotation protocol. The Fleiss’ Kappa among all annotators was 0.7120. Additionally Llama 3.1 models were fine-tuned on this dataset, achieving an accuracy of 0.7864 in classifying discourse types. While these features were not able to predict IS scores accurately (RMSE = 2.6584 and PCC = 0.1197), they could potentially serve as useful qualitative feedback to teachers on their classroom discourse. Future research will explore the integration of multimodal features, local session characteristics, and the generalization of the framework to diverse classroom settings.
APA
Wang, J., Hankour, K., Zhang, Y., LoCasale-Crouch, J., & Whitehill, J. (2025). Classroom Observation Evaluation with Large Language Models. Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, in Proceedings of Machine Learning Research 273:83-93. Available from https://proceedings.mlr.press/v273/wang25a.html.