Classroom Observation Evaluation with Large Language Models
Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, PMLR 273:83-93, 2025.
Abstract
To improve the efficiency of evaluating classroom Instructional Support (IS) and to enhance the reliability of IS scoring, we proposed a novel annotation protocol based on classroom discourse types, together with a framework that employs large language models (LLMs) as its core component to estimate IS scores automatically. We constructed the SentTag dataset, annotated under the proposed protocol; the Fleiss' kappa among all annotators was 0.7120. We then fine-tuned Llama 3.1 models on this dataset, achieving an accuracy of 0.7864 in classifying discourse types. While the resulting discourse-type features did not predict IS scores accurately (RMSE = 2.6584, PCC = 0.1197), they could serve as useful qualitative feedback to teachers on their classroom discourse. Future research will explore the integration of multimodal features, local session characteristics, and the generalization of the framework to diverse classroom settings.
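For readers unfamiliar with the reported score-prediction metrics, the sketch below shows how RMSE and the Pearson correlation coefficient (PCC) between predicted and human-assigned IS scores could be computed. This is an illustrative, self-contained implementation, not the authors' evaluation code; the variable names and sample values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-squared error between reference and predicted IS scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pcc(x, y):
    """Pearson correlation coefficient between two score sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: human IS scores vs. model predictions
human_scores = [3.0, 5.0, 4.0, 6.0, 2.0]
model_scores = [4.0, 4.5, 3.0, 5.0, 4.0]
print(rmse(human_scores, model_scores))
print(pcc(human_scores, model_scores))
```

A high RMSE paired with a PCC near zero, as reported in the abstract, indicates that the predictions neither match the human scores in magnitude nor track their relative ordering.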