Evaluating Fairness in AI-Assisted Remote Proctoring

William Belzak, Jill Burstein, Alina A. von Davier
Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, PMLR 273:125-132, 2025.

Abstract

Remote proctors make decisions about whether test takers have violated testing rules and, as a result, whether to certify test takers’ scores. These decisions rely on both AI signals and human evaluation of test-taking behaviors. Given that fairness is a key component of test validity evidence, it is critical that proctors’ decisions be unbiased with respect to proctor and test-taker background characteristics (e.g., gender, age, and nationality). In this study, we empirically evaluate whether proctor or test-taker background characteristics affect whether a test taker is flagged for rule violations. Results suggest that proctor and test-taker nationality may influence proctoring decisions, whereas gender and age do not. The direction of the influence generally reflects an “in-group, out-group” bias: proctors are less likely to identify rule violations among test takers who share their nationality (in-group favoring) and more likely to identify rule violations among test takers of different nationalities (out-group disfavoring). Results also suggest that decisions based on AI signals may be less prone to in-group/out-group bias than decisions based on human evaluation alone, although more research is needed to support this finding.
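The paper’s analysis code is not reproduced here. As a rough illustration of the kind of test the abstract describes, the Python sketch below fits a logistic regression of the flagging decision on a proctor/test-taker nationality-match indicator, controlling for gender and age of both parties. All file and column names (proctoring_sessions.csv, flagged, same_nationality, etc.) are assumptions for illustration, not the authors’ data or model.

    # Hypothetical sketch, not the authors' analysis: one standard way to
    # probe an "in-group, out-group" effect is a logistic regression of the
    # flagging decision on a nationality-match indicator, controlling for
    # proctor and test-taker background characteristics.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed layout: one row per proctored session (file name hypothetical).
    df = pd.read_csv("proctoring_sessions.csv")

    # 1 if proctor and test taker share a nationality, else 0.
    df["same_nationality"] = (
        df["proctor_nationality"] == df["tt_nationality"]
    ).astype(int)

    # flagged is a 0/1 rule-violation indicator; covariate names assumed.
    model = smf.logit(
        "flagged ~ same_nationality + tt_gender + tt_age"
        " + proctor_gender + proctor_age",
        data=df,
    ).fit()
    print(model.summary())

    # A negative coefficient on same_nationality would be consistent with
    # in-group favoring (fewer flags for same-nationality test takers); a
    # positive one with out-group disfavoring.

Under these assumptions, comparing such a model fit separately to AI-assisted and human-only decisions would speak to the paper’s secondary finding that AI-signal-based decisions may be less prone to this bias.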

Cite this Paper


BibTeX
@InProceedings{pmlr-v273-belzak25a,
  title     = {Evaluating Fairness in AI-Assisted Remote Proctoring},
  author    = {Belzak, William and Burstein, Jill and von Davier, Alina A.},
  booktitle = {Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop},
  pages     = {125--132},
  year      = {2025},
  editor    = {Wang, Zichao and Woodhead, Simon and Ananda, Muktha and Mallick, Debshila Basu and Sharpnack, James and Burstein, Jill},
  volume    = {273},
  series    = {Proceedings of Machine Learning Research},
  month     = {03 Mar},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v273/main/assets/belzak25a/belzak25a.pdf},
  url       = {https://proceedings.mlr.press/v273/belzak25a.html}
}
Endnote
%0 Conference Paper
%T Evaluating Fairness in AI-Assisted Remote Proctoring
%A William Belzak
%A Jill Burstein
%A Alina A. von Davier
%B Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop
%C Proceedings of Machine Learning Research
%D 2025
%E Zichao Wang
%E Simon Woodhead
%E Muktha Ananda
%E Debshila Basu Mallick
%E James Sharpnack
%E Jill Burstein
%F pmlr-v273-belzak25a
%I PMLR
%P 125--132
%U https://proceedings.mlr.press/v273/belzak25a.html
%V 273
APA
Belzak, W., Burstein, J. & von Davier, A.A. (2025). Evaluating Fairness in AI-Assisted Remote Proctoring. Proceedings of the Innovation and Responsibility in AI-Supported Education Workshop, in Proceedings of Machine Learning Research 273:125-132. Available from https://proceedings.mlr.press/v273/belzak25a.html.