BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration

James Sharpnack, Kevin Hao, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey, Alina A. von Davier
Proceedings of Large Foundation Models for Educational Assessment, PMLR 264:121-135, 2025.

Abstract

In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration—learning item parameters in a test—is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in (Sharpnack et al., 2024). AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring.
For administration of our adaptive test, we propose BanditCAT, a method motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability (θ) from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about θ. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some validity, reliability, and exposure metrics for the 5 practice test experiments that utilized this framework.
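The BanditCAT administration loop described above can be sketched in a few lines under a standard 2PL IRT model: sample an ability from the current posterior (Thompson sampling), score each remaining item by its Fisher information at that sampled ability plus injected noise for exposure control, administer the argmax, and update the posterior with the response. This is a minimal illustrative sketch, not the paper's implementation — the Gaussian exposure noise, the discretized posterior, and all numeric settings (`noise_sd`, grid range, pool size, test length) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(7)

def p_correct(theta, a, b):
    """2PL item response function: P(correct | theta) for items (a, b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """Fisher information of each 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_item(available, a, b, post_mean, post_sd, noise_sd=0.05):
    """Thompson-sampling selection with a noisy Fisher-information reward."""
    theta_draw = rng.normal(post_mean, post_sd)             # sample ability from posterior
    reward = fisher_info(theta_draw, a[available], b[available])
    reward += rng.normal(0.0, noise_sd, size=reward.shape)  # exposure-control noise
    return available[int(np.argmax(reward))]

def update_posterior(grid, log_post, item_a, item_b, correct):
    """Bayesian update of a discretized ability posterior after one response."""
    p = p_correct(grid, item_a, item_b)
    log_post = log_post + np.log(p if correct else 1.0 - p)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = float((grid * post).sum())
    sd = float(np.sqrt(((grid - mean) ** 2 * post).sum()))
    return np.log(post), mean, sd

# Demo: administer a short adaptive test to one simulated test taker.
a = rng.uniform(0.5, 2.5, size=50)       # item discriminations
b = rng.normal(0.0, 1.0, size=50)        # item difficulties
true_theta = 1.0
grid = np.linspace(-4.0, 4.0, 161)
log_post = -0.5 * grid ** 2              # standard-normal prior (unnormalized)
mean, sd = 0.0, 1.0
remaining = np.arange(len(a))
for _ in range(10):
    i = select_item(remaining, a, b, mean, sd)
    remaining = remaining[remaining != i]
    correct = rng.random() < p_correct(true_theta, a[i], b[i])
    log_post, mean, sd = update_posterior(grid, log_post, a[i], b[i], correct)
print(round(mean, 2), round(sd, 2))      # posterior ability estimate and its SD
```

The posterior standard deviation shrinks as high-information items are administered near the sampled ability, while the injected noise keeps item selection from being deterministic across test takers.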

Cite this Paper


BibTeX
@InProceedings{pmlr-v264-sharpnack25a, title = {BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration}, author = {Sharpnack, James and Hao, Kevin and Mulcaire, Phoebe and Bicknell, Klinton and LaFlair, Geoff and Yancey, Kevin and von Davier, Alina A.}, booktitle = {Proceedings of Large Foundation Models for Educational Assessment}, pages = {121--135}, year = {2025}, editor = {Li, Sheng and Cui, Zhongmin and Lu, Jiasen and Harris, Deborah and Jing, Shumin}, volume = {264}, series = {Proceedings of Machine Learning Research}, month = {15--16 Dec}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v264/main/assets/sharpnack25a/sharpnack25a.pdf}, url = {https://proceedings.mlr.press/v264/sharpnack25a.html}, abstract = {In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration—learning item parameters in a test—is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in (Sharpnack et al., 2024). AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose BanditCAT, a method motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability ($\theta$) from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about $\theta$. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some validity, reliability, and exposure metrics for the 5 practice test experiments that utilized this framework.} }
Endnote
%0 Conference Paper %T BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration %A James Sharpnack %A Kevin Hao %A Phoebe Mulcaire %A Klinton Bicknell %A Geoff LaFlair %A Kevin Yancey %A Alina A. von Davier %B Proceedings of Large Foundation Models for Educational Assessment %C Proceedings of Machine Learning Research %D 2025 %E Sheng Li %E Zhongmin Cui %E Jiasen Lu %E Deborah Harris %E Shumin Jing %F pmlr-v264-sharpnack25a %I PMLR %P 121--135 %U https://proceedings.mlr.press/v264/sharpnack25a.html %V 264 %X In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration—learning item parameters in a test—is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in (Sharpnack et al., 2024). AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose BanditCAT, a method motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability ($\theta$) from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about $\theta$. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some validity, reliability, and exposure metrics for the 5 practice test experiments that utilized this framework.
APA
Sharpnack, J., Hao, K., Mulcaire, P., Bicknell, K., LaFlair, G., Yancey, K. & von Davier, A.A. (2025). BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration. Proceedings of Large Foundation Models for Educational Assessment, in Proceedings of Machine Learning Research 264:121-135. Available from https://proceedings.mlr.press/v264/sharpnack25a.html.
