Calibrated Large Language Models for Binary Question Answering

Patrizio Giovannotti, Alexander Gammerman
Proceedings of the Thirteenth Symposium on Conformal and Probabilistic Prediction with Applications, PMLR 230:218-235, 2024.

Abstract

Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
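To make the approach concrete, below is a minimal pure-Python sketch of an inductive Venn–Abers predictor. This is not the authors' implementation: the function names are illustrative, the isotonic fit is a bare-bones pool-adjacent-violators routine, and the final merge of (p0, p1) into a single probability uses the standard p1 / (1 - p0 + p1) formula from the Venn–Abers literature. In the paper's setting, `cal_scores` would be the model's scores for the label tokens on a held-out calibration set.

```python
def isotonic_fit(values):
    """Pool Adjacent Violators: nondecreasing least-squares fit to `values`."""
    stack = []  # blocks of (sum, weight)
    for v in values:
        stack.append([v, 1.0])
        # merge adjacent blocks while their means violate monotonicity
        while len(stack) > 1 and stack[-2][0] / stack[-2][1] > stack[-1][0] / stack[-1][1]:
            s, w = stack.pop()
            stack[-1][0] += s
            stack[-1][1] += w
    out = []
    for s, w in stack:
        out.extend([s / w] * int(w))
    return out

def ivap(cal_scores, cal_labels, test_score):
    """Inductive Venn-Abers: return (p0, p1) and a merged probability for one test score."""
    probs = {}
    for y in (0, 1):
        # tentatively label the test point y, then fit isotonic regression
        # on the calibration set plus the test point, sorted by score
        pairs = sorted(zip(cal_scores + [test_score], cal_labels + [y]))
        fitted = isotonic_fit([label for _, label in pairs])
        idx = [s for s, _ in pairs].index(test_score)
        probs[y] = fitted[idx]
    p0, p1 = probs[0], probs[1]
    return p0, p1, p1 / (1 - p0 + p1)
```

The pair (p0, p1) is the multiprobability output guaranteed to be well calibrated; the merged scalar is a convenient point estimate when a single probability is needed downstream.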

Cite this Paper


BibTeX
@InProceedings{pmlr-v230-giovannotti24a,
  title     = {Calibrated Large Language Models for Binary Question Answering},
  author    = {Giovannotti, Patrizio and Gammerman, Alexander},
  booktitle = {Proceedings of the Thirteenth Symposium on Conformal and Probabilistic Prediction with Applications},
  pages     = {218--235},
  year      = {2024},
  editor    = {Vantini, Simone and Fontana, Matteo and Solari, Aldo and Boström, Henrik and Carlsson, Lars},
  volume    = {230},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--11 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v230/main/assets/giovannotti24a/giovannotti24a.pdf},
  url       = {https://proceedings.mlr.press/v230/giovannotti24a.html},
  abstract  = {Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.}
}
Endnote
%0 Conference Paper
%T Calibrated Large Language Models for Binary Question Answering
%A Patrizio Giovannotti
%A Alexander Gammerman
%B Proceedings of the Thirteenth Symposium on Conformal and Probabilistic Prediction with Applications
%C Proceedings of Machine Learning Research
%D 2024
%E Simone Vantini
%E Matteo Fontana
%E Aldo Solari
%E Henrik Boström
%E Lars Carlsson
%F pmlr-v230-giovannotti24a
%I PMLR
%P 218--235
%U https://proceedings.mlr.press/v230/giovannotti24a.html
%V 230
%X Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model’s predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to the understanding of calibration techniques for LLMs and provide a practical solution for obtaining reliable uncertainty estimates in binary question answering tasks, enhancing the interpretability and trustworthiness of LLM predictions.
APA
Giovannotti, P. & Gammerman, A. (2024). Calibrated Large Language Models for Binary Question Answering. Proceedings of the Thirteenth Symposium on Conformal and Probabilistic Prediction with Applications, in Proceedings of Machine Learning Research 230:218-235. Available from https://proceedings.mlr.press/v230/giovannotti24a.html.