Risk Aware Benchmarking of Large Language Models

Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jarret Ross
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:38264-38297, 2024.

Abstract

We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
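The second-order stochastic dominance test described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`portfolio_scores`, `ssd_violation`, `bootstrap_std`), the uniform-weight portfolio, and the pooled evaluation grid are illustrative choices. The key fact used is that A second-order dominates B when the integrated empirical CDF of A's scores, F2(t) = E[(t - X)+], lies below B's at every threshold t; a bootstrap over resampled scores gives the variance estimate mentioned in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def portfolio_scores(metrics, weights):
    # metrics: (n_samples, n_metrics) per-output metric values (higher = better);
    # weights: a convex combination aggregating metrics into one portfolio score.
    return metrics @ weights

def integrated_cdf(x, t):
    # F2(t) = E[(t - X)_+], the integral of the empirical CDF up to t;
    # smaller values at every t mean less downside risk.
    return np.mean(np.maximum(t[None, :] - x[:, None], 0.0), axis=0)

def ssd_violation(a, b, grid):
    # Maximal violation of "A second-order dominates B": positive wherever
    # A's integrated CDF exceeds B's on the grid, ~0 when dominance holds.
    return float(np.max(integrated_cdf(a, grid) - integrated_cdf(b, grid)))

def bootstrap_std(a, b, n_boot=500):
    # Bootstrap standard deviation of the violation statistic, obtained by
    # resampling each model's portfolio scores with replacement.
    grid = np.sort(np.concatenate([a, b]))
    stats = [
        ssd_violation(rng.choice(a, len(a)), rng.choice(b, len(b)), grid)
        for _ in range(n_boot)
    ]
    return float(np.std(stats))

# Toy example: model A's portfolio scores have the same mean as model B's
# but lower variance (a mean-preserving contraction), so A should nearly
# second-order dominate B, while B clearly violates dominance over A.
a = rng.normal(0.0, 0.5, size=2000)
b = rng.normal(0.0, 1.0, size=2000)
grid = np.sort(np.concatenate([a, b]))
viol_ab = ssd_violation(a, b, grid)  # expected to be near zero
viol_ba = ssd_violation(b, a, grid)  # expected to be clearly positive
```

Pairing each observed violation with its bootstrap standard deviation gives a rough sense of significance; the paper's actual tests rest on the central limit theorems developed there rather than this heuristic.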

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-nitsure24a,
  title     = {Risk Aware Benchmarking of Large Language Models},
  author    = {Nitsure, Apoorva and Mroueh, Youssef and Rigotti, Mattia and Greenewald, Kristjan and Belgodere, Brian and Yurochkin, Mikhail and Navratil, Jiri and Melnyk, Igor and Ross, Jarret},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {38264--38297},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/nitsure24a/nitsure24a.pdf},
  url       = {https://proceedings.mlr.press/v235/nitsure24a.html},
  abstract  = {We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.}
}
Endnote
%0 Conference Paper
%T Risk Aware Benchmarking of Large Language Models
%A Apoorva Nitsure
%A Youssef Mroueh
%A Mattia Rigotti
%A Kristjan Greenewald
%A Brian Belgodere
%A Mikhail Yurochkin
%A Jiri Navratil
%A Igor Melnyk
%A Jarret Ross
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-nitsure24a
%I PMLR
%P 38264--38297
%U https://proceedings.mlr.press/v235/nitsure24a.html
%V 235
%X We propose a distributional framework for benchmarking socio-technical risks of foundation models with quantified statistical significance. Our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. We show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. Using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. Inspired by portfolio optimization and selection theory in mathematical finance, we define a metrics portfolio for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. The statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. We use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.
APA
Nitsure, A., Mroueh, Y., Rigotti, M., Greenewald, K., Belgodere, B., Yurochkin, M., Navratil, J., Melnyk, I. & Ross, J. (2024). Risk Aware Benchmarking of Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:38264-38297. Available from https://proceedings.mlr.press/v235/nitsure24a.html.