Assessing Large Language Models on Climate Information

Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Huebscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauss
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:4884-4935, 2024.

Abstract

As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-bulian24a,
  title     = {Assessing Large Language Models on Climate Information},
  author    = {Bulian, Jannis and Sch\"{a}fer, Mike S. and Amini, Afra and Lam, Heidi and Ciaramita, Massimiliano and Gaiarin, Ben and Chen Huebscher, Michelle and Buck, Christian and Mede, Niels G. and Leippold, Markus and Strauss, Nadine},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {4884--4935},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/bulian24a/bulian24a.pdf},
  url       = {https://proceedings.mlr.press/v235/bulian24a.html},
  abstract  = {As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.}
}
Endnote
%0 Conference Paper
%T Assessing Large Language Models on Climate Information
%A Jannis Bulian
%A Mike S. Schäfer
%A Afra Amini
%A Heidi Lam
%A Massimiliano Ciaramita
%A Ben Gaiarin
%A Michelle Chen Huebscher
%A Christian Buck
%A Niels G. Mede
%A Markus Leippold
%A Nadine Strauss
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-bulian24a
%I PMLR
%P 4884--4935
%U https://proceedings.mlr.press/v235/bulian24a.html
%V 235
%X As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
APA
Bulian, J., Schäfer, M.S., Amini, A., Lam, H., Ciaramita, M., Gaiarin, B., Chen Huebscher, M., Buck, C., Mede, N.G., Leippold, M. & Strauss, N. (2024). Assessing Large Language Models on Climate Information. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:4884-4935. Available from https://proceedings.mlr.press/v235/bulian24a.html.
