The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation

Jeffrey A. Cardille, Renee Johnston, Simon Ilyushchenko, Johan Kartiwa, Zahra Shamsi, Matthew Abraham, Khashayar Azad, Kainath Ahmed, Emma Bergeron Quick, Nuala Caughie, Noah Jencz, Karen Dyson, Andrea Puzzi Nicolau, Maria Fernanda Lopez-Ornelas, David Saah, Michael Brenner, Subhashini Venugopalan, Sameera S Ponda
Proceedings of The TerraBytes ICML Workshop: Towards global datasets and models for Earth Observation, PMLR 292:63-80, 2025.

Abstract

With the increasing skill and adoption of Large Language Models (LLMs) in the sciences, evaluating their capability across a wide variety of application domains is crucial. This work focuses on evaluating LLM-based agents on Earth Observation tasks, particularly those involving the analysis of satellite imagery and geospatial data. We introduce the Cloud-Based Geospatial Benchmark (CBGB), a set of challenges designed to measure how well LLMs can generate code that produces short numerical answers to 45 practical scenarios in geography and environmental science. While the benchmark questions are framed to assess broadly applicable geospatial data analysis skills, their implementation is most readily achieved using the extensive data catalogs and powerful APIs of platforms such as Earth Engine. The questions and reference solutions in CBGB were curated by experts with both domain familiarity in Earth Observation and programming expertise, and an estimated difficulty is included for each problem. We evaluate the performance of frontier LLMs on these tasks with and without access to an execution environment that provides error-correction feedback. Using the benchmark, we assess how LLMs handle practical Earth Observation questions across a range of difficulty levels. We find that models with error-correction feedback, which mirrors the iterative development process common in geospatial analyses, tend to perform consistently better, with the highest performance at 71%; reasoning variants of models outperformed their non-thinking counterparts. We also share detailed guidelines on curating such practical scenarios and assessing their ability to evaluate agents in the geospatial domain. The benchmark and evaluation code are available on GitHub at https://github.com/google/earthengine-community/tree/master/experimental/cbgb_benchmark.
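To illustrate the kind of task the benchmark targets, the sketch below shows how a hypothetical CBGB-style question ("What is the mean elevation, in metres, of Switzerland?") might be answered with a short script using the Earth Engine Python API. The question, dataset choices, scale, and expected-answer handling are illustrative assumptions, not items drawn from the benchmark itself.

    import ee

    # Connect to Earth Engine (assumes prior authentication and a registered Cloud project).
    ee.Initialize()

    # Hypothetical CBGB-style question: mean elevation of Switzerland, in metres.
    dem = ee.Image('USGS/SRTMGL1_003')                        # SRTM 30 m digital elevation model
    countries = ee.FeatureCollection('FAO/GAUL/2015/level0')  # country boundaries
    switzerland = countries.filter(ee.Filter.eq('ADM0_NAME', 'Switzerland')).geometry()

    # Reduce the DEM over the country geometry to a single number, the kind of short
    # numerical answer CBGB expects.
    mean_elevation = dem.reduceRegion(
        reducer=ee.Reducer.mean(),
        geometry=switzerland,
        scale=90,          # coarser than native resolution to keep the reduction inexpensive
        maxPixels=1e9,
    ).get('elevation')

    print(mean_elevation.getInfo())  # prints one numerical answer for grading

In the evaluated setting with execution feedback, a script like this would be run, any runtime errors or empty results would be returned to the model for another attempt, and the final numerical answer would then be compared against the expert reference solution.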

Cite this Paper


BibTeX
@InProceedings{pmlr-v292-cardille25a,
  title     = {The Cloud-Based Geospatial Benchmark: Challenges and {LLM} Evaluation},
  author    = {Cardille, Jeffrey A. and Johnston, Renee and Ilyushchenko, Simon and Kartiwa, Johan and Shamsi, Zahra and Abraham, Matthew and Azad, Khashayar and Ahmed, Kainath and {Bergeron Quick}, Emma and Caughie, Nuala and Jencz, Noah and Dyson, Karen and {Puzzi Nicolau}, Andrea and Lopez-Ornelas, Maria Fernanda and Saah, David and Brenner, Michael and Venugopalan, Subhashini and Ponda, Sameera S},
  booktitle = {Proceedings of The TerraBytes {ICML} Workshop: Towards global datasets and models for Earth Observation},
  pages     = {63--80},
  year      = {2025},
  editor    = {Audebert, Nicolas and Azizpour, Hossein and Barrière, Valentin and Castillo Navarro, Javiera and Czerkawski, Mikolaj and Fang, Heng and Francis, Alistair and Marsocci, Valerio and Nascetti, Andrea and Yadav, Ritu},
  volume    = {292},
  series    = {Proceedings of Machine Learning Research},
  month     = {19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v292/main/assets/cardille25a/cardille25a.pdf},
  url       = {https://proceedings.mlr.press/v292/cardille25a.html},
  abstract  = {With the increasing skill and adoption of Large Language Models (LLMs) in the sciences, evaluating their capability in a wide variety of application domains is crucial. This work focuses on evaluating LLM-based agents on Earth Observation tasks, particularly those involving the analysis of satellite imagery and geospatial data. We introduce the Cloud-Based Geospatial Benchmark (CBGB), a set of challenges designed to measure how well LLMs can generate code to provide short numerical answers to 45 practical scenarios in geography and environmental science. While the benchmark questions are framed to assess broadly applicable geospatial data analysis skills, their implementation is most readily achieved using the extensive data catalogs and powerful APIs of platforms like Earth Engine. The questions and reference solutions in CBGB were curated from experts with both domain familiarity in Earth Observation and programming expertise. We also estimate and include the difficulty of each problem. We evaluate the performance of frontier LLMs on these tasks with and without access to an execution environment for error-correction based feedback. Using the benchmark we assess how LLMs operate on practical Earth Observation questions across a range of difficulty levels. We find that models with the error-correction feedback, which mirrors the iterative development process common in geospatial analyses, tend to perform consistently better with the highest performance at 71%; the reasoning variants of models outperformed the non-thinking versions. We also share detailed guidelines on curating such practical scenarios and assessing their ability to evaluate agents in the geospatial domain. The benchmark and evaluation code are available on Github \url{https://github.com/google/earthengine-community/tree/master/experimental/cbgb_benchmark}.}
}
Endnote
%0 Conference Paper
%T The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation
%A Jeffrey A. Cardille
%A Renee Johnston
%A Simon Ilyushchenko
%A Johan Kartiwa
%A Zahra Shamsi
%A Matthew Abraham
%A Khashayar Azad
%A Kainath Ahmed
%A Emma Bergeron Quick
%A Nuala Caughie
%A Noah Jencz
%A Karen Dyson
%A Andrea Puzzi Nicolau
%A Maria Fernanda Lopez-Ornelas
%A David Saah
%A Michael Brenner
%A Subhashini Venugopalan
%A Sameera S Ponda
%B Proceedings of The TerraBytes {ICML} Workshop: Towards global datasets and models for Earth Observation
%C Proceedings of Machine Learning Research
%D 2025
%E Nicolas Audebert
%E Hossein Azizpour
%E Valentin Barrière
%E Javiera Castillo Navarro
%E Mikolaj Czerkawski
%E Heng Fang
%E Alistair Francis
%E Valerio Marsocci
%E Andrea Nascetti
%E Ritu Yadav
%F pmlr-v292-cardille25a
%I PMLR
%P 63--80
%U https://proceedings.mlr.press/v292/cardille25a.html
%V 292
%X With the increasing skill and adoption of Large Language Models (LLMs) in the sciences, evaluating their capability in a wide variety of application domains is crucial. This work focuses on evaluating LLM-based agents on Earth Observation tasks, particularly those involving the analysis of satellite imagery and geospatial data. We introduce the Cloud-Based Geospatial Benchmark (CBGB), a set of challenges designed to measure how well LLMs can generate code to provide short numerical answers to 45 practical scenarios in geography and environmental science. While the benchmark questions are framed to assess broadly applicable geospatial data analysis skills, their implementation is most readily achieved using the extensive data catalogs and powerful APIs of platforms like Earth Engine. The questions and reference solutions in CBGB were curated from experts with both domain familiarity in Earth Observation and programming expertise. We also estimate and include the difficulty of each problem. We evaluate the performance of frontier LLMs on these tasks with and without access to an execution environment for error-correction based feedback. Using the benchmark we assess how LLMs operate on practical Earth Observation questions across a range of difficulty levels. We find that models with the error-correction feedback, which mirrors the iterative development process common in geospatial analyses, tend to perform consistently better with the highest performance at 71%; the reasoning variants of models outperformed the non-thinking versions. We also share detailed guidelines on curating such practical scenarios and assessing their ability to evaluate agents in the geospatial domain. The benchmark and evaluation code are available on Github \url{https://github.com/google/earthengine-community/tree/master/experimental/cbgb_benchmark}.
APA
Cardille, J.A., Johnston, R., Ilyushchenko, S., Kartiwa, J., Shamsi, Z., Abraham, M., Azad, K., Ahmed, K., Bergeron Quick, E., Caughie, N., Jencz, N., Dyson, K., Puzzi Nicolau, A., Lopez-Ornelas, M.F., Saah, D., Brenner, M., Venugopalan, S. & Ponda, S.S. (2025). The Cloud-Based Geospatial Benchmark: Challenges and LLM Evaluation. Proceedings of The TerraBytes ICML Workshop: Towards global datasets and models for Earth Observation, in Proceedings of Machine Learning Research 292:63-80. Available from https://proceedings.mlr.press/v292/cardille25a.html.
