Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning

Shuting Xie, Nathaniel Lesperance, Graham Taylor
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:997-1004, 2026.

Abstract

Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.
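The two evaluation ideas named in the abstract, micro (pooled across ranks) AUROC and selective prediction under a single global rejection threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: the rank names, confidence scores, correctness labels, and the helper names `micro_auroc` and `selective_accuracy` are all hypothetical, and the toy data is made up purely for illustration.

```python
from typing import List, Tuple


def micro_auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC over predictions pooled across all ranks; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Fraction of (correct, incorrect) pairs ranked in the right order.
    return wins / (len(pos) * len(neg))


def selective_accuracy(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy on accepted predictions) for one global threshold."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    coverage = len(kept) / len(scores)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Hypothetical toy data: (confidence scores, correctness labels) per taxonomic rank.
per_rank = {
    "family": ([0.9, 0.8, 0.4], [1, 1, 0]),
    "genus": ([0.7, 0.6, 0.3], [1, 0, 0]),
    "species": ([0.5, 0.2, 0.1], [1, 0, 0]),
}

# "Micro" evaluation pools every rank-level prediction into one list.
scores = [s for rank_scores, _ in per_rank.values() for s in rank_scores]
labels = [y for _, rank_labels in per_rank.values() for y in rank_labels]

auroc = micro_auroc(scores, labels)
coverage, accuracy = selective_accuracy(scores, labels, threshold=0.5)
print(f"micro AUROC={auroc:.2f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Because one threshold is applied to all ranks at once, scores from different ranks must be comparable with one another, which is one plausible reading of why the abstract emphasizes hierarchy-aware, rank-specific estimators.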

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-xie26a,
  title = {Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning},
  author = {Xie, Shuting and Lesperance, Nathaniel and Taylor, Graham},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {997--1004},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/xie26a/xie26a.pdf},
  url = {https://proceedings.mlr.press/v318/xie26a.html},
  abstract = {Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.}
}
Endnote
%0 Conference Paper
%T Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning
%A Shuting Xie
%A Nathaniel Lesperance
%A Graham Taylor
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-xie26a
%I PMLR
%P 997--1004
%U https://proceedings.mlr.press/v318/xie26a.html
%V 318
%X Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.
APA
Xie, S., Lesperance, N. & Taylor, G. (2026). Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:997-1004. Available from https://proceedings.mlr.press/v318/xie26a.html.
