Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning

Shuting Xie; Nathaniel Lesperance; Graham Taylor

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:997-1004, 2026.

Abstract

Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.
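The two evaluation ideas named in the abstract, micro discrimination and selective prediction under a single global rejection threshold, can be illustrated with a minimal sketch. This is not the authors' code: the rank names (`family`, `genus`, `species`), the toy confidence scores, and the helper functions `micro_auroc` and `selective_accuracy` are all hypothetical, and "micro" is assumed here to mean pooling every rank-wise prediction into one set before scoring.

```python
from typing import List, Tuple


def micro_auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC over pooled (confidence, correct) pairs.

    Computed as the fraction of (correct, incorrect) pairs in which the
    correct prediction has the higher confidence; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def selective_accuracy(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Apply one global threshold to every rank; return (coverage, accuracy)."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    coverage = len(kept) / len(scores)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Hypothetical toy data: per sample, (confidence, is_correct) at each rank.
records = [
    {"family": (0.9, 1), "genus": (0.8, 1), "species": (0.4, 0)},
    {"family": (0.7, 1), "genus": (0.3, 0), "species": (0.2, 0)},
    {"family": (0.6, 0), "genus": (0.5, 1), "species": (0.1, 0)},
]

# Micro pooling: flatten all ranks of all samples into one list.
scores = [s for r in records for s, _ in r.values()]
labels = [y for r in records for _, y in r.values()]

print(micro_auroc(scores, labels))              # pooled discrimination
print(selective_accuracy(scores, labels, 0.5))  # (coverage, accuracy)
```

Because the same threshold is applied to every rank, confidence scores have to be comparable across ranks for the abstention rule to behave sensibly, which is one way to read the abstract's point about accounting for hierarchical output structure.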

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-xie26a,
  title = 	 {Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning},
  author =       {Xie, Shuting and Lesperance, Nathaniel and Taylor, Graham},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {997--1004},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/xie26a/xie26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/xie26a.html},
  abstract = 	 {Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.}
}

Endnote

%0 Conference Paper
%T Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning
%A Shuting Xie
%A Nathaniel Lesperance
%A Graham Taylor
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-xie26a
%I PMLR
%P 997--1004
%U https://proceedings.mlr.press/v318/xie26a.html
%V 318
%X Large language models (LLMs) are increasingly used for scientific decision support, yet reliable confidence estimation remains difficult in black-box settings. We study uncertainty estimation for hierarchical taxonomic reasoning generated by a black-box LLM in a long-tailed biodiversity monitoring pipeline. Using proxy features extracted by an open-source tool LLM, we train lightweight supervised estimators with hierarchy-aware supervision to predict rank-wise correctness. Across three tool LLMs, the supervised estimators consistently outperform a token-likelihood baseline for micro discrimination and selective prediction under a single global rejection threshold, improving micro AUROC from 0.57 to 0.75-0.80. The best results are achieved by a rank-specific multi-head design (H3), suggesting that accounting for hierarchical output structure is important when a unified abstention rule is required.

APA

Xie, S., Lesperance, N., & Taylor, G. (2026). Hierarchy-Aware Supervised Uncertainty Estimation for Black-box LLM Taxonomic Reasoning. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:997-1004. Available from https://proceedings.mlr.press/v318/xie26a.html.
