Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs

Faisal Hamman, Pasan Dissanayake, Saumitra Mishra, Freddy Lecue, Sanghamitra Dutta
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:21662-21684, 2025.

Abstract

Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise from variations in the training process (e.g., random seed, weight initialization, or minor changes to the training data), raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction’s consistency by analyzing (sampling) the model’s local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.
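To make the sampling idea concrete, here is a minimal Python sketch of a local-stability score of the kind the abstract describes. The function name local_stability, the Gaussian perturbation scheme, the radius sigma, and the agreement-fraction aggregation are illustrative assumptions, not the paper's exact definition; predict_label stands in for a fine-tuned Tabular LLM's classifier applied to input embeddings.

import numpy as np

def local_stability(predict_label, x_embed, n_samples=200, sigma=0.1, seed=0):
    # Monte-Carlo estimate of a local-stability score for one input.
    # predict_label: callable mapping an (n, d) array of embeddings to an
    #   (n,) array of predicted class labels (stand-in for a fine-tuned
    #   Tabular LLM's classifier over embeddings).
    # x_embed: (d,) embedding of the input whose prediction we score.
    # NOTE: Gaussian sampling and agreement-fraction aggregation are
    # illustrative choices; the paper's exact measure may differ.
    rng = np.random.default_rng(seed)
    base = predict_label(x_embed[None, :])[0]            # prediction at the input itself
    noise = rng.normal(scale=sigma, size=(n_samples, x_embed.shape[0]))
    labels = predict_label(x_embed[None, :] + noise)     # predictions on sampled neighbors
    return float(np.mean(labels == base))                # fraction agreeing with the base prediction

Under a guarantee of the kind the paper states, inputs scoring sufficiently high on such a measure remain consistent across a broad class of equally well-performing fine-tuned models with high probability, so low scores can preemptively flag predictions that warrant caution in high-stakes use.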

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hamman25a,
  title     = {Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular {LLM}s},
  author    = {Hamman, Faisal and Dissanayake, Pasan and Mishra, Saumitra and Lecue, Freddy and Dutta, Sanghamitra},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {21662--21684},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hamman25a/hamman25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hamman25a.html},
  abstract  = {Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise from variations in the training process (e.g., random seed, weight initialization, or minor changes to the training data), raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction’s consistency by analyzing (sampling) the model’s local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.}
}
Endnote
%0 Conference Paper
%T Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs
%A Faisal Hamman
%A Pasan Dissanayake
%A Saumitra Mishra
%A Freddy Lecue
%A Sanghamitra Dutta
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hamman25a
%I PMLR
%P 21662--21684
%U https://proceedings.mlr.press/v267/hamman25a.html
%V 267
%X Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise from variations in the training process (e.g., random seed, weight initialization, or minor changes to the training data), raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the consistency of individual predictions without expensive model retraining. Our measure quantifies a prediction’s consistency by analyzing (sampling) the model’s local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction consistency under a broad class of fine-tuned models, i.e., inputs with sufficiently high local stability (as defined by our measure) also remain consistent across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures.
APA
Hamman, F., Dissanayake, P., Mishra, S., Lecue, F. & Dutta, S. (2025). Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:21662-21684. Available from https://proceedings.mlr.press/v267/hamman25a.html.
