Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

Anshuman Chhabra, Bo Li, Jian Chen, Prasant Mohapatra, Hongfu Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10334-10353, 2025.

Abstract

A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples for improving the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
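As background, the influence-function framework the abstract builds on (Koh & Liang, 2017) scores a training point z by the effect that upweighting it has on the loss at a test point z_test:

$$ \mathcal{I}(z, z_{\mathrm{test}}) \;=\; -\,\nabla_\theta L(z_{\mathrm{test}}, \hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_\theta L(z, \hat{\theta}), \qquad H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta}), $$

where $\hat{\theta}$ denotes the trained model parameters. The inverse Hessian $H_{\hat{\theta}}^{-1}$ is the computational bottleneck the paper removes. Below is a minimal sketch of the Hessian-free idea as described in the abstract: compute per-sample loss gradients and treat gradient-space outliers as candidate detrimental samples. The specific detector (an isolation forest) and the helper names are illustrative assumptions, not necessarily the paper's exact procedure; see the linked PDF for the authors' formulation.

# Illustrative sketch only: per-sample gradients plus off-the-shelf outlier
# detection. The detector choice (IsolationForest) is an assumption, not
# necessarily the procedure used in the paper.
import numpy as np
import torch
from sklearn.ensemble import IsolationForest

def per_sample_gradients(model, loss_fn, xs, ys):
    """Return an (n, d) matrix of flattened per-sample loss gradients."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.detach().cpu().numpy())
    return np.stack(grads)

def flag_detrimental(grad_matrix, contamination=0.1):
    """Mark gradient-space outliers as candidate detrimental samples."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    labels = detector.fit_predict(grad_matrix)  # -1 marks an outlier
    return labels == -1

Because no Hessian (or Hessian-vector product) is formed, the cost is one backward pass per sample plus the outlier detector, which is what makes the approach attractive for large models.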

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-chhabra25a,
  title = {Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models},
  author = {Chhabra, Anshuman and Li, Bo and Chen, Jian and Mohapatra, Prasant and Liu, Hongfu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {10334--10353},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chhabra25a/chhabra25a.pdf},
  url = {https://proceedings.mlr.press/v267/chhabra25a.html},
  abstract = {A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples for improving the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.}
}
Endnote
%0 Conference Paper
%T Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models
%A Anshuman Chhabra
%A Bo Li
%A Jian Chen
%A Prasant Mohapatra
%A Hongfu Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chhabra25a
%I PMLR
%P 10334--10353
%U https://proceedings.mlr.press/v267/chhabra25a.html
%V 267
%X A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples for improving the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
APA
Chhabra, A., Li, B., Chen, J., Mohapatra, P. & Liu, H. (2025). Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10334-10353. Available from https://proceedings.mlr.press/v267/chhabra25a.html.
