Finding Influential Training Samples for Gradient Boosted Decision Trees

Boris Sharchilev, Yury Ustinovskiy, Pavel Serdyukov, Maarten de Rijke
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4577-4585, 2018.

Abstract

We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model’s predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
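For concreteness, below is a minimal brute-force sketch of the leave-one-out influence notion the abstract formalizes: the influence of a training sample on a test prediction is the change in that prediction when the model is retrained without the sample. It uses scikit-learn's GradientBoostingRegressor; the function name loo_influences is our own for illustration. This exhaustive retraining is exactly the cost the paper's fixed-tree-structure approximations are designed to avoid, and those approximations are not shown here.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def loo_influences(X_train, y_train, x_test, **gbdt_params):
    """Brute-force LOO influence of each training point on the prediction at x_test."""
    # Prediction of the model trained on the full training set.
    full_model = GradientBoostingRegressor(**gbdt_params).fit(X_train, y_train)
    base_pred = full_model.predict(x_test.reshape(1, -1))[0]

    n = len(X_train)
    influences = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # drop training sample i
        loo_model = GradientBoostingRegressor(**gbdt_params).fit(X_train[mask], y_train[mask])
        # Influence = change in the test prediction caused by removing sample i.
        influences[i] = loo_model.predict(x_test.reshape(1, -1))[0] - base_pred
    return influences  # large |influence| => sample i strongly affects this prediction

This exact computation requires retraining the ensemble once per training sample; the paper's contribution is making this analysis tractable for GBDT by keeping tree structures fixed and only recomputing leaf values.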

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-sharchilev18a,
  title     = {Finding Influential Training Samples for Gradient Boosted Decision Trees},
  author    = {Sharchilev, Boris and Ustinovskiy, Yury and Serdyukov, Pavel and de Rijke, Maarten},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {4577--4585},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/sharchilev18a/sharchilev18a.pdf},
  url       = {https://proceedings.mlr.press/v80/sharchilev18a.html},
  abstract  = {We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model’s predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.}
}
Endnote
%0 Conference Paper
%T Finding Influential Training Samples for Gradient Boosted Decision Trees
%A Boris Sharchilev
%A Yury Ustinovskiy
%A Pavel Serdyukov
%A Maarten de Rijke
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-sharchilev18a
%I PMLR
%P 4577--4585
%U https://proceedings.mlr.press/v80/sharchilev18a.html
%V 80
%X We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model’s predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
APA
Sharchilev, B., Ustinovskiy, Y., Serdyukov, P. & de Rijke, M. (2018). Finding Influential Training Samples for Gradient Boosted Decision Trees. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:4577-4585. Available from https://proceedings.mlr.press/v80/sharchilev18a.html.