Generalization Bounds in the Presence of Outliers: a Median-of-Means Study

Pierre Laforgue; Guillaume Staerman; Stephan Clémençon

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5937-5947, 2021.

Abstract

In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $\theta$ of a square integrable random variable $Z$ around which accurate nonasymptotic confidence bounds can be built, even when $Z$ does not exhibit sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work has sought to characterize and leverage MoM's ability to deal with corrupted data. In this context, the present work proposes a general study of MoM's concentration properties under the contamination regime, which provides a clear understanding of the impact of the outlier proportion and of the number of blocks chosen. The analysis is extended to (multisample) U-statistics, i.e. averages over tuples of observations, which raise additional challenges due to the dependence they induce. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.
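As an illustration of the estimator the abstract refers to, the following is a minimal Python sketch of the basic Median-of-Means: shuffle the sample, split it into blocks, average each block, and return the median of the block means. It is not the authors' code or the algorithm proposed in the paper; the function name `median_of_means`, its parameters `n_blocks` and `seed`, and the toy data are assumptions made for this example only.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Basic Median-of-Means estimate of the mean (illustrative sketch).

    values   : sequence of real observations
    n_blocks : number of disjoint blocks (hypothetical parameter name)
    seed     : seed for the shuffle, so the result is reproducible
    """
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    # Equal-size blocks; any leftover observations are dropped.
    block_size = len(shuffled) // n_blocks
    block_means = [
        statistics.fmean(shuffled[k * block_size:(k + 1) * block_size])
        for k in range(n_blocks)
    ]
    return statistics.median(block_means)


# Toy contaminated sample: 95 inliers equal to 1.0 and 5 gross outliers.
data = [1.0] * 95 + [1000.0] * 5

mom_estimate = median_of_means(data, n_blocks=11)
empirical_mean = statistics.fmean(data)
```

In this toy sample the 5 outliers can contaminate at most 5 of the 11 blocks, so a majority of block means are unaffected and the median stays at the inlier value, whereas the empirical mean is pulled far away from it. This mirrors, in a very simplified way, the interplay between the outlier proportion and the number of blocks that the abstract mentions.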

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-laforgue21a,
  title = 	 {Generalization Bounds in the Presence of Outliers: a Median-of-Means Study},
  author =       {Laforgue, Pierre and Staerman, Guillaume and Cl{\'e}men{\c{c}}on, Stephan},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {5937--5947},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/laforgue21a/laforgue21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/laforgue21a.html},
  abstract = 	 {In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $\theta$ of a square integrable r.v. Z, around which accurate nonasymptotic confidence bounds can be built, even when Z does not exhibit a sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work is now trying to characterize and leverage MoM’s ability to deal with corrupted data. In this context, the present work proposes a general study of MoM’s concentration properties under the contamination regime, that provides a clear understanding on the impact of the outlier proportion and the number of blocks chosen. The analysis is extended to (multisample) U-statistics, i.e. averages over tuples of observations, that raise additional challenges due to the dependence induced. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.}
}

Endnote

%0 Conference Paper
%T Generalization Bounds in the Presence of Outliers: a Median-of-Means Study
%A Pierre Laforgue
%A Guillaume Staerman
%A Stephan Clémençon
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-laforgue21a
%I PMLR
%P 5937--5947
%U https://proceedings.mlr.press/v139/laforgue21a.html
%V 139
%X In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $\theta$ of a square integrable r.v. Z, around which accurate nonasymptotic confidence bounds can be built, even when Z does not exhibit a sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work is now trying to characterize and leverage MoM’s ability to deal with corrupted data. In this context, the present work proposes a general study of MoM’s concentration properties under the contamination regime, that provides a clear understanding on the impact of the outlier proportion and the number of blocks chosen. The analysis is extended to (multisample) U-statistics, i.e. averages over tuples of observations, that raise additional challenges due to the dependence induced. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.

APA

Laforgue, P., Staerman, G. & Clémençon, S. (2021). Generalization Bounds in the Presence of Outliers: a Median-of-Means Study. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5937-5947. Available from https://proceedings.mlr.press/v139/laforgue21a.html.
