Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression

Lucas Clarté, Adrien Vandenbroucque, Guillaume Dalle, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:787-819, 2024.

Abstract

We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable rate: $\alpha=n/d$ fixed. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha<1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.
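To make the setting concrete, here is a minimal Python sketch (illustrative only, not code from the paper) of the kind of procedure being analyzed: pair-bootstrap estimation of the variance of a ridge estimator at a fixed ratio $\alpha = n/d$. The dimensions, noise level, regularization strength, and number of resamples below are arbitrary assumptions chosen for the example.

```python
# Illustrative sketch: pair bootstrap for ridge regression in the
# proportional regime alpha = n / d (all parameter values are arbitrary).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 500, 250                       # alpha = n / d = 2.0
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

B = 200                               # number of bootstrap resamples
estimates = np.empty((B, d))
for b in range(B):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    # Note: sklearn's `alpha` is the ridge penalty, unrelated to alpha = n/d.
    model = Ridge(alpha=1.0, fit_intercept=False).fit(X[idx], y[idx])
    estimates[b] = model.coef_

# Bootstrap estimate of the per-coordinate variance of the ridge estimator.
boot_var = estimates.var(axis=0)
print(f"mean bootstrap variance across coordinates: {boot_var.mean():.3e}")
```

The paper's point is precisely about when such bootstrap variances can be trusted: reliable for large enough $\alpha$, but inconsistent in the over-parametrized regime $\alpha < 1$.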

Cite this Paper

BibTeX
@InProceedings{pmlr-v244-clarte24a,
  title     = {Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression},
  author    = {Clart\'e, Lucas and Vandenbroucque, Adrien and Dalle, Guillaume and Loureiro, Bruno and Krzakala, Florent and Zdeborov\'a, Lenka},
  booktitle = {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence},
  pages     = {787--819},
  year      = {2024},
  editor    = {Kiyavash, Negar and Mooij, Joris M.},
  volume    = {244},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v244/main/assets/clarte24a/clarte24a.pdf},
  url       = {https://proceedings.mlr.press/v244/clarte24a.html},
  abstract  = {We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable rate: $\alpha=n/d$ fixed. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha<1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.}
}
Endnote
%0 Conference Paper
%T Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression
%A Lucas Clarté
%A Adrien Vandenbroucque
%A Guillaume Dalle
%A Bruno Loureiro
%A Florent Krzakala
%A Lenka Zdeborová
%B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2024
%E Negar Kiyavash
%E Joris M. Mooij
%F pmlr-v244-clarte24a
%I PMLR
%P 787--819
%U https://proceedings.mlr.press/v244/clarte24a.html
%V 244
%X We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable rate: $\alpha=n/d$ fixed. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $\alpha$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $\alpha<1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.
APA
Clarté, L., Vandenbroucque, A., Dalle, G., Loureiro, B., Krzakala, F. & Zdeborová, L. (2024). Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:787-819. Available from https://proceedings.mlr.press/v244/clarte24a.html.