Estimation and Inference with Trees and Forests in High Dimensions

Vasilis Syrgkanis, Manolis Zampetakis
; Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:3453-3454, 2020.

Abstract

Regression Trees [Breiman et al. 1984] and Random Forests [Breiman 2001], are one of the most widely used estimation methods by machine learning practitioners. Despite their widespread use, their theoretical underpinnings are far from being fully understood. Recent breakthrough advances have shown that such greedily built trees are asymptotically consistent [Biau et al. 2010, Denil et al. 2014, Scornet et al. 2015] in the low dimensional regime, where the number of features is a constant, independent of the sample size. Also, the works of [Mentch et al. 2016, Wager et al. 2018] provide asymptotic normality results for honest versions of Random Forests. In this work, we analyze the performance of regression trees and forests with binary features in the high-dimensional regime, where the number of features can grow exponentially with the number of samples. We show that trees and forests built greedily based on the celebrated CART criterion, provably adapt to sparsity: when only a subset $R$, of size $r$, of the features are relevant, then the mean squared error of appropriately shallow trees, or fully grown honest forests, scales exponentially only with the number of relevant features and depends only logarithmically on the overall number of features. More precisely, we identify three regimes, each providing different dependence on the number of relevant features. When the relevant variables are “weakly” relevant (in the sense that there is not strong separation between the relevant and irrelevant variables in terms of their ability to reduce variance), then shallow trees achieve “slow rates” on the mean squared error of the order of $2^r/\sqrt{n}$, when variables are independent, and $1/n^{1/(r+2)}$, when variables are dependent. When the relevant variables are “strongly” relevant, in that there is a separation in their ability to reduce variance as compared to the irrelevant ones, by a constant $\beta_{\min}$, then we show that greedily built shallow trees and fully grown honest forests can achieve fast parametric mean squared error rates of the order of $2^r/(\beta_{\min}\,{n})$. When variables are strongly relevant, we also show that the predictions of sub-sampled honest forests have an asymptotically normal distribution centered around their true values and whose variance scales at most as $O(2^r \log(n)/(\beta_{\min}\,{n}))$. Thus, sub-sampled honest forests are provably a data-adaptive method for non-parametric inference, that adapts to the latent sparsity dimension of the data generating distribution, as opposed to classical non-parametric regression approaches. Our results show that, at least for the case of binary features, forest based algorithms can offer immense improvement on the statistical power of non-parametric hypothesis tests in high-dimensional regimes.

Cite this Paper


BibTeX
@InProceedings{pmlr-v125-syrgkanis20a, title = {Estimation and Inference with Trees and Forests in High Dimensions}, author = {Syrgkanis, Vasilis and Zampetakis, Manolis}, pages = {3453--3454}, year = {2020}, editor = {Jacob Abernethy and Shivani Agarwal}, volume = {125}, series = {Proceedings of Machine Learning Research}, address = {}, month = {09--12 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v125/syrgkanis20a/syrgkanis20a.pdf}, url = {http://proceedings.mlr.press/v125/syrgkanis20a.html}, abstract = { Regression Trees [Breiman et al. 1984] and Random Forests [Breiman 2001], are one of the most widely used estimation methods by machine learning practitioners. Despite their widespread use, their theoretical underpinnings are far from being fully understood. Recent breakthrough advances have shown that such greedily built trees are asymptotically consistent [Biau et al. 2010, Denil et al. 2014, Scornet et al. 2015] in the low dimensional regime, where the number of features is a constant, independent of the sample size. Also, the works of [Mentch et al. 2016, Wager et al. 2018] provide asymptotic normality results for honest versions of Random Forests. In this work, we analyze the performance of regression trees and forests with binary features in the high-dimensional regime, where the number of features can grow exponentially with the number of samples. We show that trees and forests built greedily based on the celebrated CART criterion, provably adapt to sparsity: when only a subset $R$, of size $r$, of the features are relevant, then the mean squared error of appropriately shallow trees, or fully grown honest forests, scales exponentially only with the number of relevant features and depends only logarithmically on the overall number of features. More precisely, we identify three regimes, each providing different dependence on the number of relevant features. When the relevant variables are “weakly” relevant (in the sense that there is not strong separation between the relevant and irrelevant variables in terms of their ability to reduce variance), then shallow trees achieve “slow rates” on the mean squared error of the order of $2^r/\sqrt{n}$, when variables are independent, and $1/n^{1/(r+2)}$, when variables are dependent. When the relevant variables are “strongly” relevant, in that there is a separation in their ability to reduce variance as compared to the irrelevant ones, by a constant $\beta_{\min}$, then we show that greedily built shallow trees and fully grown honest forests can achieve fast parametric mean squared error rates of the order of $2^r/(\beta_{\min}\,{n})$. When variables are strongly relevant, we also show that the predictions of sub-sampled honest forests have an asymptotically normal distribution centered around their true values and whose variance scales at most as $O(2^r \log(n)/(\beta_{\min}\,{n}))$. Thus, sub-sampled honest forests are provably a data-adaptive method for non-parametric inference, that adapts to the latent sparsity dimension of the data generating distribution, as opposed to classical non-parametric regression approaches. Our results show that, at least for the case of binary features, forest based algorithms can offer immense improvement on the statistical power of non-parametric hypothesis tests in high-dimensional regimes. } }
Endnote
%0 Conference Paper %T Estimation and Inference with Trees and Forests in High Dimensions %A Vasilis Syrgkanis %A Manolis Zampetakis %B Proceedings of Thirty Third Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2020 %E Jacob Abernethy %E Shivani Agarwal %F pmlr-v125-syrgkanis20a %I PMLR %J Proceedings of Machine Learning Research %P 3453--3454 %U http://proceedings.mlr.press %V 125 %W PMLR %X Regression Trees [Breiman et al. 1984] and Random Forests [Breiman 2001], are one of the most widely used estimation methods by machine learning practitioners. Despite their widespread use, their theoretical underpinnings are far from being fully understood. Recent breakthrough advances have shown that such greedily built trees are asymptotically consistent [Biau et al. 2010, Denil et al. 2014, Scornet et al. 2015] in the low dimensional regime, where the number of features is a constant, independent of the sample size. Also, the works of [Mentch et al. 2016, Wager et al. 2018] provide asymptotic normality results for honest versions of Random Forests. In this work, we analyze the performance of regression trees and forests with binary features in the high-dimensional regime, where the number of features can grow exponentially with the number of samples. We show that trees and forests built greedily based on the celebrated CART criterion, provably adapt to sparsity: when only a subset $R$, of size $r$, of the features are relevant, then the mean squared error of appropriately shallow trees, or fully grown honest forests, scales exponentially only with the number of relevant features and depends only logarithmically on the overall number of features. More precisely, we identify three regimes, each providing different dependence on the number of relevant features. When the relevant variables are “weakly” relevant (in the sense that there is not strong separation between the relevant and irrelevant variables in terms of their ability to reduce variance), then shallow trees achieve “slow rates” on the mean squared error of the order of $2^r/\sqrt{n}$, when variables are independent, and $1/n^{1/(r+2)}$, when variables are dependent. When the relevant variables are “strongly” relevant, in that there is a separation in their ability to reduce variance as compared to the irrelevant ones, by a constant $\beta_{\min}$, then we show that greedily built shallow trees and fully grown honest forests can achieve fast parametric mean squared error rates of the order of $2^r/(\beta_{\min}\,{n})$. When variables are strongly relevant, we also show that the predictions of sub-sampled honest forests have an asymptotically normal distribution centered around their true values and whose variance scales at most as $O(2^r \log(n)/(\beta_{\min}\,{n}))$. Thus, sub-sampled honest forests are provably a data-adaptive method for non-parametric inference, that adapts to the latent sparsity dimension of the data generating distribution, as opposed to classical non-parametric regression approaches. Our results show that, at least for the case of binary features, forest based algorithms can offer immense improvement on the statistical power of non-parametric hypothesis tests in high-dimensional regimes.
APA
Syrgkanis, V. & Zampetakis, M.. (2020). Estimation and Inference with Trees and Forests in High Dimensions. Proceedings of Thirty Third Conference on Learning Theory, in PMLR 125:3453-3454

Related Material