Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

Michał Dereziński
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:3137-3172, 2023.

Abstract

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions. However, this phenomenon has only been studied for specific tasks and metrics, or by relying on computationally expensive methods. We address this by providing an algorithmic framework for gaussianizing data using sparse sketching operators, proving that it is possible to efficiently construct data sketches that are nearly indistinguishable (in terms of total variation distance) from sub-gaussian random designs. In particular, relying on a recently introduced sketching technique called Leverage Score Sparsified (LESS) embeddings, we show that one can construct an $n\times d$ sketch of an $N\times d$ matrix $A$, where $n\ll N$, that is nearly indistinguishable from a sub-gaussian design, in time $O(\mathrm{nnz}(A)\log N + nd^2)$, where $\mathrm{nnz}(A)$ is the number of non-zero entries in $A$. As a consequence, strong statistical guarantees and precise asymptotics available for the estimators produced from sub-gaussian designs (e.g., for least squares and Lasso regression, covariance estimation, low-rank approximation, etc.) can be straightforwardly adapted to our sketching framework. We illustrate this with a new approximation guarantee for sketched least squares, among other examples. The key technique that enables our analysis is a novel variant of the Hanson-Wright inequality on the concentration of random quadratic forms, which we establish for random vectors that arise from sparse sketches.
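
For intuition, the flavor of the construction can be mimicked in a few lines of numpy. The following is a minimal, illustrative sketch under simplifying assumptions, not the paper's implementation: it computes exact leverage scores via a thin QR factorization (the stated $O(\mathrm{nnz}(A)\log N + nd^2)$ runtime instead relies on fast approximate leverage scores), and the names less_sketch and leverage_scores are hypothetical.

    import numpy as np

    def leverage_scores(A):
        # Exact leverage scores via a thin QR factorization: l_i = ||Q[i]||^2.
        # (The paper's runtime relies on fast *approximate* scores instead.)
        Q, _ = np.linalg.qr(A)
        return (Q * Q).sum(axis=1)

    def less_sketch(A, n, s=None, rng=None):
        # LESS-style sketch: each of the n rows of SA is built from s nonzero
        # sketching entries, with positions drawn i.i.d. from the
        # leverage-score distribution, Rademacher-signed and reweighted
        # so that E[S^T S] = I, hence (SA)^T (SA) is unbiased for A^T A.
        rng = np.random.default_rng(rng)
        N, d = A.shape
        s = d if s is None else s            # ~d nonzeros per row, as in LESS
        p = leverage_scores(A)
        p = p / p.sum()                      # sampling probabilities p_j ~ l_j/d
        SA = np.zeros((n, d))
        for i in range(n):
            idx = rng.choice(N, size=s, p=p)         # sampled row indices of A
            signs = rng.choice([-1.0, 1.0], size=s)  # random signs
            w = signs / np.sqrt(s * p[idx])          # unbiasing weights
            SA[i] = (w @ A[idx]) / np.sqrt(n)        # one row of the sketch SA
        return SA

    # Demo: sketch-and-solve least squares on the small (n x d) sketch.
    rng = np.random.default_rng(0)
    N, d, n = 20000, 20, 400
    A = rng.standard_normal((N, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
    SAb = less_sketch(np.column_stack([A, b]), n, rng=rng)   # sketch [A, b]
    x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_sk, *_ = np.linalg.lstsq(SAb[:, :-1], SAb[:, -1], rcond=None)
    print(np.linalg.norm(x_sk - x_full) / np.linalg.norm(x_full))

In this toy run the sketched solution should track the full least squares solution, with the error shrinking as n grows; the paper's contribution is showing that a sketch of this kind is close in total variation distance to a genuine sub-gaussian design, so guarantees proved for sub-gaussian designs transfer to it directly.
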

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-derezinski23a,
  title     = {Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs},
  author    = {Derezi{\'n}ski, Micha{\l}},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {3137--3172},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/derezinski23a/derezinski23a.pdf},
  url       = {https://proceedings.mlr.press/v195/derezinski23a.html}
}
APA
Dereziński, M. (2023). Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:3137-3172. Available from https://proceedings.mlr.press/v195/derezinski23a.html.
