Streaming and Distributed Algorithms for Robust Column Subset Selection

Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, David Woodruff
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4971-4981, 2021.


We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\ell_p$-norm with $1 \leq p < 2$. We study the $\ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$), our algorithm achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}} \mathrm{poly}(\log nd)$-approximation to the error with respect to the best possible column subset of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\ell_{1,2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
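To make the two norms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the entrywise $\ell_p$ norm and the $\ell_{p,2}$ norm (the $p$-norm of the vector of column Euclidean norms), and then runs a generic textbook greedy column selection under the $\ell_{1,2}$ loss. The function names (`entrywise_lp_norm`, `lp2_norm`, `greedy_css_l12`), the list-of-rows matrix representation, and the specific greedy rule are assumptions made for illustration; the paper's greedy variant and its guarantees may differ in detail.

```python
import math

def columns(A):
    """Return the columns of A, where A is given as a list of rows."""
    return [list(col) for col in zip(*A)]

def entrywise_lp_norm(A, p):
    """Entrywise l_p norm: (sum over all entries of |A_ij|^p)^(1/p)."""
    return sum(abs(x) ** p for row in A for x in row) ** (1.0 / p)

def lp2_norm(A, p):
    """l_{p,2} norm: the l_p norm of the vector of column Euclidean norms."""
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in columns(A)]
    return sum(c ** p for c in col_norms) ** (1.0 / p)

def greedy_css_l12(A, k, tol=1e-12):
    """Hypothetical greedy sketch: pick up to k column indices of A,
    each time adding the column that most reduces the l_{1,2} residual.

    For a fixed subset, the l_{1,2} loss splits into one Euclidean
    least-squares problem per column, so the residual of each column is
    its component orthogonal to the span of the chosen columns.
    """
    residual = columns(A)  # residual[j] = column j minus its projection
    chosen = []
    cost = sum(math.sqrt(sum(x * x for x in c)) for c in residual)
    for _ in range(k):
        best = None  # (trial_cost, column index, trial residuals)
        for j, r in enumerate(residual):
            r_sq = sum(x * x for x in r)
            if j in chosen or r_sq <= tol:
                continue  # already selected, or already in the span
            # Remove the direction r from every residual column.
            trial = []
            for c in residual:
                coef = sum(a * b for a, b in zip(c, r)) / r_sq
                trial.append([a - coef * b for a, b in zip(c, r)])
            trial_cost = sum(math.sqrt(sum(x * x for x in c)) for c in trial)
            if best is None or trial_cost < best[0]:
                best = (trial_cost, j, trial)
        if best is None:
            break  # every remaining column is already spanned
        cost, j_star, residual = best
        chosen.append(j_star)
    return chosen, cost
```

For example, with `A = [[1, 0, 1, 2], [0, 1, 1, 0]]`, calling `greedy_css_l12(A, 1)` selects column index 0 and leaves an $\ell_{1,2}$ residual of 2, while `k = 2` spans all of $\mathbb{R}^2$ and drives the residual to zero. As a standard norm fact (independent of the paper), for $1 \leq p < 2$ every $x \in \mathbb{R}^d$ satisfies $\|x\|_2 \leq \|x\|_p \leq d^{\frac{1}{p} - \frac{1}{2}} \|x\|_2$, so the two matrix norms above differ by at most that factor.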

Cite this Paper

@InProceedings{pmlr-v139-jiang21e,
  title = {Streaming and Distributed Algorithms for Robust Column Subset Selection},
  author = {Jiang, Shuli and Li, Dennis and Li, Irene Mengze and Mahankali, Arvind V and Woodruff, David},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {4971--4981},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {},
  url = {},
  abstract = {We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\ell_p$-norm with $1 \leq p < 2$. We study the $\ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$), our algorithm achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\poly(\log nd)$-approximation to the error with respect to the \textit{best possible column subset} of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\ell_{1, 2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.}
}
%0 Conference Paper
%T Streaming and Distributed Algorithms for Robust Column Subset Selection
%A Shuli Jiang
%A Dennis Li
%A Irene Mengze Li
%A Arvind V Mahankali
%A David Woodruff
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-jiang21e
%I PMLR
%P 4971--4981
%U
%V 139
%X We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\ell_p$-norm with $1 \leq p < 2$. We study the $\ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$), our algorithm achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\poly(\log nd)$-approximation to the error with respect to the \textit{best possible column subset} of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\ell_{1, 2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
Jiang, S., Li, D., Li, I.M., Mahankali, A.V. & Woodruff, D. (2021). Streaming and Distributed Algorithms for Robust Column Subset Selection. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4971-4981 Available from
