Coordinate Descent Faceoff: Primal or Dual?

Dominik Csiba, Peter Richtárik
Proceedings of Algorithmic Learning Theory, PMLR 83:246-267, 2018.

Abstract

Randomized coordinate descent (RCD) methods are state-of-the-art algorithms for training linear predictors via minimizing regularized empirical risk. When the number of examples ($n$) is much larger than the number of features ($d$), a common strategy is to apply RCD to the dual problem. On the other hand, when the number of features is much larger than the number of examples, it makes sense to apply RCD directly to the primal problem. In this paper we provide the first joint study of these two approaches when applied to L2-regularized linear ERM. First, we show through a rigorous analysis that for dense data, the above intuition is precisely correct. However, we find that for sparse and structured data, primal RCD can significantly outperform dual RCD even if $d \ll n$, and vice versa, dual RCD can be much faster than primal RCD even if $n \ll d$. Moreover, we show that, surprisingly, a single sampling strategy minimizes both the (bound on the) number of iterations and the overall expected complexity of RCD. Note that the latter complexity measure also takes into account the average cost of the iterations, which depends on the structure and sparsity of the data, and on the sampling strategy employed. We confirm our theoretical predictions using extensive experiments with both synthetic and real data sets.
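To make the two approaches concrete, below is a minimal sketch of primal and dual RCD for L2-regularized linear ERM, specialized to the squared loss (i.e., ridge regression), where both coordinate steps have closed forms. The loss choice, uniform sampling, dense numpy data, and all variable names are illustrative assumptions, not taken from the paper, whose analysis also covers sparse data and optimized sampling distributions.

```python
# Sketch of primal vs. dual RCD for ridge regression:
#   P(w) = (1/2n) ||X w - y||^2 + (lam/2) ||w||^2.
# Primal RCD samples a feature j and updates w_j; dual RCD samples an
# example i and updates alpha_i (SDCA-style), maintaining w = X^T alpha / (lam n).
import numpy as np

def primal_rcd(X, y, lam, epochs=50, rng=None):
    """RCD on the primal: one coordinate of w (one feature) per step."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                    # residual X w - y, kept up to date
    col_sq = (X ** 2).sum(axis=0)    # precomputed ||X_:j||^2
    for _ in range(epochs * d):
        j = rng.integers(d)
        # exact minimization over w_j with all other coordinates fixed
        w_new = (col_sq[j] * w[j] - X[:, j] @ r) / (col_sq[j] + n * lam)
        r += X[:, j] * (w_new - w[j])
        w[j] = w_new
    return w

def dual_rcd(X, y, lam, epochs=50, rng=None):
    """RCD on the dual: one dual variable (one example) per step."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                  # maintained as w = X^T alpha / (lam * n)
    row_sq = (X ** 2).sum(axis=1)    # precomputed ||x_i||^2
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form exact coordinate ascent step for the squared loss
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + row_sq[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.standard_normal((200, 50))
    y = rng.standard_normal(200)
    wp = primal_rcd(X, y, lam=0.1)
    wd = dual_rcd(X, y, lam=0.1)
    print(np.linalg.norm(wp - wd))   # both approach the same ridge solution
```

Note the cost asymmetry the abstract refers to: for dense data, a primal step touches one column (cost O(n)) while a dual step touches one row (cost O(d)), which is the source of the "dual when n is large, primal when d is large" intuition; for sparse data the per-step cost depends on the nonzeros of the sampled column or row instead.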

Cite this Paper


BibTeX
@InProceedings{pmlr-v83-csiba18a,
  title     = {Coordinate Descent Faceoff: Primal or Dual?},
  author    = {Csiba, Dominik and Richtárik, Peter},
  booktitle = {Proceedings of Algorithmic Learning Theory},
  pages     = {246--267},
  year      = {2018},
  editor    = {Janoos, Firdaus and Mohri, Mehryar and Sridharan, Karthik},
  volume    = {83},
  series    = {Proceedings of Machine Learning Research},
  month     = {07--09 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v83/csiba18a/csiba18a.pdf},
  url       = {https://proceedings.mlr.press/v83/csiba18a.html}
}
Endnote
%0 Conference Paper
%T Coordinate Descent Faceoff: Primal or Dual?
%A Dominik Csiba
%A Peter Richtárik
%B Proceedings of Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Firdaus Janoos
%E Mehryar Mohri
%E Karthik Sridharan
%F pmlr-v83-csiba18a
%I PMLR
%P 246--267
%U https://proceedings.mlr.press/v83/csiba18a.html
%V 83
APA
Csiba, D. & Richtárik, P. (2018). Coordinate Descent Faceoff: Primal or Dual?. Proceedings of Algorithmic Learning Theory, in Proceedings of Machine Learning Research 83:246-267. Available from https://proceedings.mlr.press/v83/csiba18a.html.

Related Material

Download PDF: http://proceedings.mlr.press/v83/csiba18a/csiba18a.pdf