Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization

Brendan McMahan
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:525-533, 2011.

Abstract

We prove that many mirror descent algorithms for online convex optimization (such as online gradient descent) have an equivalent interpretation as follow-the-regularized-leader (FTRL) algorithms. This observation makes the relationships between many commonly used algorithms explicit, and provides theoretical insight on previous experimental observations. In particular, even though the FOBOS composite mirror descent algorithm handles $L_1$ regularization explicitly, it has been observed that the FTRL-style Regularized Dual Averaging (RDA) algorithm is even more effective at producing sparsity. Our results demonstrate that the key difference between these algorithms is how they handle the cumulative $L_1$ penalty. While FOBOS handles the $L_1$ term exactly on any given update, we show that it is effectively using subgradient approximations to the $L_1$ penalty from previous rounds, leading to less sparsity than RDA, which handles the cumulative penalty in closed form. The FTRL-Proximal algorithm, which we introduce, can be seen as a hybrid of these two algorithms, and significantly outperforms both on a large, real-world dataset.
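To make the abstract's distinction concrete, below is a minimal per-coordinate sketch (not code from the paper) contrasting how the two styles of update apply the L1 penalty. It assumes a fixed per-round step size eta for the FOBOS-style step and a simple quadratic regularizer (beta/2)*||x||^2 for the RDA-style closed form; the adaptive learning-rate schedules and the proximal centering used by FTRL-Proximal are omitted.

import numpy as np

def fobos_step(x, g, eta, lam):
    # One FOBOS / composite-mirror-descent round: take a gradient step,
    # then soft-threshold by the single-round L1 penalty eta * lam.
    z = x - eta * g
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

def rda_style_solution(grad_sum, t, lam, beta):
    # Closed-form minimizer of grad_sum . x + t*lam*||x||_1 + (beta/2)*||x||^2,
    # taken per coordinate: a weight is exactly zero unless the cumulative
    # gradient magnitude exceeds the cumulative L1 budget t * lam.
    shrunk = np.sign(grad_sum) * np.maximum(np.abs(grad_sum) - t * lam, 0.0)
    return -shrunk / beta

The contrast mirrors the abstract's point: the RDA-style solution compares the cumulative gradient against the cumulative threshold t*lam in closed form, so coordinates are set exactly to zero far more readily, whereas FOBOS only ever subtracts the single-round penalty eta*lam from an iterate that already encodes earlier L1 terms through their subgradients. FTRL-Proximal keeps the closed-form handling of the cumulative penalty while, like FOBOS, centering its quadratic stabilizer at the previous iterates.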

Cite this Paper


BibTeX
@InProceedings{pmlr-v15-mcmahan11b,
  title     = {Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization},
  author    = {McMahan, Brendan},
  booktitle = {Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {525--533},
  year      = {2011},
  editor    = {Gordon, Geoffrey and Dunson, David and Dudík, Miroslav},
  volume    = {15},
  series    = {Proceedings of Machine Learning Research},
  address   = {Fort Lauderdale, FL, USA},
  month     = {11--13 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v15/mcmahan11b/mcmahan11b.pdf},
  url       = {https://proceedings.mlr.press/v15/mcmahan11b.html}
}
APA
McMahan, B. (2011). Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 15:525-533. Available from https://proceedings.mlr.press/v15/mcmahan11b.html.
