Winnowing with Gradient Descent

Ehsan Amid, Manfred K. Warmuth
; Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:163-182, 2020.

Abstract

The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent updates. We obtain this result by rewriting the non-negative weights $w_i$ of multiplicative updates by $u_i^2$ and then performing a gradient descent step w.r.t. the new $u_i$ parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights $w_i$ are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the $u_i$ parameters are now divided by $\Vert\mathbf{u}\Vert_2$ after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit in some cases, with slightly different constants). As a side, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can experimentally solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.

Cite this Paper


BibTeX
@InProceedings{pmlr-v125-amid20a, title = {Winnowing with Gradient Descent}, author = {Amid, Ehsan and Warmuth, Manfred K.}, pages = {163--182}, year = {2020}, editor = {Jacob Abernethy and Shivani Agarwal}, volume = {125}, series = {Proceedings of Machine Learning Research}, address = {}, month = {09--12 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v125/amid20a/amid20a.pdf}, url = {http://proceedings.mlr.press/v125/amid20a.html}, abstract = { The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent updates. We obtain this result by rewriting the non-negative weights $w_i$ of multiplicative updates by $u_i^2$ and then performing a gradient descent step w.r.t. the new $u_i$ parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights $w_i$ are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the $u_i$ parameters are now divided by $\Vert\mathbf{u}\Vert_2$ after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit in some cases, with slightly different constants). As a side, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can experimentally solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.} }
Endnote
%0 Conference Paper %T Winnowing with Gradient Descent %A Ehsan Amid %A Manfred K. Warmuth %B Proceedings of Thirty Third Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2020 %E Jacob Abernethy %E Shivani Agarwal %F pmlr-v125-amid20a %I PMLR %J Proceedings of Machine Learning Research %P 163--182 %U http://proceedings.mlr.press %V 125 %W PMLR %X The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent updates. We obtain this result by rewriting the non-negative weights $w_i$ of multiplicative updates by $u_i^2$ and then performing a gradient descent step w.r.t. the new $u_i$ parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights $w_i$ are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the $u_i$ parameters are now divided by $\Vert\mathbf{u}\Vert_2$ after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit in some cases, with slightly different constants). As a side, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can experimentally solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.
APA
Amid, E. & Warmuth, M.K.. (2020). Winnowing with Gradient Descent. Proceedings of Thirty Third Conference on Learning Theory, in PMLR 125:163-182

Related Material