[edit]
Winnowing with Gradient Descent
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:163-182, 2020.
Abstract
The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent updates. We obtain this result by rewriting the non-negative weights $w_i$ of multiplicative updates by $u_i^2$ and then performing a gradient descent step w.r.t. the new $u_i$ parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights $w_i$ are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the $u_i$ parameters are now divided by $\Vert\mathbf{u}\Vert_2$ after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit in some cases, with slightly different constants). As a side, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can experimentally solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.