Shampoo: Preconditioned Stochastic Tensor Optimization

Vineet Gupta; Tomer Koren; Yoram Singer

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1842-1850, 2018.

Abstract

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Surprisingly, although it involves a more complex update rule, Shampoo’s runtime per step is comparable in practice to that of simple gradient methods such as SGD, AdaGrad, and Adam.
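The abstract describes the preconditioners only in words. As a rough illustration, the sketch below assumes the matrix (order-2 tensor) case, where one preconditioner accumulates G G^T for the row dimension, another accumulates G^T G for the column dimension, and each is applied through an inverse fourth root. The function names, the learning rate `lr`, the initial scale `eps`, and the toy quadratic objective are illustrative assumptions, not part of this page or of any released implementation.

```python
import numpy as np

def inv_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    # Scale each eigenvector column by its eigenvalue raised to -1/p.
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

def shampoo_matrix_step(W, G, L, R, lr=0.1):
    """One preconditioned step for a matrix parameter W with gradient G."""
    L = L + G @ G.T   # m x m statistics: contracts over the column dimension
    R = R + G.T @ G   # n x n statistics: contracts over the row dimension
    W = W - lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R

def demo(steps=200, m=5, n=3, lr=0.1, eps=1e-4, seed=0):
    """Minimize the toy objective 0.5 * ||W - target||_F^2 (an assumption)."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=(m, n))
    W = np.zeros((m, n))
    L, R = eps * np.eye(m), eps * np.eye(n)
    losses = []
    for _ in range(steps):
        G = W - target                      # gradient of the toy objective
        losses.append(0.5 * np.sum(G ** 2))
        W, L, R = shampoo_matrix_step(W, G, L, R, lr)
    return losses

if __name__ == "__main__":
    losses = demo()
    print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

For an m x n parameter this stores an m x m and an n x n matrix rather than a single mn x mn preconditioner, which is the storage saving the abstract alludes to. The eigendecomposition in `inv_root` is the simplest way to take the root in a sketch and is not tuned for speed.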

Cite this Paper

BibTeX


@InProceedings{pmlr-v80-gupta18a,
  title = 	 {Shampoo: Preconditioned Stochastic Tensor Optimization},
  author =       {Gupta, Vineet and Koren, Tomer and Singer, Yoram},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {1842--1850},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/gupta18a/gupta18a.pdf},
  url = 	 {https://proceedings.mlr.press/v80/gupta18a.html},
  abstract = 	 {Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Surprisingly, although it involves a more complex update rule, Shampoo’s runtime per step is comparable in practice to that of simple gradient methods such as SGD, AdaGrad, and Adam.}
}

Endnote

%0 Conference Paper
%T Shampoo: Preconditioned Stochastic Tensor Optimization
%A Vineet Gupta
%A Tomer Koren
%A Yoram Singer
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-gupta18a
%I PMLR
%P 1842--1850
%U https://proceedings.mlr.press/v80/gupta18a.html
%V 80
%X Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Surprisingly, although it involves a more complex update rule, Shampoo’s runtime per step is comparable in practice to that of simple gradient methods such as SGD, AdaGrad, and Adam.

APA


Gupta, V., Koren, T. & Singer, Y. (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:1842-1850. Available from https://proceedings.mlr.press/v80/gupta18a.html.
