Topology-aware Generalization of Decentralized SGD
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:27479-27503, 2022.
Abstract
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(m/N + 1/m + \lambda^2)$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the number of workers, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}\big(1/N + ((m^{-1}\lambda^2)^{\frac{\alpha}{2}} + m^{-\alpha})/N^{1-\frac{\alpha}{2}}\big)$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and it explains why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at \url{https://github.com/Raiden-Zhu/Generalization-of-DSGD}.
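To make the quantities in the abstract concrete, the following is a minimal sketch (not the paper's released code) of vanilla D-SGD on a toy least-squares problem: each worker takes a local stochastic gradient step and then gossip-averages with its neighbours through a doubly stochastic mixing matrix $W$, whose second-largest eigenvalue magnitude $\lambda$ determines the spectral gap $1-\lambda$. All names here (`ring_mixing_matrix`, `spectral_gap`, `d_sgd`, the ring topology, and the step size) are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

def ring_mixing_matrix(m):
    """Doubly stochastic mixing matrix for a ring of m >= 3 workers
    (each worker averages itself with its two neighbours)."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 1 / 3
        W[i, (i - 1) % m] = 1 / 3
        W[i, (i + 1) % m] = 1 / 3
    return W

def spectral_gap(W):
    """1 - lambda, where lambda is the second-largest eigenvalue
    magnitude of W; a larger gap means a better-connected topology."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eigvals[1]

def d_sgd(W, local_data, steps=200, lr=0.05, rng=None):
    """Vanilla D-SGD: every worker keeps a local model, takes a
    stochastic gradient step on its own data, then gossip-averages
    with its neighbours via W. Returns the consensus (averaged) model."""
    rng = rng if rng is not None else np.random.default_rng(0)
    m = W.shape[0]
    d = local_data[0][0].shape[1]
    X = np.zeros((m, d))                      # one local model per worker
    for _ in range(steps):
        grads = np.zeros_like(X)
        for i, (A, b) in enumerate(local_data):
            idx = rng.integers(len(b))        # sample one local data point
            a, y = A[idx], b[idx]
            grads[i] = (a @ X[i] - y) * a     # gradient of 0.5 * (a.x - y)^2
        X = W @ (X - lr * grads)              # local step, then gossip averaging
    return X.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n_per, d = 8, 50, 5
    w_true = rng.normal(size=d)
    data = []
    for _ in range(m):
        A = rng.normal(size=(n_per, d))
        b = A @ w_true + 0.1 * rng.normal(size=n_per)
        data.append((A, b))
    W = ring_mixing_matrix(m)
    print("spectral gap 1 - lambda:", spectral_gap(W))
    print("consensus model error:", np.linalg.norm(d_sgd(W, data, rng=rng) - w_true))
```

In this sketch, swapping the ring for a denser topology (e.g. a fully connected graph, where $W = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ and $\lambda = 0$) enlarges the spectral gap, which is the regime where the abstract's bounds predict better generalization.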