# $D^2$: Decentralized Training over Decentralized Data

*Proceedings of the 35th International Conference on Machine Learning*, PMLR 80:4848-4856, 2018.

#### Abstract

When training a machine learning model using multiple workers, each of which collects data from its own data source, it is useful when the data collected by different workers are *unique* and *different*. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are *not too different*. In this paper, we ask the question: *Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?* We present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, “decentralized” data). The core of D$^2$ is a variance reduction extension of D-PSGD. It improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{1/3} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks in which each worker has access to only the data of a limited set of labels, and found that D$^2$ significantly outperforms D-PSGD.
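To make the difference between the two rates concrete, the following minimal sketch evaluates the leading terms of both bounds numerically. It is an illustration only, not part of the paper: the constants hidden by $O(\cdot)$ are set to 1, and it assumes the conventional reading in which $n$ is the number of workers, $T$ the number of iterations, and $\sigma$ a bound on the stochastic gradient noise within a worker (the abstract itself defines only $\zeta^2$). The function names `dpsgd_rate` and `d2_rate` are hypothetical.

```python
import math


def dpsgd_rate(sigma: float, zeta: float, n: int, T: int) -> float:
    """Leading terms of the D-PSGD bound quoted in the abstract.

    Assumption: all constants hidden by O(.) are taken to be 1.
    """
    noise_term = sigma / math.sqrt(n * T)
    # Extra term driven by the data variance zeta^2 among workers.
    heterogeneity_term = (n * zeta**2) ** (1 / 3) / T ** (2 / 3)
    return noise_term + heterogeneity_term


def d2_rate(sigma: float, n: int, T: int) -> float:
    """Leading term of the D^2 bound: no dependence on zeta."""
    return sigma / math.sqrt(n * T)


if __name__ == "__main__":
    sigma, n, T = 1.0, 4, 100  # illustrative values, not from the paper
    for zeta in (0.0, 1.0, 10.0):
        print(
            f"zeta={zeta:5.1f}  "
            f"D-PSGD bound={dpsgd_rate(sigma, zeta, n, T):.4f}  "
            f"D^2 bound={d2_rate(sigma, n, T):.4f}"
        )
```

With $\zeta = 0$ the two expressions coincide. As $\zeta$ grows, only the D-PSGD expression increases, which is the sense in which the abstract calls D$^2$ robust to data variance among workers.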