The Aggregation–Heterogeneity Trade-off in Federated Learning
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:5478-5502, 2023.
Abstract
Conventional wisdom in machine learning holds that the more data a model is trained on, the better it can perform. Accordingly, a plethora of federated learning methods have been developed to aggregate as many local samples as possible. Contrary to this belief, this paper shows that aggregating more data is not necessarily beneficial in the presence of heterogeneity, and reveals a fundamental trade-off between aggregation and heterogeneity in federated learning. We consider a general family of weighted $M$-estimators that interpolate between FedAvg and the purely local estimator, in which the aggregation rule is specified by the weights assigned to local samples. We derive an upper bound on the estimation error of the weighted $M$-estimators that decomposes into a bias term induced by heterogeneity and a variance term influenced by the degree of aggregation. A measure of heterogeneity, the federated smoothness $\beta$, is introduced to simplify the general result. As an important consequence, the optimal aggregation rule for each local device is to aggregate only its $\lfloor K^{2\beta/(2\beta+1)}/(n/\sigma^2)^{1/(2\beta+1)}\rfloor\vee 1$ closest neighbors among the $K$ devices, where $n$ is the local sample size and $\sigma^2$ is the noise variance. Moreover, we show that our estimator, termed FedKNN, attains the minimax optimal rate over a parameter space characterized by $\beta$. This optimal procedure depends crucially on knowledge of the neighboring structure among devices, that is, the proximity of their local parameters. Finally, we prove that without such prior knowledge no estimator can achieve a convergence rate faster than $O(\sigma^2/n)$, and hence adaptation to the neighboring structure is impossible.
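
As a rough illustration of the neighbor rule above, the following minimal Python sketch applies nearest-neighbor aggregation to scalar mean estimation: each device computes $m^* = \lfloor K^{2\beta/(2\beta+1)}/(n/\sigma^2)^{1/(2\beta+1)}\rfloor\vee 1$ and averages the local sample means of its $m^*$ closest devices. The setting (Gaussian noise, uniform weights over the selected neighbors) and the use of distances between local sample means as a stand-in for the neighboring structure are assumptions made for illustration; they are not the paper's actual FedKNN procedure, which relies on the neighboring structure being known.

import numpy as np


def optimal_num_neighbors(K, n, sigma2, beta):
    # Neighborhood size from the abstract:
    # floor(K^{2b/(2b+1)} / (n/sigma^2)^{1/(2b+1)}), taken to be at least 1.
    m = K ** (2 * beta / (2 * beta + 1)) / (n / sigma2) ** (1 / (2 * beta + 1))
    return max(int(np.floor(m)), 1)


def fedknn_sketch(local_data, sigma2, beta):
    # Hypothetical nearest-neighbor aggregation for scalar mean estimation.
    # local_data: list of K arrays of length n (samples held by each device).
    # Returns one estimate per device: the uniform average of the local sample
    # means of its m* closest devices, with closeness measured (as an
    # assumption) by the distance between local sample means.
    K, n = len(local_data), len(local_data[0])
    local_means = np.array([x.mean() for x in local_data])
    m = optimal_num_neighbors(K, n, sigma2, beta)
    estimates = np.empty(K)
    for k in range(K):
        dists = np.abs(local_means - local_means[k])
        neighbors = np.argsort(dists)[:m]  # device k is always its own nearest neighbor
        estimates[k] = local_means[neighbors].mean()
    return estimates


# Toy example: K = 50 devices, n = 20 samples each, slowly varying true means.
rng = np.random.default_rng(0)
K, n, sigma2, beta = 50, 20, 1.0, 1.0
true_means = np.sort(rng.uniform(0.0, 1.0, K))
data = [true_means[k] + np.sqrt(sigma2) * rng.standard_normal(n) for k in range(K)]
print("m* =", optimal_num_neighbors(K, n, sigma2, beta))
print("max error:", np.abs(fedknn_sketch(data, sigma2, beta) - true_means).max())

Note how the rule interpolates as described in the abstract: for large $\beta$ (nearly homogeneous devices) $m^*$ approaches $K$ and the procedure behaves like FedAvg-style full averaging, while for small $\beta$ it shrinks toward $m^* = 1$, i.e., the purely local estimator.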