EM for Mixture of Linear Regression with Clustered Data

Amirhossein Reisizadeh; Khashayar Gatmiry; Asuman Ozdaglar

EM for Mixture of Linear Regression with Clustered Data

Amirhossein Reisizadeh, Khashayar Gatmiry, Asuman Ozdaglar

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:2341-2349, 2024.

Abstract

Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples each containing n measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)}$). In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as m grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.

Cite this Paper

BibTeX

@InProceedings{pmlr-v238-reisizadeh24a,
  title = 	 {{EM} for Mixture of Linear Regression with Clustered Data},
  author =       {Reisizadeh, Amirhossein and Gatmiry, Khashayar and Ozdaglar, Asuman},
  booktitle = 	 {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {2341--2349},
  year = 	 {2024},
  editor = 	 {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume = 	 {238},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {02--04 May},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v238/reisizadeh24a/reisizadeh24a.pdf},
  url = 	 {https://proceedings.mlr.press/v238/reisizadeh24a.html},
  abstract = 	 {Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples each containing n measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)}$). In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as m grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.}
}

Endnote

%0 Conference Paper
%T EM for Mixture of Linear Regression with Clustered Data
%A Amirhossein Reisizadeh
%A Khashayar Gatmiry
%A Asuman Ozdaglar
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li	
%F pmlr-v238-reisizadeh24a
%I PMLR
%P 2341--2349
%U https://proceedings.mlr.press/v238/reisizadeh24a.html
%V 238
%X Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples each containing n measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)}$). In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as m grows up as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.

APA

Reisizadeh, A., Gatmiry, K. & Ozdaglar, A.. (2024). EM for Mixture of Linear Regression with Clustered Data. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:2341-2349 Available from https://proceedings.mlr.press/v238/reisizadeh24a.html.

Related Material

Download PDF