Identification of mixtures of discrete product distributions in near-optimal sample and time complexity

Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:2071-2091, 2024.

Abstract

We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables X1,Xn that is a mixture of k product distributions. The best previous sample complexity for nO(k) was (1/ζ)O(k2logk) (under a mild separation assumption parameterized by ζ). The best known lower bound was exp(Ω(k)). It is known that n2k1 is necessary and sufficient for identification. We show, for any n2k1, how to achieve sample complexity and run-time complexity (1/ζ)O(k). We also extend the known lower bound of eΩ(k) to match our upper bound across a broad range of ζ. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-gordon24a, title = {Identification of mixtures of discrete product distributions in near-optimal sample and time complexity}, author = {Gordon, Spencer L. and Jahn, Erik and Mazaheri, Bijan and Rabani, Yuval and Schulman, Leonard J.}, booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory}, pages = {2071--2091}, year = {2024}, editor = {Agrawal, Shipra and Roth, Aaron}, volume = {247}, series = {Proceedings of Machine Learning Research}, month = {30 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v247/gordon24a/gordon24a.pdf}, url = {https://proceedings.mlr.press/v247/gordon24a.html}, abstract = {We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1 \ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.} }
Endnote
%0 Conference Paper %T Identification of mixtures of discrete product distributions in near-optimal sample and time complexity %A Spencer L. Gordon %A Erik Jahn %A Bijan Mazaheri %A Yuval Rabani %A Leonard J. Schulman %B Proceedings of Thirty Seventh Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2024 %E Shipra Agrawal %E Aaron Roth %F pmlr-v247-gordon24a %I PMLR %P 2071--2091 %U https://proceedings.mlr.press/v247/gordon24a.html %V 247 %X We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1 \ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
APA
Gordon, S.L., Jahn, E., Mazaheri, B., Rabani, Y. & Schulman, L.J.. (2024). Identification of mixtures of discrete product distributions in near-optimal sample and time complexity. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:2071-2091 Available from https://proceedings.mlr.press/v247/gordon24a.html.

Related Material