Identification of mixtures of discrete product distributions in near-optimal sample and time complexity

Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:2071-2091, 2024.

Abstract

We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition with (b) a novel way of bounding the condition number of key matrices, called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
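To make the setting concrete, here is a minimal sketch of the generative model the abstract describes: a mixture of $k$ product distributions over $n$ binary variables, with $n = 2k-1$ (the identifiability threshold mentioned above). All parameter values and names below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative parameters (not from the paper): k components, n = 2k - 1 variables.
rng = np.random.default_rng(0)
k = 3
n = 2 * k - 1
pi = np.array([0.5, 0.3, 0.2])           # mixing weights, sum to 1
P = rng.uniform(0.1, 0.9, size=(k, n))   # P[j, i] = Pr[X_i = 1 | component j]

def sample(num):
    """Draw `num` i.i.d. samples from the mixture of product distributions."""
    comps = rng.choice(k, size=num, p=pi)            # latent component per sample
    return (rng.random((num, n)) < P[comps]).astype(int)

X = sample(100_000)
# Sanity check: the empirical marginal of each X_i should approach
# the mixture marginal sum_j pi[j] * P[j, i].
print(np.allclose(X.mean(axis=0), pi @ P, atol=0.02))
```

The identification problem is the inverse direction: recover `pi` and `P` (up to permutation of components) from statistics of `X` alone, without observing the latent component labels.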

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-gordon24a,
  title     = {Identification of mixtures of discrete product distributions in near-optimal sample and time complexity},
  author    = {Gordon, Spencer L. and Jahn, Erik and Mazaheri, Bijan and Rabani, Yuval and Schulman, Leonard J.},
  booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages     = {2071--2091},
  year      = {2024},
  editor    = {Agrawal, Shipra and Roth, Aaron},
  volume    = {247},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--03 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v247/gordon24a/gordon24a.pdf},
  url       = {https://proceedings.mlr.press/v247/gordon24a.html},
  abstract  = {We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition with (b) a novel way of bounding the condition number of key matrices, called Hadamard extensions, by studying their action only on flattened rank-1 tensors.}
}
Endnote
%0 Conference Paper
%T Identification of mixtures of discrete product distributions in near-optimal sample and time complexity
%A Spencer L. Gordon
%A Erik Jahn
%A Bijan Mazaheri
%A Yuval Rabani
%A Leonard J. Schulman
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth
%F pmlr-v247-gordon24a
%I PMLR
%P 2071--2091
%U https://proceedings.mlr.press/v247/gordon24a.html
%V 247
%X We consider the problem of \emph{identifying,} from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/\zeta)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $\zeta$). The best known lower bound was $\exp(\Omega(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$. We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$. Our results are obtained by combining (a) a classic method for robust tensor decomposition with (b) a novel way of bounding the condition number of key matrices, called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
APA
Gordon, S.L., Jahn, E., Mazaheri, B., Rabani, Y. & Schulman, L.J. (2024). Identification of mixtures of discrete product distributions in near-optimal sample and time complexity. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:2071-2091. Available from https://proceedings.mlr.press/v247/gordon24a.html.