[edit]
A Distribution Testing Approach to Clustering Distributions
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4308-4348, 2026.
Abstract
We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into $2$ groups, such that the distributions within each group are the same, and the distributions associated with the clusters are pairwise $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster’s distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity’s dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes. In addition, we show that this result extends to the case of $d$-clustering for any constant number of clusters $d$.