An Analysis of the tSNE Algorithm for Data Visualization
[edit]
Proceedings of the 31st Conference On Learning Theory, PMLR 75:14551462, 2018.
Abstract
A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard JohnsonLindenstrauss dimensionality reduction does not produce data visualizations. The \emph{tSNE} heuristic of van der Maaten and Hinton, which is based on nonconvex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of tSNE under a natural, deterministic condition on the “groundtruth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on tSNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of wellseparated logconcave distributions. Finally, we give theoretical evidence that tSNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.
Related Material


