An Analysis of the t-SNE Algorithm for Data Visualization

Sanjeev Arora; Wei Hu; Pravesh K. Kothari

An Analysis of the t-SNE Algorithm for Data Visualization

Sanjeev Arora, Wei Hu, Pravesh K. Kothari

Proceedings of the 31st Conference On Learning Theory, PMLR 75:1455-1462, 2018.

Abstract

A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2-dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The \emph{t-SNE} heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.

Cite this Paper

BibTeX


@InProceedings{pmlr-v75-arora18a,
  title = 	 {An Analysis of the t-SNE Algorithm for Data Visualization},
  author =       {Arora, Sanjeev and Hu, Wei and Kothari, Pravesh K.},
  booktitle = 	 {Proceedings of the 31st  Conference On Learning Theory},
  pages = 	 {1455--1462},
  year = 	 {2018},
  editor = 	 {Bubeck, Sébastien and Perchet, Vianney and Rigollet, Philippe},
  volume = 	 {75},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v75/arora18a/arora18a.pdf},
  url = 	 {https://proceedings.mlr.press/v75/arora18a.html},
  abstract = 	 {A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2-dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The \emph{t-SNE} heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations.  We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.}
}

Endnote

%0 Conference Paper
%T An Analysis of the t-SNE Algorithm for Data Visualization
%A Sanjeev Arora
%A Wei Hu
%A Pravesh K. Kothari
%B Proceedings of the 31st  Conference On Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Sébastien Bubeck
%E Vianney Perchet
%E Philippe Rigollet	
%F pmlr-v75-arora18a
%I PMLR
%P 1455--1462
%U https://proceedings.mlr.press/v75/arora18a.html
%V 75
%X A first line of attack in exploratory data analysis is \emph{data visualization}, i.e., generating a 2-dimensional representation of data that makes \emph{clusters} of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The \emph{t-SNE} heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the \emph{de facto} standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization – finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data. These are the first provable guarantees on t-SNE for constructing good data visualizations.  We show that our deterministic condition is satisfied by considerably general probabilistic generative models for clusterable data such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in \emph{partially} recovering cluster structure even when the above deterministic condition is not met.

APA


Arora, S., Hu, W. & Kothari, P.K.. (2018). An Analysis of the t-SNE Algorithm for Data Visualization. Proceedings of the 31st  Conference On Learning Theory, in Proceedings of Machine Learning Research 75:1455-1462 Available from https://proceedings.mlr.press/v75/arora18a.html.

Related Material

Download PDF