On Efficient Low Distortion Ultrametric Embedding

Vincent Cohen-Addad; Karthik C. S.; Guillaume Lagarde

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:2078-2088, 2020.

Abstract

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of $n$ points in $\Omega(\log n)$ dimensions incurs a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c\ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho>1$) to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the best ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde\Delta$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\underset{u,v \in P}{\max} \nicefrac{\tilde\Delta(u,v)}{\|u-v\|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\varepsilon>0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in $\ell_\infty$-metric that admit isometric embedding and those that incur a distortion of $\nicefrac{3}{2}$. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
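To make the two quantities in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It checks the ultrametric (strong triangle) inequality $\Delta(u,w) \le \max(\Delta(u,v), \Delta(v,w))$ and evaluates the distortion $\max_{u,v} \Delta(u,v)/\|u-v\|_2$ by brute force over all pairs. The function names `is_ultrametric` and `max_distortion`, the three-point example, and the hand-chosen matrix `delta` are illustrative assumptions; the points are assumed to be distinct so that no distance is zero.

```python
import itertools
import math


def is_ultrametric(delta, tol=1e-12):
    """Check delta[u][w] <= max(delta[u][v], delta[v][w]) for all triples."""
    n = len(delta)
    for u, v, w in itertools.permutations(range(n), 3):
        if delta[u][w] > max(delta[u][v], delta[v][w]) + tol:
            return False
    return True


def max_distortion(points, delta):
    """Largest ratio delta(u, v) / ||u - v||_2 over all pairs (brute force)."""
    pairs = itertools.combinations(range(len(points)), 2)
    return max(delta[u][v] / math.dist(points[u], points[v]) for u, v in pairs)


# Hypothetical toy input: three distinct points on a line.
points = [(0.0,), (1.0,), (3.0,)]

# Hand-chosen candidate ultrametric: the first two points merge at height 1,
# and the third point joins them at height 3.
delta = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 3.0],
    [3.0, 3.0, 0.0],
]

# The Euclidean distances themselves, for comparison.
euclid = [[math.dist(p, q) for q in points] for p in points]

print(is_ultrametric(delta))          # the candidate satisfies the strong triangle inequality
print(is_ultrametric(euclid))         # the raw l2 distances do not (3 > max(1, 2))
print(max_distortion(points, delta))  # worst pairwise stretch of the candidate
```

The pairwise loop in `max_distortion` touches every pair once, which is the quadratic cost the abstract refers to as prohibitive on large inputs.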

Cite this Paper

BibTeX


@InProceedings{pmlr-v119-cohen-addad20a,
  title = 	 {On Efficient Low Distortion Ultrametric Embedding},
  author =       {Cohen-Addad, Vincent and {Karthik C. S.} and Lagarde, Guillaume},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {2078--2088},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/cohen-addad20a/cohen-addad20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/cohen-addad20a.html},
  abstract = 	 {A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of $n$ points in $\Omega(\log n)$ dimensions incurs a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\R^d$, and for every $c\ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho>1$) to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the best ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde\Delta$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\underset{u,v \in P}{\max} \nicefrac{\tilde\Delta(u,v)}{\|u-v\|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\varepsilon>0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in $\ell_\infty$-metric that admit isometric embedding and those that incur a distortion of $\nicefrac{3}{2}$. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.}
}

Endnote

%0 Conference Paper
%T On Efficient Low Distortion Ultrametric Embedding
%A Vincent Cohen-Addad
%A Karthik C. S.
%A Guillaume Lagarde
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-cohen-addad20a
%I PMLR
%P 2078--2088
%U https://proceedings.mlr.press/v119/cohen-addad20a.html
%V 119
%X A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric, but computing such an embedding on a data set of $n$ points in $\Omega(\log n)$ dimensions incurs a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\R^d$, and for every $c\ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho>1$) to output an ultrametric $\Delta$ such that for any two points $u,v$ in $P$, we have $\Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the best ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde\Delta$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\underset{u,v \in P}{\max} \nicefrac{\tilde\Delta(u,v)}{\|u-v\|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\varepsilon>0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in $\ell_\infty$-metric that admit isometric embedding and those that incur a distortion of $\nicefrac{3}{2}$. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.

APA


Cohen-Addad, V., Karthik C. S., & Lagarde, G. (2020). On Efficient Low Distortion Ultrametric Embedding. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:2078-2088. Available from https://proceedings.mlr.press/v119/cohen-addad20a.html.
