DBSCAN++: Towards fast and scalable density clustering

Jennifer Jang; Heinrich Jiang

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:3019-3029, 2019.

Abstract

DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.
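
As a rough illustration of the idea in the abstract (estimating densities only for a chosen subset of points), the following is a minimal sketch and not the authors' implementation. It makes several assumptions that the abstract does not state: the subset is a uniform random sample of size m, neighborhoods are found by brute-force Euclidean search, and a sampled core point is linked to every point in its eps-ball, so two core points that share a neighbor fall into one cluster. The function name dbscan_pp_sketch and all parameter names are hypothetical.

```python
import math
import random


def dbscan_pp_sketch(points, eps, min_pts, m, seed=0):
    """Return one label per point; -1 marks noise. Hypothetical helper."""
    n = len(points)
    rng = random.Random(seed)
    # Assumption: the "chosen subset" is a uniform random sample of m indices.
    sample = rng.sample(range(n), m)

    def neighbors(i):
        # Brute-force eps-ball query against the full dataset: O(n) per call.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    # Density is estimated only for sampled points: m queries instead of n.
    ball = {i: neighbors(i) for i in sample}
    core = [i for i in sample if len(ball[i]) >= min_pts]

    # Union-find over all points, used to merge each core point with its ball.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    touched = set()
    for c in core:
        for j in ball[c]:
            parent[find(j)] = find(c)
            touched.add(j)

    # Relabel roots as 0, 1, 2, ...; points near no sampled core stay noise.
    labels, ids = [-1] * n, {}
    for i in sorted(touched):
        labels[i] = ids.setdefault(find(i), len(ids))
    return labels


if __name__ == "__main__":
    clump_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
    clump_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
    data = clump_a + clump_b + [(20.0, 20.0)]  # last point is an outlier
    print(dbscan_pp_sketch(data, eps=0.5, min_pts=3, m=4))
```

With m equal to the number of points, every point has its density estimated and the sketch behaves like a plain brute-force density clustering. Choosing m smaller cuts the number of neighborhood queries from n to m, which is the cost saving the abstract refers to; the statistical guarantees discussed there are not reproduced by this sketch.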

Cite this Paper

BibTeX


@InProceedings{pmlr-v97-jang19a,
  title = 	 {{DBSCAN}++: Towards fast and scalable density clustering},
  author =       {Jang, Jennifer and Jiang, Heinrich},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {3019--3029},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/jang19a/jang19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/jang19a.html},
  abstract = 	 {DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.}
}

Endnote

%0 Conference Paper
%T DBSCAN++: Towards fast and scalable density clustering
%A Jennifer Jang
%A Heinrich Jiang
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-jang19a
%I PMLR
%P 3019--3029
%U https://proceedings.mlr.press/v97/jang19a.html
%V 97
%X DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest.

APA


Jang, J. & Jiang, H. (2019). DBSCAN++: Towards fast and scalable density clustering. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:3019-3029. Available from https://proceedings.mlr.press/v97/jang19a.html.
