$h$-DBSCAN: A simple fast DBSCAN algorithm for big data

Shaoyuan Weng; Jin Gou; Zongwen Fan

$h$-DBSCAN: A simple fast DBSCAN algorithm for big data

Shaoyuan Weng, Jin Gou, Zongwen Fan

Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:81-96, 2021.

Abstract

DBSCAN is a classical clustering algorithm, which can identify different shapes and isolate noisy patterns from a dataset. Despite the above advantages, the bottleneck of DBSCAN is its computation time for high dimensional datasets. This work, thus, presents a simple and fast method to improve the efficiency of DBSCAN algorithm. We reduce the execution time in two aspects. The first one is to reduce the number of points presented to DBSCAN and the second one is to apply the HNSW technique instead of the linear search structure for improving its efficiency. The experimental results show that our proposed algorithm can greatly improve the clustering speed without losing or even obtaining better accuracy, especially for large-scale datasets.

Cite this Paper

BibTeX


@InProceedings{pmlr-v157-weng21a,
  title = 	 {$h$-DBSCAN: A simple fast DBSCAN algorithm for big data},
  author =       {Weng, Shaoyuan and Gou, Jin and Fan, Zongwen},
  booktitle = 	 {Proceedings of The 13th Asian Conference on Machine Learning},
  pages = 	 {81--96},
  year = 	 {2021},
  editor = 	 {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume = 	 {157},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v157/weng21a/weng21a.pdf},
  url = 	 {https://proceedings.mlr.press/v157/weng21a.html},
  abstract = 	 {DBSCAN is a classical clustering algorithm, which can identify different shapes and isolate noisy patterns from a dataset. Despite the above advantages, the bottleneck of DBSCAN is its computation time for high dimensional datasets. This work, thus, presents a simple and fast method to improve the efficiency of DBSCAN algorithm. We reduce the execution time in two aspects. The first one is to reduce the number of points presented to DBSCAN and the second one is to apply the HNSW technique instead of the linear search structure for improving its efficiency. The experimental results show that our proposed algorithm can greatly improve the clustering speed without losing or even obtaining better accuracy, especially for large-scale datasets. }
}

Endnote

%0 Conference Paper
%T $h$-DBSCAN: A simple fast DBSCAN algorithm for big data
%A Shaoyuan Weng
%A Jin Gou
%A Zongwen Fan
%B Proceedings of The 13th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Vineeth N. Balasubramanian
%E Ivor Tsang	
%F pmlr-v157-weng21a
%I PMLR
%P 81--96
%U https://proceedings.mlr.press/v157/weng21a.html
%V 157
%X DBSCAN is a classical clustering algorithm, which can identify different shapes and isolate noisy patterns from a dataset. Despite the above advantages, the bottleneck of DBSCAN is its computation time for high dimensional datasets. This work, thus, presents a simple and fast method to improve the efficiency of DBSCAN algorithm. We reduce the execution time in two aspects. The first one is to reduce the number of points presented to DBSCAN and the second one is to apply the HNSW technique instead of the linear search structure for improving its efficiency. The experimental results show that our proposed algorithm can greatly improve the clustering speed without losing or even obtaining better accuracy, especially for large-scale datasets.

APA


Weng, S., Gou, J. & Fan, Z.. (2021). $h$-DBSCAN: A simple fast DBSCAN algorithm for big data. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:81-96 Available from https://proceedings.mlr.press/v157/weng21a.html.

Related Material

Download PDF