Rk-means: Fast Clustering for Relational Data

Ryan Curtin, Benjamin Moseley, Hung Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:2742-2752, 2020.

Abstract

Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small grid coreset of the data matrix for subsequent cluster construction. This gives a constant approximation for the k-means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-curtin20a, title = {Rk-means: Fast Clustering for Relational Data}, author = {Curtin, Ryan and Moseley, Benjamin and Ngo, Hung and Nguyen, XuanLong and Olteanu, Dan and Schleich, Maximilian}, booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics}, pages = {2742--2752}, year = {2020}, editor = {Chiappa, Silvia and Calandra, Roberto}, volume = {108}, series = {Proceedings of Machine Learning Research}, month = {26--28 Aug}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v108/curtin20a/curtin20a.pdf}, url = {https://proceedings.mlr.press/v108/curtin20a.html}, abstract = {Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small grid coreset of the data matrix for subsequent cluster construction. This gives a constant approximation for the k-means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.} }
Endnote
%0 Conference Paper %T Rk-means: Fast Clustering for Relational Data %A Ryan Curtin %A Benjamin Moseley %A Hung Ngo %A XuanLong Nguyen %A Dan Olteanu %A Maximilian Schleich %B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2020 %E Silvia Chiappa %E Roberto Calandra %F pmlr-v108-curtin20a %I PMLR %P 2742--2752 %U https://proceedings.mlr.press/v108/curtin20a.html %V 108 %X Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small grid coreset of the data matrix for subsequent cluster construction. This gives a constant approximation for the k-means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.
APA
Curtin, R., Moseley, B., Ngo, H., Nguyen, X., Olteanu, D. & Schleich, M.. (2020). Rk-means: Fast Clustering for Relational Data. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:2742-2752 Available from https://proceedings.mlr.press/v108/curtin20a.html.

Related Material