A Computation-Efficient Method of Measuring Dataset Quality based on the Coverage of the Dataset

Beomjun Kim, Jaehwan Kim, Kangyeon Kim, Sunwoo Kim, Heejin Ahn
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:4744-4752, 2025.

Abstract

Evaluating dataset quality is an essential task, as the performance of artificial intelligence (AI) systems heavily depends on it. A traditional method for evaluating dataset quality involves training an AI model on the dataset and testing it on a separate test set. However, this approach requires significant computational time. In this paper, we propose a computationally efficient method for quantifying dataset quality. Specifically, our method measures how well the dataset covers the input probability distribution, ensuring that a high-quality dataset minimizes out-of-distribution inputs. We present a GPU-accelerated algorithm for approximately implementing the proposed method. We highlight three applications of our approach. First, it can evaluate the impact of data management practices, such as data cleaning and core set selection. We experimentally demonstrate that the quality assessment provided by our method strongly correlates with the traditional approach, achieving an $R^2 \geq 0.985$ in most cases while being 60-1200 times faster. Second, it can monitor the quality of continuously growing datasets with computation time proportional to the added data size. Finally, our method can estimate the performance of traditional methods for large datasets.
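
This page does not reproduce the paper's formal definition of coverage or its GPU-accelerated algorithm, so the following is only a minimal sketch of the general idea under assumed definitions: coverage is approximated here as the fraction of reference samples (drawn from the input distribution) whose nearest neighbour in the dataset lies within a radius r, with pairwise distances computed in batches on the GPU via torch.cdist. The function name coverage, the radius parameter, and the batching scheme are illustrative assumptions, not the authors' implementation.

import torch

def coverage(dataset: torch.Tensor, samples: torch.Tensor,
             radius: float, batch_size: int = 4096) -> float:
    """Fraction of `samples` whose nearest point in `dataset` is within `radius`.

    dataset: (N, d) tensor of dataset points (e.g. feature embeddings).
    samples: (M, d) tensor of reference points drawn from the input distribution.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dataset = dataset.to(device)
    covered = 0
    for start in range(0, samples.shape[0], batch_size):
        batch = samples[start:start + batch_size].to(device)
        # Pairwise Euclidean distances of shape (batch, N), reduced to each
        # sample's nearest-neighbour distance; batching bounds GPU memory use.
        nn_dist = torch.cdist(batch, dataset).min(dim=1).values
        covered += (nn_dist <= radius).sum().item()
    return covered / samples.shape[0]

if __name__ == "__main__":
    d = torch.randn(10_000, 64)   # stand-in dataset embeddings
    x = torch.randn(2_000, 64)    # stand-in samples from the input distribution
    print(f"approximate coverage: {coverage(d, x, radius=10.0):.3f}")

Under this assumed definition, a score of 1.0 would mean every reference sample lies near some dataset point (no out-of-distribution inputs at radius r). Appending new data can only shrink nearest-neighbour distances, so the cached distances would only need to be compared against the newly added points, which would be consistent with the incremental-monitoring use case described above.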

Cite this Paper

BibTeX
@InProceedings{pmlr-v258-kim25f,
  title = {A Computation-Efficient Method of Measuring Dataset Quality based on the Coverage of the Dataset},
  author = {Kim, Beomjun and Kim, Jaehwan and Kim, Kangyeon and Kim, Sunwoo and Ahn, Heejin},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {4744--4752},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/kim25f/kim25f.pdf},
  url = {https://proceedings.mlr.press/v258/kim25f.html},
  abstract = {Evaluating dataset quality is an essential task, as the performance of artificial intelligence (AI) systems heavily depends on it. A traditional method for evaluating dataset quality involves training an AI model on the dataset and testing it on a separate test set. However, this approach requires significant computational time. In this paper, we propose a computationally efficient method for quantifying dataset quality. Specifically, our method measures how well the dataset covers the input probability distribution, ensuring that a high-quality dataset minimizes out-of-distribution inputs. We present a GPU-accelerated algorithm for approximately implementing the proposed method. We highlight three applications of our approach. First, it can evaluate the impact of data management practices, such as data cleaning and core set selection. We experimentally demonstrate that the quality assessment provided by our method strongly correlates with the traditional approach, achieving an $R^2 \geq 0.985$ in most cases while being 60-1200 times faster. Second, it can monitor the quality of continuously growing datasets with computation time proportional to the added data size. Finally, our method can estimate the performance of traditional methods for large datasets.}
}
Endnote
%0 Conference Paper
%T A Computation-Efficient Method of Measuring Dataset Quality based on the Coverage of the Dataset
%A Beomjun Kim
%A Jaehwan Kim
%A Kangyeon Kim
%A Sunwoo Kim
%A Heejin Ahn
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-kim25f
%I PMLR
%P 4744--4752
%U https://proceedings.mlr.press/v258/kim25f.html
%V 258
%X Evaluating dataset quality is an essential task, as the performance of artificial intelligence (AI) systems heavily depends on it. A traditional method for evaluating dataset quality involves training an AI model on the dataset and testing it on a separate test set. However, this approach requires significant computational time. In this paper, we propose a computationally efficient method for quantifying dataset quality. Specifically, our method measures how well the dataset covers the input probability distribution, ensuring that a high-quality dataset minimizes out-of-distribution inputs. We present a GPU-accelerated algorithm for approximately implementing the proposed method. We highlight three applications of our approach. First, it can evaluate the impact of data management practices, such as data cleaning and core set selection. We experimentally demonstrate that the quality assessment provided by our method strongly correlates with the traditional approach, achieving an $R^2 \geq 0.985$ in most cases while being 60-1200 times faster. Second, it can monitor the quality of continuously growing datasets with computation time proportional to the added data size. Finally, our method can estimate the performance of traditional methods for large datasets.
APA
Kim, B., Kim, J., Kim, K., Kim, S. & Ahn, H. (2025). A Computation-Efficient Method of Measuring Dataset Quality based on the Coverage of the Dataset. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:4744-4752. Available from https://proceedings.mlr.press/v258/kim25f.html.
