Data complexity and classification accuracy correlation in oversampling algorithms

Joanna Komorniczak, Paweł Ksieniewicz, Michał Woźniak
Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 183:175-186, 2022.

Abstract

Purpose: This work proposes the hypothesis that data oversampling may lead to dataset simplification according to selected data difficulty metrics and that such simplification positively affects the quality of selected classifier learning methods. Methods: A set of computer experiments was performed for 47 benchmark datasets to make the hypothesis plausible. The experiments considered five oversampling methods, five classifiers, and 22 metrics for data difficulty assessment. The experiments aim to establish: (a) whether there is a relationship between resampling and change in the difficulty of the training data and (b) whether there is a relationship between changes in the values of training set difficulty metrics and classification quality. Results: Based on the obtained results, the research hypothesis was confirmed. It was indicated which measures correlate with selected classifiers. The experiments showed the relationship between the change of assessed difficulty measures after oversampling and the classification quality of selected models. Conclusion: The obtained results allow using the selected measures to predict whether a given oversampling method leads to favorable modifications of the learning set for a given type of classifier. The demonstrated relationship between difficulty measures and classification quality will allow using these measures as a learning criterion. For example, guided oversampling can treat the modification of the learning set as an optimization task. During the oversampling process, no estimation of classification quality metrics will be required, but only an evaluation of the training set difficulty. This may contribute to the proposition of computationally efficient methods.
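The pipeline the abstract describes — measure training-set difficulty, oversample, re-measure, and relate the change to classification quality — can be illustrated with a minimal sketch. This is not the authors' code: the paper evaluates 22 measures and five oversamplers, whereas the sketch below uses only one well-known complexity measure (Fisher's discriminant ratio, the F1 measure) and the simplest possible oversampler (random duplication of minority samples); all function names here are hypothetical.

```python
import random
from statistics import mean, pvariance

def fisher_ratio(X, y):
    """F1 complexity measure: the maximum per-feature Fisher's discriminant
    ratio over the two classes (larger values = easier dataset)."""
    best = 0.0
    for j in range(len(X[0])):
        c0 = [x[j] for x, c in zip(X, y) if c == 0]
        c1 = [x[j] for x, c in zip(X, y) if c == 1]
        denom = pvariance(c0) + pvariance(c1)
        if denom == 0:
            return float("inf")  # perfectly separated feature
        best = max(best, (mean(c0) - mean(c1)) ** 2 / denom)
    return best

def random_oversample(X, y, seed=0):
    """Balance the classes by duplicating random minority (class 1) samples."""
    rng = random.Random(seed)
    maj = [x for x, c in zip(X, y) if c == 0]
    mino = [x for x, c in zip(X, y) if c == 1]
    while len(mino) < len(maj):
        mino.append(rng.choice(mino))
    return maj + mino, [0] * len(maj) + [1] * len(mino)

# Toy imbalanced dataset: 8 majority vs 2 minority one-feature samples.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0)] for _ in range(8)] + [[rng.gauss(3.0, 1.0)] for _ in range(2)]
y = [0] * 8 + [1] * 2

before = fisher_ratio(X, y)
Xr, yr = random_oversample(X, y)
after = fisher_ratio(Xr, yr)
print(f"F1 before oversampling: {before:.3f}, after: {after:.3f}")
```

In the paper's framing, the change from `before` to `after` would then be correlated (across many datasets) with the change in a classifier's quality, which is what lets the difficulty measure stand in for expensive classification-quality estimation during guided oversampling.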

Cite this Paper


BibTeX
@InProceedings{pmlr-v183-komorniczak22a,
  title     = {Data complexity and classification accuracy correlation in oversampling algorithms},
  author    = {Komorniczak, Joanna and Ksieniewicz, Pawe\l{} and Wo\'{z}niak, Micha\l{}},
  booktitle = {Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages     = {175--186},
  year      = {2022},
  editor    = {Moniz, Nuno and Branco, Paula and Torgo, Luís and Japkowicz, Nathalie and Wozniak, Michal and Wang, Shuo},
  volume    = {183},
  series    = {Proceedings of Machine Learning Research},
  month     = {23 Sep},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v183/komorniczak22a/komorniczak22a.pdf},
  url       = {https://proceedings.mlr.press/v183/komorniczak22a.html},
  abstract  = {Purpose: This work proposes the hypothesis that data oversampling may lead to dataset simplification according to selected data difficulty metrics and that such simplification positively affects the quality of selected classifier learning methods. Methods: A set of computer experiments was performed for 47 benchmark datasets to make the hypothesis plausible. The experiments considered five oversampling methods, five classifiers, and 22 metrics for data difficulty assessment. The experiments aim to establish: (a) whether there is a relationship between resampling and change in the difficulty of the training data and (b) whether there is a relationship between changes in the values of training set difficulty metrics and classification quality. Results: Based on the obtained results, the research hypothesis was confirmed. It was indicated which measures correlate with selected classifiers. The experiments showed the relationship between the change of assessed difficulty measures after oversampling and the classification quality of selected models. Conclusion: The obtained results allow using the selected measures to predict whether a given oversampling method leads to favorable modifications of the learning set for a given type of classifier. Showed relationship between difficulty measures and classification will allow using the mentioned measures as a learning criterion. For example, guided oversampling can treat the modification of the learning set as an optimization task. During the oversampling process, no estimation of classification quality metrics will be required, but only an evaluation of the training set difficulty. This may contribute to the proposition of computationally efficient methods.}
}
Endnote
%0 Conference Paper
%T Data complexity and classification accuracy correlation in oversampling algorithms
%A Joanna Komorniczak
%A Paweł Ksieniewicz
%A Michał Woźniak
%B Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications
%C Proceedings of Machine Learning Research
%D 2022
%E Nuno Moniz
%E Paula Branco
%E Luís Torgo
%E Nathalie Japkowicz
%E Michal Wozniak
%E Shuo Wang
%F pmlr-v183-komorniczak22a
%I PMLR
%P 175--186
%U https://proceedings.mlr.press/v183/komorniczak22a.html
%V 183
%X Purpose: This work proposes the hypothesis that data oversampling may lead to dataset simplification according to selected data difficulty metrics and that such simplification positively affects the quality of selected classifier learning methods. Methods: A set of computer experiments was performed for 47 benchmark datasets to make the hypothesis plausible. The experiments considered five oversampling methods, five classifiers, and 22 metrics for data difficulty assessment. The experiments aim to establish: (a) whether there is a relationship between resampling and change in the difficulty of the training data and (b) whether there is a relationship between changes in the values of training set difficulty metrics and classification quality. Results: Based on the obtained results, the research hypothesis was confirmed. It was indicated which measures correlate with selected classifiers. The experiments showed the relationship between the change of assessed difficulty measures after oversampling and the classification quality of selected models. Conclusion: The obtained results allow using the selected measures to predict whether a given oversampling method leads to favorable modifications of the learning set for a given type of classifier. Showed relationship between difficulty measures and classification will allow using the mentioned measures as a learning criterion. For example, guided oversampling can treat the modification of the learning set as an optimization task. During the oversampling process, no estimation of classification quality metrics will be required, but only an evaluation of the training set difficulty. This may contribute to the proposition of computationally efficient methods.
APA
Komorniczak, J., Ksieniewicz, P. & Woźniak, M. (2022). Data complexity and classification accuracy correlation in oversampling algorithms. Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research 183:175-186. Available from https://proceedings.mlr.press/v183/komorniczak22a.html.