Performance and model complexity on imbalanced datasets using resampling and cost-sensitive algorithms

Jairo da Silva Freitas Junior, Paulo Henrique Pisani
Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 183:83-97, 2022.

Abstract

Imbalanced datasets occur across industries, and many applications of high economic interest, such as fraud detection and churn prediction, deal with them. Resampling is commonly used to overcome the tendency of machine learning algorithms to favor minimizing the error on the majority class, while cost-sensitive algorithms are less used. In this paper, cost-sensitive algorithms (BayesMinimumRisk, Thresholding, Cost-Sensitive Decision Tree and Cost-Sensitive Random Forest) and resampling techniques (SMOTE, SMOTETomek and TomekLinks) combined with kNN, Decision Tree, Random Forest and AdaBoost were compared on binary classification problems. The results were analyzed with respect to relative performance over different imbalance ratios. The influence of these class-imbalance handling techniques on the complexity of the machine learning models was also investigated. The experiments were performed using synthetic datasets and 90 real-world datasets.
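To illustrate the cost-sensitive decision rule behind approaches such as BayesMinimumRisk and Thresholding, the following is a minimal sketch (not from the paper) of Bayes-minimum-risk thresholding for binary classification: given misclassification costs, the optimal cutoff on the predicted positive-class probability is c_FP / (c_FP + c_FN) rather than the default 0.5. The cost values and probabilities below are illustrative assumptions.

```python
def bayes_threshold(c_fp, c_fn):
    # Bayes minimum risk for binary classification:
    # predict positive when expected cost of predicting negative exceeds
    # that of predicting positive, i.e. p * c_fn > (1 - p) * c_fp,
    # which rearranges to p > c_fp / (c_fp + c_fn).
    return c_fp / (c_fp + c_fn)

def classify(probs, c_fp, c_fn):
    # Apply the cost-sensitive cutoff to a list of predicted probabilities.
    t = bayes_threshold(c_fp, c_fn)
    return [1 if p > t else 0 for p in probs]

# Example: missing a minority-class case (FN) is 9x costlier than an FP,
# so the cutoff drops from 0.5 to 0.1 and more cases are flagged positive.
probs = [0.05, 0.2, 0.45, 0.8]
print(bayes_threshold(1, 9))      # 0.1
print(classify(probs, 1, 9))      # [0, 1, 1, 1]
```

With symmetric costs (c_fp == c_fn) the rule reduces to the usual 0.5 threshold; skewed costs shift the boundary toward the cheaper error, which is how cost-sensitive methods counter class imbalance without resampling the data.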

Cite this Paper


BibTeX
@InProceedings{pmlr-v183-silva-freitas-junior22a,
  title     = {Performance and model complexity on imbalanced datasets using resampling and cost-sensitive algorithms},
  author    = {da Silva Freitas Junior, Jairo and Pisani, Paulo Henrique},
  booktitle = {Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages     = {83--97},
  year      = {2022},
  editor    = {Moniz, Nuno and Branco, Paula and Torgo, Luís and Japkowicz, Nathalie and Wozniak, Michal and Wang, Shuo},
  volume    = {183},
  series    = {Proceedings of Machine Learning Research},
  month     = {23 Sep},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v183/silva-freitas-junior22a/silva-freitas-junior22a.pdf},
  url       = {https://proceedings.mlr.press/v183/silva-freitas-junior22a.html},
  abstract  = {Imbalanced datasets occur across industries, and many applications of high economic interest, such as fraud detection and churn prediction, deal with them. Resampling is commonly used to overcome the tendency of machine learning algorithms to favor minimizing the error on the majority class, while cost-sensitive algorithms are less used. In this paper, cost-sensitive algorithms (BayesMinimumRisk, Thresholding, Cost-Sensitive Decision Tree and Cost-Sensitive Random Forest) and resampling techniques (SMOTE, SMOTETomek and TomekLinks) combined with kNN, Decision Tree, Random Forest and AdaBoost were compared on binary classification problems. The results were analyzed with respect to relative performance over different imbalance ratios. The influence of these class-imbalance handling techniques on the complexity of the machine learning models was also investigated. The experiments were performed using synthetic datasets and 90 real-world datasets.}
}
Endnote
%0 Conference Paper
%T Performance and model complexity on imbalanced datasets using resampling and cost-sensitive algorithms
%A Jairo da Silva Freitas Junior
%A Paulo Henrique Pisani
%B Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications
%C Proceedings of Machine Learning Research
%D 2022
%E Nuno Moniz
%E Paula Branco
%E Luís Torgo
%E Nathalie Japkowicz
%E Michal Wozniak
%E Shuo Wang
%F pmlr-v183-silva-freitas-junior22a
%I PMLR
%P 83--97
%U https://proceedings.mlr.press/v183/silva-freitas-junior22a.html
%V 183
%X Imbalanced datasets occur across industries, and many applications of high economic interest, such as fraud detection and churn prediction, deal with them. Resampling is commonly used to overcome the tendency of machine learning algorithms to favor minimizing the error on the majority class, while cost-sensitive algorithms are less used. In this paper, cost-sensitive algorithms (BayesMinimumRisk, Thresholding, Cost-Sensitive Decision Tree and Cost-Sensitive Random Forest) and resampling techniques (SMOTE, SMOTETomek and TomekLinks) combined with kNN, Decision Tree, Random Forest and AdaBoost were compared on binary classification problems. The results were analyzed with respect to relative performance over different imbalance ratios. The influence of these class-imbalance handling techniques on the complexity of the machine learning models was also investigated. The experiments were performed using synthetic datasets and 90 real-world datasets.
APA
da Silva Freitas Junior, J. & Pisani, P.H. (2022). Performance and model complexity on imbalanced datasets using resampling and cost-sensitive algorithms. Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research 183:83-97. Available from https://proceedings.mlr.press/v183/silva-freitas-junior22a.html.