DistSMOGN: Distributed SMOGN for Imbalanced Regression Problems

Xin Yue Song, Nam Dao, Paula Branco
Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 183:38-52, 2022.

Abstract

Imbalanced domains pose important challenges to learning systems, and multiple resampling solutions have been put forward in the past two decades. More recently, it became clear that the imbalance problem arises in several other tasks, including regression. Although several resampling solutions were proposed to tackle the imbalanced regression problem, the emergence of big data has made this problem more difficult, as these solutions become infeasible for large volumes of data. In this paper, we propose the first distributed resampling solution for imbalanced regression that is applicable to large amounts of data. Our algorithm, DistSMOGN, is a resampling solution based on SMOGN that simultaneously addresses the imbalanced regression problem and the challenge of dealing with high volumes of data. We apply Scalable KMeans++ as a way to obtain coherent clusters that maintain the spatial relationships between the rare cases. Then, we apply the well-known SMOGN method in each cluster to obtain the new synthetic examples. This method allows us to generate high-quality synthetic examples while dealing with large volumes of data. Our solution is based on the MapReduce paradigm and we propose an efficient implementation on Apache Spark. The experimental evaluation carried out shows the advantages of DistSMOGN. All the code implementing DistSMOGN is freely available and can be downloaded at https://github.com/ndao1104/distributed-resampling.
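
The cluster-then-resample idea in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: plain Lloyd's k-means stands in for Scalable KMeans++, simple interpolation between pairs of rare cases stands in for SMOGN (which also adds Gaussian noise and undersamples normal cases), and a Python dictionary stands in for the Spark map and reduce stages. The names `kmeans`, `oversample_cluster`, `dist_resample`, and `is_rare` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (a stand-in for Scalable KMeans++).

    Returns one cluster label per point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def oversample_cluster(rows, n_new, rng):
    """Interpolate between random pairs of rare cases inside one cluster.

    Simplified stand-in for SMOGN: features and target are interpolated
    with the same random weight, and no Gaussian noise is added.
    """
    if len(rows) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        (xa, ya), (xb, yb) = rng.sample(rows, 2)
        t = rng.random()
        x = tuple(a + t * (b - a) for a, b in zip(xa, xb))
        synthetic.append((x, ya + t * (yb - ya)))
    return synthetic


def dist_resample(data, is_rare, k=2, n_new_per_cluster=3, seed=0):
    """Cluster the rare cases, then oversample each cluster independently."""
    rng = random.Random(seed)
    rare = [(x, y) for x, y in data if is_rare(y)]
    if not rare:
        return list(data)
    labels = kmeans([x for x, _ in rare], min(k, len(rare)), seed=seed)
    # "Map": key every rare case by its cluster id.
    clusters = defaultdict(list)
    for row, lab in zip(rare, labels):
        clusters[lab].append(row)
    # "Reduce": each cluster is resampled on its own, so in a distributed
    # setting the clusters could be processed on different workers.
    synthetic = []
    for rows in clusters.values():
        synthetic.extend(oversample_cluster(rows, n_new_per_cluster, rng))
    return list(data) + synthetic
```

Because synthetic cases are generated only from pairs within the same cluster, interpolation never bridges distant groups of rare cases, which is the spatial coherence the abstract attributes to the clustering step.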

Cite this Paper


BibTeX
@InProceedings{pmlr-v183-song22a,
  title = {DistSMOGN: Distributed SMOGN for Imbalanced Regression Problems},
  author = {Song, Xin Yue and Dao, Nam and Branco, Paula},
  booktitle = {Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages = {38--52},
  year = {2022},
  editor = {Moniz, Nuno and Branco, Paula and Torgo, Luís and Japkowicz, Nathalie and Wozniak, Michal and Wang, Shuo},
  volume = {183},
  series = {Proceedings of Machine Learning Research},
  month = {23 Sep},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v183/song22a/song22a.pdf},
  url = {https://proceedings.mlr.press/v183/song22a.html},
  abstract = {Imbalanced domains pose important challenges to learning systems, and multiple resampling solutions have been put forward in the past two decades. More recently, it became clear that the imbalance problem arises in several other tasks, including regression. Although several resampling solutions were proposed to tackle the imbalanced regression problem, the emergence of big data has made this problem more difficult, as these solutions become infeasible for large volumes of data. In this paper, we propose the first distributed resampling solution for imbalanced regression that is applicable to large amounts of data. Our algorithm, DistSMOGN, is a resampling solution based on SMOGN that simultaneously addresses the imbalanced regression problem and the challenge of dealing with high volumes of data. We apply Scalable KMeans++ as a way to obtain coherent clusters that maintain the spatial relationships between the rare cases. Then, we apply the well-known SMOGN method in each cluster to obtain the new synthetic examples. This method allows us to generate high-quality synthetic examples while dealing with large volumes of data. Our solution is based on the MapReduce paradigm and we propose an efficient implementation on Apache Spark. The experimental evaluation carried out shows the advantages of DistSMOGN. All the code implementing DistSMOGN is freely available and can be downloaded at https://github.com/ndao1104/distributed-resampling.}
}
Endnote
%0 Conference Paper
%T DistSMOGN: Distributed SMOGN for Imbalanced Regression Problems
%A Xin Yue Song
%A Nam Dao
%A Paula Branco
%B Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications
%C Proceedings of Machine Learning Research
%D 2022
%E Nuno Moniz
%E Paula Branco
%E Luís Torgo
%E Nathalie Japkowicz
%E Michal Wozniak
%E Shuo Wang
%F pmlr-v183-song22a
%I PMLR
%P 38--52
%U https://proceedings.mlr.press/v183/song22a.html
%V 183
%X Imbalanced domains pose important challenges to learning systems, and multiple resampling solutions have been put forward in the past two decades. More recently, it became clear that the imbalance problem arises in several other tasks, including regression. Although several resampling solutions were proposed to tackle the imbalanced regression problem, the emergence of big data has made this problem more difficult, as these solutions become infeasible for large volumes of data. In this paper, we propose the first distributed resampling solution for imbalanced regression that is applicable to large amounts of data. Our algorithm, DistSMOGN, is a resampling solution based on SMOGN that simultaneously addresses the imbalanced regression problem and the challenge of dealing with high volumes of data. We apply Scalable KMeans++ as a way to obtain coherent clusters that maintain the spatial relationships between the rare cases. Then, we apply the well-known SMOGN method in each cluster to obtain the new synthetic examples. This method allows us to generate high-quality synthetic examples while dealing with large volumes of data. Our solution is based on the MapReduce paradigm and we propose an efficient implementation on Apache Spark. The experimental evaluation carried out shows the advantages of DistSMOGN. All the code implementing DistSMOGN is freely available and can be downloaded at https://github.com/ndao1104/distributed-resampling.
APA
Song, X.Y., Dao, N. & Branco, P. (2022). DistSMOGN: Distributed SMOGN for Imbalanced Regression Problems. Proceedings of the Fourth International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research 183:38-52. Available from https://proceedings.mlr.press/v183/song22a.html.
