On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors

Ehsan Nazari, Paula Branco
Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 154:76-89, 2021.

Abstract

Over the last two decades, several approaches have been proposed to tackle the class imbalance problem, which is characterized by the inability of a learner to focus on a relevant but scarcely represented class. The generation of synthetic examples to oversample the training set, and thus force the learner to focus on the important cases, is one such solution. Recently, generative adversarial networks (GANs) have started to be explored as an oversampling alternative due to their capability of generating samples from an implicit distribution. Still, data difficulty factors such as class overlap, data dimensionality, or sample size have also been shown to negatively impact the learner's performance in an imbalanced setting. The ability of GANs to deal with the imbalance problem and other data difficulty factors has not yet been assessed. The main goal of this paper is to understand how data difficulty factors impact the performance of GANs when they are used as an oversampling method. Namely, we study the performance of conditioned GANs (CGANs) on an image dataset with controlled levels of the following data difficulty factors: sample size, data dimensionality, class overlap, and imbalance ratio. We show that CGANs are effective for tackling tasks with multiple data difficulty factors, exhibiting increased gains on the most difficult tasks.
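As a rough illustration of what controlling the imbalance ratio and then oversampling involves, the following Python sketch builds an imbalanced label set and counts how many synthetic examples each class would need to restore balance. It is not the authors' experimental code: the function names `subsample_minority` and `synthetic_needed` are hypothetical, and the imbalance ratio is assumed here to mean the size of the largest other class divided by the size of the minority class. In the setting the abstract describes, a CGAN conditioned on the class label would be the component that supplies those synthetic examples.

```python
import random
from collections import Counter


def subsample_minority(labels, minority, ratio, seed=0):
    """Keep every non-minority example and about 1/ratio as many minority ones.

    `ratio` is the assumed imbalance ratio: size of the largest other class
    divided by the size of the minority class after subsampling.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    other_idx = [i for i, y in enumerate(labels) if y != minority]
    largest_other = max(Counter(labels[i] for i in other_idx).values())
    keep = max(1, largest_other // ratio)
    kept_minority = rng.sample(minority_idx, min(keep, len(minority_idx)))
    return sorted(other_idx + kept_minority)


def synthetic_needed(labels):
    """Synthetic examples per class needed to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {c: target - n for c, n in counts.items()}


# Toy labels (an assumption for illustration): 100 examples per class.
labels = [0] * 100 + [1] * 100
idx = subsample_minority(labels, minority=1, ratio=10)
imbalanced = [labels[i] for i in idx]

print(dict(Counter(imbalanced)))      # {0: 100, 1: 10}
print(synthetic_needed(imbalanced))   # {0: 0, 1: 90}
```

The per-class counts returned by `synthetic_needed` are the number of samples an oversampler would be asked to generate for each label; sample size, dimensionality, and class overlap would need their own controls and are not modeled here.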

Cite this Paper


BibTeX
@InProceedings{pmlr-v154-nazari21a,
  title     = {On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors},
  author    = {Nazari, Ehsan and Branco, Paula},
  booktitle = {Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages     = {76--89},
  year      = {2021},
  editor    = {Moniz, Nuno and Branco, Paula and Torgo, Luis and Japkowicz, Nathalie and Woźniak, Michał and Wang, Shuo},
  volume    = {154},
  series    = {Proceedings of Machine Learning Research},
  month     = {17 Sep},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v154/nazari21a/nazari21a.pdf},
  url       = {https://proceedings.mlr.press/v154/nazari21a.html},
  abstract  = {Over the last two decades, several approaches have been proposed to tackle the class imbalance problem, which is characterized by the inability of a learner to focus on a relevant but scarcely represented class. The generation of synthetic examples to oversample the training set, and thus force the learner to focus on the important cases, is one such solution. Recently, generative adversarial networks (GANs) have started to be explored as an oversampling alternative due to their capability of generating samples from an implicit distribution. Still, data difficulty factors such as class overlap, data dimensionality, or sample size have also been shown to negatively impact the learner's performance in an imbalanced setting. The ability of GANs to deal with the imbalance problem and other data difficulty factors has not yet been assessed. The main goal of this paper is to understand how data difficulty factors impact the performance of GANs when they are used as an oversampling method. Namely, we study the performance of conditioned GANs (CGANs) on an image dataset with controlled levels of the following data difficulty factors: sample size, data dimensionality, class overlap, and imbalance ratio. We show that CGANs are effective for tackling tasks with multiple data difficulty factors, exhibiting increased gains on the most difficult tasks.}
}
Endnote
%0 Conference Paper
%T On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors
%A Ehsan Nazari
%A Paula Branco
%B Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications
%C Proceedings of Machine Learning Research
%D 2021
%E Nuno Moniz
%E Paula Branco
%E Luis Torgo
%E Nathalie Japkowicz
%E Michał Woźniak
%E Shuo Wang
%F pmlr-v154-nazari21a
%I PMLR
%P 76--89
%U https://proceedings.mlr.press/v154/nazari21a.html
%V 154
%X Over the last two decades, several approaches have been proposed to tackle the class imbalance problem, which is characterized by the inability of a learner to focus on a relevant but scarcely represented class. The generation of synthetic examples to oversample the training set, and thus force the learner to focus on the important cases, is one such solution. Recently, generative adversarial networks (GANs) have started to be explored as an oversampling alternative due to their capability of generating samples from an implicit distribution. Still, data difficulty factors such as class overlap, data dimensionality, or sample size have also been shown to negatively impact the learner's performance in an imbalanced setting. The ability of GANs to deal with the imbalance problem and other data difficulty factors has not yet been assessed. The main goal of this paper is to understand how data difficulty factors impact the performance of GANs when they are used as an oversampling method. Namely, we study the performance of conditioned GANs (CGANs) on an image dataset with controlled levels of the following data difficulty factors: sample size, data dimensionality, class overlap, and imbalance ratio. We show that CGANs are effective for tackling tasks with multiple data difficulty factors, exhibiting increased gains on the most difficult tasks.
APA
Nazari, E. & Branco, P. (2021). On Oversampling via Generative Adversarial Networks under Different Data Difficulty Factors. Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research 154:76-89. Available from https://proceedings.mlr.press/v154/nazari21a.html.
