Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets

Adam Nickerson, Nathalie Japkowicz, Evangelos E. Milios
Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, PMLR R3:224-228, 2001.

Abstract

The class imbalance problem causes a classifier to over-fit the data belonging to the class with the greatest number of training examples. The purpose of this paper is to argue that methods that equalize class membership are not as effective as possible when applied blindly and that improvements can be obtained by adjusting for the within-class imbalance. A guided resampling technique is proposed and tested within a simpler letter recognition domain and a more difficult text classification domain. A fast unsupervised clustering technique, Principal Direction Divisive Partitioning (PDDP), is used to determine the internal characteristics of each class. Performance in categories that suffer from a large between-class imbalance (few positive examples) is shown to improve when the guided resampling method is used.

Cite this Paper


BibTeX
@InProceedings{pmlr-vR3-nickerson01a,
  title = {Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets},
  author = {Nickerson, Adam and Japkowicz, Nathalie and Milios, Evangelos E.},
  booktitle = {Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics},
  pages = {224--228},
  year = {2001},
  editor = {Richardson, Thomas S. and Jaakkola, Tommi S.},
  volume = {R3},
  series = {Proceedings of Machine Learning Research},
  month = {04--07 Jan},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/r3/nickerson01a/nickerson01a.pdf},
  url = {http://proceedings.mlr.press/r3/nickerson01a.html},
  abstract = {The class imbalance problem causes a classifier to over-fit the data belonging to the class with the greatest number of training examples. The purpose of this paper is to argue that methods that equalize class membership are not as effective as possible when applied blindly and that improvements can be obtained by adjusting for the within-class imbalance. A guided resampling technique is proposed and tested within a simpler letter recognition domain and a more difficult text classification domain. A fast unsupervised clustering technique, Principal Direction Divisive Partitioning (PDDP), is used to determine the internal characteristics of each class. The performance improvement in categories that suffer from a large between-class imbalance (few positive examples) are shown to be improved when using the guided resampling method.},
  note = {Reissued by PMLR on 31 March 2021.}
}
Endnote
%0 Conference Paper
%T Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets
%A Adam Nickerson
%A Nathalie Japkowicz
%A Evangelos E. Milios
%B Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2001
%E Thomas S. Richardson
%E Tommi S. Jaakkola
%F pmlr-vR3-nickerson01a
%I PMLR
%P 224--228
%U http://proceedings.mlr.press/r3/nickerson01a.html
%V R3
%X The class imbalance problem causes a classifier to over-fit the data belonging to the class with the greatest number of training examples. The purpose of this paper is to argue that methods that equalize class membership are not as effective as possible when applied blindly and that improvements can be obtained by adjusting for the within-class imbalance. A guided resampling technique is proposed and tested within a simpler letter recognition domain and a more difficult text classification domain. A fast unsupervised clustering technique, Principal Direction Divisive Partitioning (PDDP), is used to determine the internal characteristics of each class. The performance improvement in categories that suffer from a large between-class imbalance (few positive examples) are shown to be improved when using the guided resampling method.
%Z Reissued by PMLR on 31 March 2021.
APA
Nickerson, A., Japkowicz, N. & Milios, E.E. (2001). Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets. Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research R3:224-228. Available from http://proceedings.mlr.press/r3/nickerson01a.html. Reissued by PMLR on 31 March 2021.
