Classification of Imbalanced Marketing Data with Balanced Random Sets

Vladimir Nikulin; Geoffrey J. McLachlan

Proceedings of KDD-Cup 2009 Competition, PMLR 7:89-100, 2009.

Abstract

With imbalanced data, a classifier built using all of the data tends to ignore the minority class. To overcome this problem, we propose an ensemble classifier constructed from a large number of relatively small, balanced subsets, in which representatives of both classes are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent the random subsets and whose columns represent the features. Based on this matrix, we assess how stable the influence of each feature is, and we propose keeping in the model only features with stable influence. The final model represents an average of the base-learners, which is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and we propose reducing the original data to the simplest sparse binary format, which is particularly convenient for the construction of decision trees. As a result, each feature is represented by several binary variables, or bins, which are equivalent in terms of data structure. This property is important and may be used for feature selection. The proposed method exploits not only the contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results on the KDD-2009 competition datasets are presented.
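The idea of balanced random sets can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic data, the per-class subset size, and in particular the stability score (absolute mean of a coefficient divided by its standard deviation across subsets) and its threshold are assumptions made here for illustration only. Each base-learner is an ordinary least-squares fit on a subset holding equal numbers of minority and majority examples, and the fitted coefficients are stacked into a matrix with one row per subset and one column per feature.

```python
import numpy as np


def fit_balanced_random_sets(X, y, n_sets=200, per_class=50, seed=0):
    """Fit one least-squares model per balanced random subset.

    Returns a matrix of shape (n_sets, n_features): rows are random
    subsets, columns are features. All parameter names are hypothetical.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    k = min(per_class, len(pos), len(neg))  # equal count from each class
    coefs = np.empty((n_sets, X.shape[1]))
    for i in range(n_sets):
        idx = np.concatenate([
            rng.choice(pos, size=k, replace=False),
            rng.choice(neg, size=k, replace=False),
        ])
        # Append a column of ones so the model has an intercept term.
        A = np.column_stack([X[idx], np.ones(len(idx))])
        w, *_ = np.linalg.lstsq(A, y[idx].astype(float), rcond=None)
        coefs[i] = w[:-1]  # drop the intercept, keep feature weights
    return coefs


def stable_features(coefs, threshold=2.0):
    """Assumed stability score: |mean| / std of each column of coefs."""
    score = np.abs(coefs.mean(axis=0)) / (coefs.std(axis=0) + 1e-12)
    return np.flatnonzero(score >= threshold), score


# Synthetic imbalanced data: only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.8).astype(int)

coefs = fit_balanced_random_sets(X, y)
kept, score = stable_features(coefs)
print("minority fraction:", y.mean())
print("stability scores:", np.round(score, 2))
print("kept features:", kept)
```

Averaging the rows of `coefs` over the kept columns would give one simple averaged linear model; the abstract notes that the final model need not be a linear regression, so this is only one possible choice of base-learner.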

Cite this Paper

BibTeX

@InProceedings{pmlr-v7-nikulin09,
  title = 	 {Classification of Imbalanced Marketing Data with Balanced Random Sets},
  author = 	 {Nikulin, Vladimir and McLachlan, Geoffrey J.},
  booktitle = 	 {Proceedings of KDD-Cup 2009 Competition},
  pages = 	 {89--100},
  year = 	 {2009},
  editor = 	 {Dror, Gideon and Boullé, Marc and Guyon, Isabelle and Lemaire, Vincent and Vogel, David},
  volume = 	 {7},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {28 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v7/nikulin09/nikulin09.pdf},
  url = 	 {https://proceedings.mlr.press/v7/nikulin09.html},
  abstract = 	 {With imbalanced data a classifier built using all of the data has the tendency to ignore the minority class. To overcome this problem, we propose to use an ensemble classifier constructed on the basis of a large number of relatively small and balanced subsets, where representatives from both patterns are to be selected randomly. As an outcome, the system produces the matrix of linear regression coefficients whose rows represent the random subsets and the columns represent the features. Based on this matrix, we make an assessment of how stable the influence of a particular feature is. It is proposed to keep in the model only features with stable influence. The final model represents an average of the base-learners, which is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and it is proposed to reduce the original data to the most simple binary sparse format, which is particularly convenient for the construction of decision trees. As a result, any particular feature will be represented by several binary variables or bins, which are absolutely equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results against KDD-2009 competition datasets are presented.}
}

Endnote

%0 Conference Paper
%T Classification of Imbalanced Marketing Data with Balanced Random Sets
%A Vladimir Nikulin
%A Geoffrey J. McLachlan
%B Proceedings of KDD-Cup 2009 Competition
%C Proceedings of Machine Learning Research
%D 2009
%E Gideon Dror
%E Marc Boullé
%E Isabelle Guyon
%E Vincent Lemaire
%E David Vogel
%F pmlr-v7-nikulin09
%I PMLR
%P 89--100
%U https://proceedings.mlr.press/v7/nikulin09.html
%V 7
%X With imbalanced data a classifier built using all of the data has the tendency to ignore the minority class. To overcome this problem, we propose to use an ensemble classifier constructed on the basis of a large number of relatively small and balanced subsets, where representatives from both patterns are to be selected randomly. As an outcome, the system produces the matrix of linear regression coefficients whose rows represent the random subsets and the columns represent the features. Based on this matrix, we make an assessment of how stable the influence of a particular feature is. It is proposed to keep in the model only features with stable influence. The final model represents an average of the base-learners, which is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and it is proposed to reduce the original data to the most simple binary sparse format, which is particularly convenient for the construction of decision trees. As a result, any particular feature will be represented by several binary variables or bins, which are absolutely equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results against KDD-2009 competition datasets are presented.

RIS

TY  - CPAPER
TI  - Classification of Imbalanced Marketing Data with Balanced Random Sets
AU  - Vladimir Nikulin
AU  - Geoffrey J. McLachlan
BT  - Proceedings of KDD-Cup 2009 Competition
DA  - 2009/12/04
ED  - Gideon Dror
ED  - Marc Boullé
ED  - Isabelle Guyon
ED  - Vincent Lemaire
ED  - David Vogel
ID  - pmlr-v7-nikulin09
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 7
SP  - 89
EP  - 100
L1  - http://proceedings.mlr.press/v7/nikulin09/nikulin09.pdf
UR  - https://proceedings.mlr.press/v7/nikulin09.html
AB  - With imbalanced data a classifier built using all of the data has the tendency to ignore the minority class. To overcome this problem, we propose to use an ensemble classifier constructed on the basis of a large number of relatively small and balanced subsets, where representatives from both patterns are to be selected randomly. As an outcome, the system produces the matrix of linear regression coefficients whose rows represent the random subsets and the columns represent the features. Based on this matrix, we make an assessment of how stable the influence of a particular feature is. It is proposed to keep in the model only features with stable influence. The final model represents an average of the base-learners, which is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and it is proposed to reduce the original data to the most simple binary sparse format, which is particularly convenient for the construction of decision trees. As a result, any particular feature will be represented by several binary variables or bins, which are absolutely equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results against KDD-2009 competition datasets are presented.
ER  -

APA

Nikulin, V. & McLachlan, G.J. (2009). Classification of Imbalanced Marketing Data with Balanced Random Sets. Proceedings of KDD-Cup 2009 Competition, in Proceedings of Machine Learning Research 7:89-100. Available from https://proceedings.mlr.press/v7/nikulin09.html.
