Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets

Hongxin Wei; Lue Tao; Renchunzi Xie; Lei Feng; Bo An

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:23615-23630, 2022.

Abstract

Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
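The label-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract states only that the label distribution is "complementary" to the original class priors. The specific form used here (a constant `alpha` minus each prior, then normalized), the default choice of `alpha`, and the function names `complementary_distribution` and `sample_open_set_labels` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def complementary_distribution(class_counts, alpha=None):
    """Return a label distribution that favors rare classes.

    Assumed form: weight_j = alpha - prior_j, normalized to sum to 1.
    """
    counts = np.asarray(class_counts, dtype=float)
    priors = counts / counts.sum()
    if alpha is None:
        # Hypothetical default: keeps every weight strictly positive.
        alpha = priors.max() + priors.min()
    if alpha < priors.max():
        raise ValueError("alpha must be at least the largest class prior")
    weights = alpha - priors
    return weights / weights.sum()


def sample_open_set_labels(num_open_set, class_counts, seed=0):
    """Draw one noisy label per open-set instance from the distribution above."""
    rng = np.random.default_rng(seed)
    probs = complementary_distribution(class_counts)
    return rng.choice(len(probs), size=num_open_set, p=probs)


# Toy long-tailed dataset with three classes (illustrative counts).
counts = [900, 90, 10]
print(complementary_distribution(counts))
print(np.bincount(sample_open_set_labels(1000, counts), minlength=3))
```

Under this assumed form, the head class receives the smallest share of open-set labels and the tail class the largest, so the combined label counts move toward balance.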

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-wei22c,
  title = 	 {Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets},
  author =       {Wei, Hongxin and Tao, Lue and Xie, Renchunzi and Feng, Lei and An, Bo},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {23615--23630},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/wei22c/wei22c.pdf},
  url = 	 {https://proceedings.mlr.press/v162/wei22c.html},
  abstract = 	 {Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.}
}

Endnote

%0 Conference Paper
%T Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets
%A Hongxin Wei
%A Lue Tao
%A Renchunzi Xie
%A Lei Feng
%A Bo An
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-wei22c
%I PMLR
%P 23615--23630
%U https://proceedings.mlr.press/v162/wei22c.html
%V 162
%X Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.

APA


Wei, H., Tao, L., Xie, R., Feng, L. & An, B. (2022). Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:23615-23630. Available from https://proceedings.mlr.press/v162/wei22c.html.
