Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:11255-11287, 2025.

Abstract

Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
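
To make the abstract's central objects concrete, the sketch below implements a generic hinge-style margin loss with class-dependent confidence margins. It is an illustration only: the function name, the rho_y proportional to n_y^(-1/4) margin schedule, and all other details are assumptions for exposition, not the paper's actual class-imbalanced margin loss or the IMMAX algorithm.

import numpy as np

def class_imbalanced_margin_loss(scores, y, rho):
    # Hypothetical sketch of a hinge-style margin loss with per-class margins.
    # scores: (n, k) float array of class scores h(x); y: (n,) integer labels;
    # rho: (k,) per-class confidence margins (larger for rarer classes).
    n = scores.shape[0]
    correct = scores[np.arange(n), y]
    competitors = scores.copy()
    competitors[np.arange(n), y] = -np.inf
    # Confidence margin: true-class score minus the best rival score.
    margins = correct - competitors.max(axis=1)
    # Hinge penalty that demands a larger target margin rho[y] for rarer classes.
    return np.maximum(0.0, 1.0 - margins / rho[y]).mean()

# Hypothetical usage: margins grow as class frequency shrinks, here with
# rho_y proportional to n_y^(-1/4), a common heuristic in the imbalanced-learning
# literature; the paper's actual choice may differ.
counts = np.array([1000.0, 100.0, 10.0])  # per-class training counts
rho = counts ** -0.25
rho /= rho.min()                          # most frequent class -> margin 1
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))
y = rng.integers(0, 3, size=5)
print(class_imbalanced_margin_loss(scores, y, rho))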

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-cortes25a,
  title     = {Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data},
  author    = {Cortes, Corinna and Mao, Anqi and Mohri, Mehryar and Zhong, Yutao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {11255--11287},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/cortes25a/cortes25a.pdf},
  url       = {https://proceedings.mlr.press/v267/cortes25a.html}
}
Endnote
%0 Conference Paper
%T Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data
%A Corinna Cortes
%A Anqi Mao
%A Mehryar Mohri
%A Yutao Zhong
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-cortes25a
%I PMLR
%P 11255--11287
%U https://proceedings.mlr.press/v267/cortes25a.html
%V 267
APA
Cortes, C., Mao, A., Mohri, M. & Zhong, Y. (2025). Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:11255-11287. Available from https://proceedings.mlr.press/v267/cortes25a.html.
