RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

Atif Hassan, Swanand Khare, Jiaul H. Paik
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1648-1662, 2025.

Abstract

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of the input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across both balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to include in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using the class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes high-loss samples when populating the class-wise subsets. We evaluate RCAP on six diverse datasets, ranging from class-balanced to highly imbalanced, using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ of the data, RCAP delivers a $>1\%$ improvement in performance on class-imbalanced datasets compared to full-data training, while providing an average $8.69\times$ speedup. The code can be accessed at https://github.com/atif-hassan/RCAP-dynamic-dataset-pruning
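The abstract describes two stages: a per-class keep fraction driven by class-wise aggregated loss, followed by loss-prioritized sampling within each class. The sketch below illustrates that flow in plain NumPy. It is not the authors' implementation (see the linked repository for that): the paper's closed-form estimator of the per-class fractions is replaced by a simple loss-proportional heuristic, and names such as rcap_select_indices and temperature are hypothetical.

import numpy as np

def rcap_select_indices(losses, labels, keep_frac, temperature=1.0, rng=None):
    # Illustrative sketch only: the closed-form per-class fraction estimate
    # from the paper is replaced by a loss-proportional heuristic, and
    # `temperature` is a hypothetical knob, not a parameter from the paper.
    rng = np.random.default_rng() if rng is None else rng
    losses = np.asarray(losses, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    budget = int(round(keep_frac * len(labels)))

    # Stage 1: split the global budget across classes in proportion to each
    # class's aggregated (mean) loss, so harder classes keep more samples.
    class_loss = np.array([losses[labels == c].mean() for c in classes])
    class_budget = np.maximum(
        1, np.round(budget * class_loss / class_loss.sum()).astype(int)
    )

    # Stage 2: within each class, sample without replacement with
    # probabilities that increase with per-sample loss (softmax over losses).
    selected = []
    for c, k in zip(classes, class_budget):
        idx = np.flatnonzero(labels == c)
        k = min(int(k), len(idx))
        logits = losses[idx] / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        selected.append(rng.choice(idx, size=k, replace=False, p=probs))
    return np.concatenate(selected)

In a dynamic-pruning training loop, the per-sample losses would be recomputed each epoch and the returned indices fed to a standard subset sampler, so only the selected examples contribute to that epoch's gradient updates.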

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-hassan25a,
  title     = {RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning},
  author    = {Hassan, Atif and Khare, Swanand and Paik, Jiaul H.},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {1648--1662},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/hassan25a/hassan25a.pdf},
  url       = {https://proceedings.mlr.press/v286/hassan25a.html}
}
APA
Hassan, A., Khare, S., & Paik, J. H. (2025). RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1648-1662. Available from https://proceedings.mlr.press/v286/hassan25a.html.