When Dynamic Data Selection Meets Data Augmentation: Achieving Enhanced Training Acceleration

Suorong Yang, Peng Ye, Furao Shen, Dongzhan Zhou
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:71508-71520, 2025.

Abstract

Dynamic data selection aims to accelerate training without sacrificing performance. However, reducing the training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection, so directly combining the two techniques fails to fully exploit their synergies. To tackle this challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample’s joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k with no loss in performance. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yang25as,
  title     = {When Dynamic Data Selection Meets Data Augmentation: Achieving Enhanced Training Acceleration},
  author    = {Yang, Suorong and Ye, Peng and Shen, Furao and Zhou, Dongzhan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {71508--71520},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yang25as/yang25as.pdf},
  url       = {https://proceedings.mlr.press/v267/yang25as.html}
}
Endnote
%0 Conference Paper
%T When Dynamic Data Selection Meets Data Augmentation: Achieving Enhanced Training Acceleration
%A Suorong Yang
%A Peng Ye
%A Furao Shen
%A Dongzhan Zhou
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yang25as
%I PMLR
%P 71508--71520
%U https://proceedings.mlr.press/v267/yang25as.html
%V 267
APA
Yang, S., Ye, P., Shen, F., & Zhou, D. (2025). When Dynamic Data Selection Meets Data Augmentation: Achieving Enhanced Training Acceleration. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:71508-71520. Available from https://proceedings.mlr.press/v267/yang25as.html.
