Masking the Unknown: Leveraging Masked Samples for Enhanced Data Augmentation

Xun Yao, Zijian Huang, Xinrong Hu, Jie Yang, Yi Guo
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:3997-4010, 2024.

Abstract

Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variance, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-art methods.

Cite this Paper


BibTeX
@InProceedings{pmlr-v244-yao24b,
  title = {Masking the Unknown: Leveraging Masked Samples for Enhanced Data Augmentation},
  author = {Yao, Xun and Huang, Zijian and Hu, Xinrong and Yang, Jie and Guo, Yi},
  booktitle = {Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence},
  pages = {3997--4010},
  year = {2024},
  editor = {Kiyavash, Negar and Mooij, Joris M.},
  volume = {244},
  series = {Proceedings of Machine Learning Research},
  month = {15--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v244/main/assets/yao24b/yao24b.pdf},
  url = {https://proceedings.mlr.press/v244/yao24b.html},
  abstract = {Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variances, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-arts.}
}
Endnote
%0 Conference Paper
%T Masking the Unknown: Leveraging Masked Samples for Enhanced Data Augmentation
%A Xun Yao
%A Zijian Huang
%A Xinrong Hu
%A Jie Yang
%A Yi Guo
%B Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2024
%E Negar Kiyavash
%E Joris M. Mooij
%F pmlr-v244-yao24b
%I PMLR
%P 3997--4010
%U https://proceedings.mlr.press/v244/yao24b.html
%V 244
%X Data Augmentation (DA) has become a widely adopted strategy for addressing data scarcity in numerous NLP tasks, especially in scenarios with limited resources or imbalanced classes. However, many existing augmentation techniques rely on randomness or additional resources, presenting challenges in both performance and practical implementation. Furthermore, there is a lack of exploration into what constitutes effective augmentation. In this paper, we systematically evaluate existing DA methods across a comprehensive range of text-classification benchmarks. The empirical analysis highlights that the most significant change resulting from augmentation is observed in the data variance. This observation inspires the proposed approach, termed Mask-for-Data Augmentation (M4DA), which strategically masks tokens from original samples for augmentation. Specifically, M4DA consists of a Variance-Oriented Masker Module (VMM), which ensures an increase in data variances, and a Complexity-Enhanced Selection Module (CSM), designed to select the augmented sample with the highest semantic complexity. The effectiveness of the proposed method is empirically validated across various text-classification benchmarks, including scenarios with limited or full resources and imbalanced classes. Experimental results demonstrate considerable improvements over state-of-the-arts.
APA
Yao, X., Huang, Z., Hu, X., Yang, J. & Guo, Y. (2024). Masking the Unknown: Leveraging Masked Samples for Enhanced Data Augmentation. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 244:3997-4010. Available from https://proceedings.mlr.press/v244/yao24b.html.
