Reliable-Data-Split (RDS): Maximizing Model Potential with Reinforced Selection Strategy

Hoang D. Nguyen, Xuan-Son Vu, Quoc-Tuan Truong, Duc-Trong Le
Reliable and Trustworthy Artificial Intelligence 2025, PMLR 310:73-89, 2025.

Abstract

The nexus between data characteristics and parametric models is fundamental for developing effective and reliable artificial intelligence (AI) systems. Mismatches in data properties for model development may lead to deleterious effects on AI model performance in machine learning practice. This paper proposes a Reliable Data Split (RDS) procedure to learn how to select data points that will generalise the target domain adequately by employing prior knowledge of the data generative process. We introduce a reinforced selection strategy using deep reinforcement learning with diverse black box predictors in maximising ensemble rewards as the proxy of model performance potential while maintaining an appropriate proportionate allocation and the independent and identically distributed (i.i.d.) assumption. A comprehensive evaluation of the RDS procedure is conducted on four real-world datasets, including Madelon, Drug Reviews, MNIST, and Kalapa Credit Scoring Challenge, with coverage of machine learning tasks such as binary classification, multi-class classification, and regression on multivariate, textual, and visual data. The experimental results demonstrate consistent performance improvements of trainable data samples over classical or prior data selection. Hence, we advocate the use of RDS for data splitting in the early stage of machine learning tasks for parameter tuning, model selection and overfitting prevention, as well as for sampling in large-scale AI competitions to search for the best possible and shift-stable solutions.
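The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a tabular policy (one logit per sample) trained with plain REINFORCE in place of a deep reinforcement learning agent, two toy NumPy predictors in place of the diverse black box predictors, and a simple penalty term in place of the paper's handling of proportionate allocation. It does not attempt to enforce the i.i.d. assumption. All function and parameter names (`reinforced_split`, `ensemble_reward`, `penalty`, and so on) are hypothetical.

```python
import numpy as np

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    # Toy predictor 1 (assumption): classify by the closer class mean.
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    return float(((d1 < d0).astype(int) == yte).mean())

def least_squares_acc(Xtr, ytr, Xte, yte):
    # Toy predictor 2 (assumption): linear least-squares fit on +/-1 targets.
    A = np.c_[Xtr, np.ones(len(Xtr))]
    w, *_ = np.linalg.lstsq(A, 2.0 * ytr - 1.0, rcond=None)
    pred = (np.c_[Xte, np.ones(len(Xte))] @ w > 0).astype(int)
    return float((pred == yte).mean())

def ensemble_reward(X, y, train_mask, train_ratio, penalty=1.0):
    tr, te = train_mask, ~train_mask
    # Degenerate splits (empty test side, or a class missing from train)
    # receive the lowest reward.
    if te.sum() == 0 or len(np.unique(y[tr])) < 2:
        return -1.0
    accs = [f(X[tr], y[tr], X[te], y[te])
            for f in (nearest_centroid_acc, least_squares_acc)]
    # Mean held-out accuracy of the ensemble, minus a penalty for drifting
    # away from the requested train proportion.
    return float(np.mean(accs)) - penalty * abs(tr.mean() - train_ratio)

def reinforced_split(X, y, train_ratio=0.7, episodes=200, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # One logit per sample, initialised so that P(train) = train_ratio.
    logits = np.full(n, np.log(train_ratio / (1.0 - train_ratio)))
    baseline = 0.0
    for _ in range(episodes):
        p = 1.0 / (1.0 + np.exp(-logits))
        mask = rng.random(n) < p  # sample a candidate train/test split
        r = ensemble_reward(X, y, mask, train_ratio)
        # REINFORCE: the gradient of the Bernoulli log-likelihood with
        # respect to the logits is (action - p).
        logits += lr * (r - baseline) * (mask.astype(float) - p)
        baseline = 0.9 * baseline + 0.1 * r  # moving-average baseline
    # Final split: the k samples with the highest train logits, so the
    # requested proportion holds exactly.
    k = int(round(train_ratio * n))
    train_mask = np.zeros(n, dtype=bool)
    train_mask[np.argsort(-logits)[:k]] = True
    return train_mask
```

Calling `reinforced_split(X, y, train_ratio=0.7)` on a binary-labelled array returns a boolean mask marking the training samples; `~mask` gives the held-out side. Taking the top-k logits at the end is a design shortcut in this sketch that fixes the split proportion exactly, rather than relying on the penalty alone.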

Cite this Paper


BibTeX
@InProceedings{pmlr-v310-nguyen25c,
  title = {Reliable-Data-Split (RDS): Maximizing Model Potential with Reinforced Selection Strategy},
  author = {Nguyen, Hoang D. and Vu, Xuan-Son and Truong, Quoc-Tuan and Le, Duc-Trong},
  booktitle = {Reliable and Trustworthy Artificial Intelligence 2025},
  pages = {73--89},
  year = {2025},
  editor = {Nguyen, Hoang D. and Le, Duc-Trong and Björklund, Johanna and Vu, Xuan-Son},
  volume = {310},
  series = {Proceedings of Machine Learning Research},
  month = {12 Dec},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v310/main/assets/nguyen25c/nguyen25c.pdf},
  url = {https://proceedings.mlr.press/v310/nguyen25c.html},
  abstract = {The nexus between data characteristics and parametric models is fundamental for developing effective and reliable artificial intelligence (AI) systems. Mismatches in data properties for model development may lead to deleterious effects on AI model performance in machine learning practice. This paper proposes a Reliable Data Split (RDS) procedure to learn how to select data points that will generalise the target domain adequately by employing prior knowledge of the data generative process. We introduce a reinforced selection strategy using deep reinforcement learning with diverse black box predictors in maximising ensemble rewards as the proxy of model performance potential while maintaining an appropriate proportionate allocation and the independent and identically distributed (i.i.d.) assumption. A comprehensive evaluation of the RDS procedure is conducted on four real-world datasets, including Madelon, Drug Reviews, MNIST, and Kalapa Credit Scoring Challenge, with coverage of machine learning tasks such as binary classification, multi-class classification, and regression on multivariate, textual, and visual data. The experimental results evidently demonstrate consistent performance improvements of trainable data samples over classical or prior data selection. Hence, we advocate the use of RDS for data splitting in the early stage of machine learning tasks for parameter tuning, model selection and overfitting prevention, as well as, sampling in large-scale AI competitions for searching the best possible and shift-stable solutions.}
}
Endnote
%0 Conference Paper
%T Reliable-Data-Split (RDS): Maximizing Model Potential with Reinforced Selection Strategy
%A Hoang D. Nguyen
%A Xuan-Son Vu
%A Quoc-Tuan Truong
%A Duc-Trong Le
%B Reliable and Trustworthy Artificial Intelligence 2025
%C Proceedings of Machine Learning Research
%D 2025
%E Hoang D. Nguyen
%E Duc-Trong Le
%E Johanna Björklund
%E Xuan-Son Vu
%F pmlr-v310-nguyen25c
%I PMLR
%P 73--89
%U https://proceedings.mlr.press/v310/nguyen25c.html
%V 310
%X The nexus between data characteristics and parametric models is fundamental for developing effective and reliable artificial intelligence (AI) systems. Mismatches in data properties for model development may lead to deleterious effects on AI model performance in machine learning practice. This paper proposes a Reliable Data Split (RDS) procedure to learn how to select data points that will generalise the target domain adequately by employing prior knowledge of the data generative process. We introduce a reinforced selection strategy using deep reinforcement learning with diverse black box predictors in maximising ensemble rewards as the proxy of model performance potential while maintaining an appropriate proportionate allocation and the independent and identically distributed (i.i.d.) assumption. A comprehensive evaluation of the RDS procedure is conducted on four real-world datasets, including Madelon, Drug Reviews, MNIST, and Kalapa Credit Scoring Challenge, with coverage of machine learning tasks such as binary classification, multi-class classification, and regression on multivariate, textual, and visual data. The experimental results evidently demonstrate consistent performance improvements of trainable data samples over classical or prior data selection. Hence, we advocate the use of RDS for data splitting in the early stage of machine learning tasks for parameter tuning, model selection and overfitting prevention, as well as, sampling in large-scale AI competitions for searching the best possible and shift-stable solutions.
APA
Nguyen, H.D., Vu, X., Truong, Q. & Le, D. (2025). Reliable-Data-Split (RDS): Maximizing Model Potential with Reinforced Selection Strategy. Reliable and Trustworthy Artificial Intelligence 2025, in Proceedings of Machine Learning Research 310:73-89. Available from https://proceedings.mlr.press/v310/nguyen25c.html.
