Training Data Soft Selection via Joint Density Ratio Estimation

Ryuta Matsuno
Proceedings of the 17th Asian Conference on Machine Learning, PMLR 304:1038-1053, 2025.

Abstract

This paper studies the training data selection problem, focusing on the selection of effective samples to improve model training using data affected by distributional shifts (i.e., data drifts). Existing drift-detection-based methods struggle with local drifts, while recent drift-localization-based methods lack theoretical support for the problem and are often ineffective. To tackle these issues, this paper proposes TSJD, a training data soft selection method based on joint density ratio estimation. TSJD assigns training weights (i.e., soft selects) to samples based on the estimated joint density ratio to align the selected data with the recent data distribution. By evaluating each sample independently of time, TSJD effectively addresses local data drifts. We also provide theoretical guarantees by deriving an upper bound on the generalization error for models trained with data selected by TSJD. In numerical experiments with four real-world datasets, TSJD shows great versatility, achieving the best or comparable results over baseline methods in all of the experiments.
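The weighting scheme the abstract describes — estimating a density ratio between recent and historical joint samples $(x, y)$ and using it as per-sample training weights — can be illustrated with a standard classifier-based density-ratio estimator. This is a minimal sketch under assumed synthetic data, not the paper's TSJD estimator; all variable names and the drift setup are invented for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch of classifier-based joint density ratio estimation
# for soft data selection. The synthetic drift below is an assumption
# for the demo, not a detail from the paper.
rng = np.random.default_rng(0)

# Historical training data: x ~ N(0, 1), y = 2x + noise.
x_hist = rng.normal(0.0, 1.0, 500)
y_hist = 2.0 * x_hist + rng.normal(0.0, 0.3, 500)

# Recent data after a covariate drift: x ~ N(1, 1).
x_recent = rng.normal(1.0, 1.0, 200)
y_recent = 2.0 * x_recent + rng.normal(0.0, 0.3, 200)

# Work with joint samples z = (x, y); label each sample by origin:
# 0 = historical, 1 = recent.
Z = np.column_stack([np.concatenate([x_hist, x_recent]),
                     np.concatenate([y_hist, y_recent])])
s = np.concatenate([np.zeros(500), np.ones(200)])

# A probabilistic classifier gives P(recent | z), from which the
# density ratio r(z) = p_recent(z) / p_hist(z) follows as
# (n_hist / n_recent) * P(recent | z) / P(hist | z).
clf = LogisticRegression().fit(Z, s)
p = clf.predict_proba(np.column_stack([x_hist, y_hist]))[:, 1]
weights = (500 / 200) * p / (1.0 - p)

# `weights` soft-selects historical samples: samples that resemble the
# recent distribution get large weights and dominate a weighted loss.
```

A downstream model could consume these weights through, e.g., the `sample_weight` argument of a scikit-learn `fit` call, so that training emphasizes samples aligned with the recent distribution without hard-discarding the rest.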

Cite this Paper


BibTeX
@InProceedings{pmlr-v304-matsuno25a,
  title     = {Training Data Soft Selection via Joint Density Ratio Estimation},
  author    = {Matsuno, Ryuta},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  pages     = {1038--1053},
  year      = {2025},
  editor    = {Lee, Hung-yi and Liu, Tongliang},
  volume    = {304},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v304/main/assets/matsuno25a/matsuno25a.pdf},
  url       = {https://proceedings.mlr.press/v304/matsuno25a.html},
  abstract  = {This paper studies the training data selection problem, focusing on the selection of effective samples to improve model training using data affected by distributional shifts (i.e., data drifts). Existing drift-detection-based methods struggle with local drifts, while recent drift-localization-based methods lack theoretical support for the problem and are often ineffective. To tackle these issues, this paper proposes TSJD, a training data soft selection method based on joint density ratio estimation. TSJD assigns training weights (i.e., soft selects) to samples based on the estimated joint density ratio to align the selected data with the recent data distribution. By evaluating each sample independently of time, TSJD effectively addresses local data drifts. We also provide theoretical guarantees by deriving an upper bound on the generalization error for models trained with data selected by TSJD. In numerical experiments with four real-world datasets, TSJD shows great versatility, achieving the best or comparable results over baseline methods in all of the experiments.}
}
Endnote
%0 Conference Paper
%T Training Data Soft Selection via Joint Density Ratio Estimation
%A Ryuta Matsuno
%B Proceedings of the 17th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Hung-yi Lee
%E Tongliang Liu
%F pmlr-v304-matsuno25a
%I PMLR
%P 1038--1053
%U https://proceedings.mlr.press/v304/matsuno25a.html
%V 304
%X This paper studies the training data selection problem, focusing on the selection of effective samples to improve model training using data affected by distributional shifts (i.e., data drifts). Existing drift-detection-based methods struggle with local drifts, while recent drift-localization-based methods lack theoretical support for the problem and are often ineffective. To tackle these issues, this paper proposes TSJD, a training data soft selection method based on joint density ratio estimation. TSJD assigns training weights (i.e., soft selects) to samples based on the estimated joint density ratio to align the selected data with the recent data distribution. By evaluating each sample independently of time, TSJD effectively addresses local data drifts. We also provide theoretical guarantees by deriving an upper bound on the generalization error for models trained with data selected by TSJD. In numerical experiments with four real-world datasets, TSJD shows great versatility, achieving the best or comparable results over baseline methods in all of the experiments.
APA
Matsuno, R. (2025). Training Data Soft Selection via Joint Density Ratio Estimation. Proceedings of the 17th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 304:1038-1053. Available from https://proceedings.mlr.press/v304/matsuno25a.html.