TAROT: Targeted Data Selection via Optimal Transport

Lan Feng, Fan Nie, Yuejiang Liu, Alexandre Alahi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:16837-16852, 2025.

Abstract

We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available at: https://github.com/vita-epfl/TAROT.
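The two ingredients named in the abstract, a whitened feature distance and an optimal transport distance between candidate and target data, can be illustrated with a minimal sketch. The abstract does not specify how features are extracted or which OT solver is used, so everything here is an assumption: the features are generic NumPy arrays, the whitening is a ZCA-style transform fit on the pooled data, and the transport cost comes from a plain Sinkhorn iteration with uniform weights. The function names (`fit_whitener`, `whitened_sq_distances`, `sinkhorn`) are hypothetical and are not taken from the TAROT code base.

```python
import numpy as np


def fit_whitener(features, eps=1e-6):
    """Return (mean, W) such that (x - mean) @ W has ~identity covariance."""
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale every principal direction to unit variance so that no
    # dominant component controls the distance.
    scale = 1.0 / np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    whitening = (eigvecs * scale) @ eigvecs.T
    return mean, whitening


def whitened_sq_distances(a, b, mean, whitening):
    """Pairwise squared Euclidean distances between whitened rows of a and b."""
    wa = (a - mean) @ whitening
    wb = (b - mean) @ whitening
    diff = wa[:, None, :] - wb[None, :, :]
    return (diff ** 2).sum(axis=-1)


def sinkhorn(cost, reg=0.02, n_iters=500):
    """Entropic OT between uniform marginals; returns (transport cost, plan).

    `reg` is relative to the largest cost entry (assumed positive), which
    keeps exp() from underflowing to zero.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / (reg * cost.max()))
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


# Illustrative usage on synthetic features (assumption: the first
# dimension dominates the raw scale, mimicking a dominant component).
rng = np.random.default_rng(0)
raw_scale = np.array([10.0, 1.0, 1.0, 1.0])
target = rng.normal(size=(50, 4)) * raw_scale
candidates = rng.normal(size=(80, 4)) * raw_scale
mean, whitening = fit_whitener(np.vstack([target, candidates]))
cost = whitened_sq_distances(candidates, target, mean, whitening)
ot_cost, plan = sinkhorn(cost)
print(f"entropic OT cost: {ot_cost:.3f}")
```

A lower transport cost indicates that a candidate set covers the target distribution more closely in the whitened space. Comparing this cost across candidate subsets is one simple way to rank them, though it is not necessarily the selection procedure used in the paper.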

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-feng25l,
  title = {{TAROT}: Targeted Data Selection via Optimal Transport},
  author = {Feng, Lan and Nie, Fan and Liu, Yuejiang and Alahi, Alexandre},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {16837--16852},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/feng25l/feng25l.pdf},
  url = {https://proceedings.mlr.press/v267/feng25l.html},
  abstract = {We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available at: https://github.com/vita-epfl/TAROT.}
}
Endnote
%0 Conference Paper
%T TAROT: Targeted Data Selection via Optimal Transport
%A Lan Feng
%A Fan Nie
%A Yuejiang Liu
%A Alexandre Alahi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-feng25l
%I PMLR
%P 16837--16852
%U https://proceedings.mlr.press/v267/feng25l.html
%V 267
%X We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available at: https://github.com/vita-epfl/TAROT.
APA
Feng, L., Nie, F., Liu, Y., & Alahi, A. (2025). TAROT: Targeted Data Selection via Optimal Transport. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:16837-16852. Available from https://proceedings.mlr.press/v267/feng25l.html.

Related Material