Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46162-46177, 2025.

Abstract

We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
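The closed-form one-dimensional optimal transport step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' s-OTDD implementation: it projects feature vectors only, using ordinary random linear directions, and omits the Moment Transform Projection for labels entirely. The function names (`wasserstein_1d`, `random_direction`, `sliced_wasserstein`), the equal-sample-size restriction, and the choice to average the p-th power of the distance before taking the root are all assumptions made for illustration.

```python
import math
import random


def wasserstein_1d(xs, ys, p=2):
    """Closed-form W_p between two equal-size 1-D empirical distributions.

    In one dimension the optimal coupling matches sorted samples in order,
    so the cost is dominated by the O(n log n) sort.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


def random_direction(dim, rng):
    """Draw a unit vector by normalising a standard Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, num_projections=100, p=2, seed=0):
    """Monte Carlo estimate of a sliced Wasserstein distance on features only.

    Hypothetical helper: plain linear projections stand in for the paper's
    data point projection, which also accounts for labels.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(num_projections):
        theta = random_direction(dim, rng)
        # Project every point onto the random direction (a 1-D distribution).
        px = [sum(t * c for t, c in zip(theta, x)) for x in X]
        py = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d(px, py, p) ** p
    return (total / num_projections) ** (1.0 / p)
```

Each projection costs time linear in the feature dimension plus a sort over the data points, which is where the (near-)linear scaling described in the abstract comes from; the number of classes never enters this sketch because labels are ignored here.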

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-nguyen25g,
  title = {Lightspeed Geometric Dataset Distance via Sliced Optimal Transport},
  author = {Nguyen, Khai and Nguyen, Hai and Pham, Tuan and Ho, Nhat},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {46162--46177},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nguyen25g/nguyen25g.pdf},
  url = {https://proceedings.mlr.press/v267/nguyen25g.html},
  abstract = {We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.}
}
Endnote
%0 Conference Paper
%T Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
%A Khai Nguyen
%A Hai Nguyen
%A Tuan Pham
%A Nhat Ho
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-nguyen25g
%I PMLR
%P 46162--46177
%U https://proceedings.mlr.press/v267/nguyen25g.html
%V 267
%X We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
APA
Nguyen, K., Nguyen, H., Pham, T. & Ho, N. (2025). Lightspeed Geometric Dataset Distance via Sliced Optimal Transport. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:46162-46177. Available from https://proceedings.mlr.press/v267/nguyen25g.html.
