Missing Data Imputation using Optimal Transport

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:7130-7140, 2020.

Abstract

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.
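The criterion in the abstract (two batches drawn at random from the same dataset should be close in optimal transport distance) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes an entropy-regularized OT cost computed by log-domain Sinkhorn iterations and debiased into a Sinkhorn divergence, and the function names, the regularization strength `eps`, and the iteration count `n_iter` are all illustrative choices. It only evaluates the batch-to-batch loss; an imputation method would additionally minimize it with respect to the imputed entries.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def entropic_ot(x, y, eps=1.0, n_iter=200):
    """Entropy-regularized OT cost between two uniformly weighted batches.

    Uses the half squared Euclidean distance as ground cost and returns the
    dual objective after n_iter log-domain Sinkhorn updates (illustrative).
    """
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1) / 2.0
    log_a = np.full(n, -np.log(n))  # uniform weights on batch x
    log_b = np.full(m, -np.log(m))  # uniform weights on batch y
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    return f.mean() + g.mean()


def sinkhorn_divergence(x, y, eps=1.0, n_iter=200):
    # Debiased loss: vanishes when both batches are identical.
    return (
        entropic_ot(x, y, eps, n_iter)
        - 0.5 * entropic_ot(x, x, eps, n_iter)
        - 0.5 * entropic_ot(y, y, eps, n_iter)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_1 = rng.normal(size=(64, 2))
    batch_2 = rng.normal(size=(64, 2))             # same distribution
    batch_3 = rng.normal(loc=3.0, size=(64, 2))    # shifted distribution
    print("same distribution:   ", sinkhorn_divergence(batch_1, batch_2))
    print("shifted distribution:", sinkhorn_divergence(batch_1, batch_3))
```

Under these assumptions, two batches from the same distribution should yield a much smaller value than a batch paired with a shifted one, which is the property that makes the quantity usable as an imputation loss.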

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-muzellec20a,
  title     = {Missing Data Imputation using Optimal Transport},
  author    = {Muzellec, Boris and Josse, Julie and Boyer, Claire and Cuturi, Marco},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {7130--7140},
  year      = {2020},
  editor    = {Daumé III, Hal and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/muzellec20a/muzellec20a.pdf},
  url       = {https://proceedings.mlr.press/v119/muzellec20a.html},
  abstract  = {Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.}
}
Endnote
%0 Conference Paper
%T Missing Data Imputation using Optimal Transport
%A Boris Muzellec
%A Julie Josse
%A Claire Boyer
%A Marco Cuturi
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-muzellec20a
%I PMLR
%P 7130--7140
%U https://proceedings.mlr.press/v119/muzellec20a.html
%V 119
%X Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.
APA
Muzellec, B., Josse, J., Boyer, C., & Cuturi, M. (2020). Missing Data Imputation using Optimal Transport. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:7130-7140. Available from https://proceedings.mlr.press/v119/muzellec20a.html.
