Is Merging Worth It? Securely Evaluating the Information Gain for Causal Dataset Acquisition

Jake Fawkes, Lucile Ter-Minassian, Desi R. Ivanova, Uri Shalit, Christopher C. Holmes
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:1423-1431, 2025.

Abstract

Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging as the value of a merge depends not only on reduction in epistemic uncertainty but also on improvement in overlap. To address this challenge, we introduce the first \emph{cryptographically secure} information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the \emph{Expected Information Gain} (EIG) using multi-party computation to ensure that no raw data is revealed. We further demonstrate that our approach can be combined with differential privacy (DP) to meet arbitrary privacy requirements whilst preserving more accurate computation compared to DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. Code is publicly available: \url{https://github.com/LucileTerminassian/causal_prospective_merge}.
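
To make the idea of ranking candidate merges by EIG concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the authors' method. It assumes a toy model in which each (covariate group, treatment arm) stratum has an independent Gaussian mean outcome with known noise variance, so the EIG has a closed form. Everything is computed in the clear, with no multi-party computation or differential privacy, and all names (`posterior_variance`, `expected_information_gain`, the strata and counts) are hypothetical.

```python
import math


def posterior_variance(prior_var, noise_var, n_obs):
    """Posterior variance of a Gaussian mean after n_obs noisy observations."""
    return 1.0 / (1.0 / prior_var + n_obs / noise_var)


def expected_information_gain(host_counts, candidate_counts,
                              prior_var=1.0, noise_var=1.0):
    """EIG (in nats) about the stratum means from merging in a candidate dataset.

    Assumption (toy model, not from the paper): strata are independent, so
    the EIG is a sum of per-stratum terms 0.5 * log(1 + m * v / noise_var),
    where v is the host's current posterior variance for that stratum and
    m is the number of candidate rows falling in it.
    """
    eig = 0.0
    for stratum, m in candidate_counts.items():
        v = posterior_variance(prior_var, noise_var, host_counts.get(stratum, 0))
        eig += 0.5 * math.log1p(m * v / noise_var)
    return eig


# Hypothetical host data: no treated units in the "old" group (poor overlap).
host = {
    ("young", "treated"): 50,
    ("young", "control"): 50,
    ("old", "control"): 50,
    ("old", "treated"): 0,
}

# Candidate A repeats strata the host already covers well.
candidate_a = {("young", "treated"): 20, ("young", "control"): 20}
# Candidate B adds treated units in the group where the host has none.
candidate_b = {("old", "treated"): 20, ("old", "control"): 20}

eig_a = expected_information_gain(host, candidate_a)
eig_b = expected_information_gain(host, candidate_b)
print(f"EIG of merging with A: {eig_a:.3f} nats")
print(f"EIG of merging with B: {eig_b:.3f} nats")
```

In this toy setting candidate B scores higher because it fills the stratum where the host has no treated units, which loosely mirrors the abstract's point that the value of a merge depends on improvement in overlap as well as on reduced epistemic uncertainty.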

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-fawkes25a,
  title = {Is Merging Worth It? Securely Evaluating the Information Gain for Causal Dataset Acquisition},
  author = {Fawkes, Jake and Ter-Minassian, Lucile and Ivanova, Desi R. and Shalit, Uri and Holmes, Christopher C.},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {1423--1431},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/fawkes25a/fawkes25a.pdf},
  url = {https://proceedings.mlr.press/v258/fawkes25a.html},
  abstract = {Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging as the value of a merge depends not only on reduction in epistemic uncertainty but also on improvement in overlap. To address this challenge, we introduce the first \emph{cryptographically secure} information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the \emph{Expected Information Gain} (EIG) using multi-party computation to ensure that no raw data is revealed. We further demonstrate that our approach can be combined with differential privacy (DP) to meet arbitrary privacy requirements whilst preserving more accurate computation compared to DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. Code is publicly available: \url{https://github.com/LucileTerminassian/causal_prospective_merge}.}
}
Endnote
%0 Conference Paper
%T Is Merging Worth It? Securely Evaluating the Information Gain for Causal Dataset Acquisition
%A Jake Fawkes
%A Lucile Ter-Minassian
%A Desi R. Ivanova
%A Uri Shalit
%A Christopher C. Holmes
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-fawkes25a
%I PMLR
%P 1423--1431
%U https://proceedings.mlr.press/v258/fawkes25a.html
%V 258
%X Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging as the value of a merge depends not only on reduction in epistemic uncertainty but also on improvement in overlap. To address this challenge, we introduce the first \emph{cryptographically secure} information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the \emph{Expected Information Gain} (EIG) using multi-party computation to ensure that no raw data is revealed. We further demonstrate that our approach can be combined with differential privacy (DP) to meet arbitrary privacy requirements whilst preserving more accurate computation compared to DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. Code is publicly available: \url{https://github.com/LucileTerminassian/causal_prospective_merge}.
APA
Fawkes, J., Ter-Minassian, L., Ivanova, D. R., Shalit, U. & Holmes, C. C. (2025). Is Merging Worth It? Securely Evaluating the Information Gain for Causal Dataset Acquisition. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:1423-1431. Available from https://proceedings.mlr.press/v258/fawkes25a.html.
