Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery

Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge
Proceedings of the Fourth Conference on Causal Learning and Reasoning, PMLR 275:1506-1531, 2025.

Abstract

Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables’ variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.
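The var-sortability artifact described above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' generation method: it assumes a linear chain X0 -> X1 -> X2 -> X3 with unit-variance Gaussian noise, and the edge weights, sample size, and function names are hypothetical choices made for illustration. With every edge weight of magnitude at least 1, each variable's variance is the parent's variance scaled by the squared weight plus the noise variance, so the variances grow along the causal order.

```python
import numpy as np


def simulate_chain(weights, n_samples=20000, seed=0):
    """Sample a linear chain X0 -> X1 -> ... with unit-variance Gaussian noise.

    `weights[i]` is the (hypothetical) coefficient on the edge X_i -> X_{i+1}.
    """
    rng = np.random.default_rng(seed)
    n_vars = len(weights) + 1
    data = np.empty((n_samples, n_vars))
    data[:, 0] = rng.standard_normal(n_samples)
    for i, w in enumerate(weights, start=1):
        # Each node is its parent times the edge weight plus fresh noise.
        data[:, i] = w * data[:, i - 1] + rng.standard_normal(n_samples)
    return data


def population_variances(weights):
    """Closed-form variances for the chain: Var(X_i) = w^2 * Var(X_{i-1}) + 1."""
    variances = [1.0]
    for w in weights:
        variances.append(w**2 * variances[-1] + 1.0)
    return variances


# Illustrative weights only; all have magnitude >= 1, so variance must grow.
weights = [1.5, -1.2, 2.0]
data = simulate_chain(weights)
sample_var = data.var(axis=0)
pop_var = population_variances(weights)

# Sorting variables by sample variance recovers the causal order here.
order_by_variance = np.argsort(sample_var)

# Standardizing after sampling equalizes the marginal variances.
standardized = data / data.std(axis=0)

print("population variances:", pop_var)
print("sample variances:    ", sample_var)
print("order by variance:   ", order_by_variance)
```

In this toy setting a method that simply sorts variables by variance would identify the causal order without any causal reasoning, which is the kind of shortcut the abstract warns about. Standardizing after sampling removes the variance signal, but the abstract notes that R2-sortability is a separate artifact; this sketch does not address it.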

Cite this Paper


BibTeX
@InProceedings{pmlr-v275-herman25a,
  title = {Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery},
  author = {Herman, Rebecca J. and Wahl, Jonas and Ninad, Urmi and Runge, Jakob},
  booktitle = {Proceedings of the Fourth Conference on Causal Learning and Reasoning},
  pages = {1506--1531},
  year = {2025},
  editor = {Huang, Biwei and Drton, Mathias},
  volume = {275},
  series = {Proceedings of Machine Learning Research},
  month = {07--09 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v275/main/assets/herman25a/herman25a.pdf},
  url = {https://proceedings.mlr.press/v275/herman25a.html},
  abstract = {Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables’ variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.}
}
Endnote
%0 Conference Paper
%T Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery
%A Rebecca J. Herman
%A Jonas Wahl
%A Urmi Ninad
%A Jakob Runge
%B Proceedings of the Fourth Conference on Causal Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2025
%E Biwei Huang
%E Mathias Drton
%F pmlr-v275-herman25a
%I PMLR
%P 1506--1531
%U https://proceedings.mlr.press/v275/herman25a.html
%V 275
%X Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables’ variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.
APA
Herman, R.J., Wahl, J., Ninad, U. & Runge, J. (2025). Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery. Proceedings of the Fourth Conference on Causal Learning and Reasoning, in Proceedings of Machine Learning Research 275:1506-1531. Available from https://proceedings.mlr.press/v275/herman25a.html.
