$\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery

Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Martin Roth, Steffen Sonntag
Proceedings of the Third Conference on Causal Learning and Reasoning, PMLR 236:609-642, 2024.

Abstract

Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real and complex data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To tackle these challenges, we introduce $\texttt{causalAssembly}$, a semisynthetic data generator designed to facilitate the benchmarking of causal discovery methods. The tool is built using a complex real-world dataset comprised of measurements collected along an assembly line in a manufacturing setting. For these measurements, we establish a partial set of ground truth causal relationships through a detailed study of the physics underlying the processes carried out in the assembly line. The partial ground truth is sufficiently informative to allow for estimation of a full causal graph by mere nonparametric regression. To overcome potential confounding and privacy concerns, we use distributional random forests to estimate and represent conditional distributions implied by the ground truth causal graph. These conditionals are combined into a joint distribution that strictly adheres to a causal model over the observed variables. Sampling from this distribution, $\texttt{causalAssembly}$ generates data that are guaranteed to be Markovian with respect to the ground truth. Using our tool, we showcase how to benchmark several well-known causal discovery algorithms.
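The sampling step described in the abstract (draw each variable from its conditional distribution given its parents in the ground-truth graph) can be illustrated with a minimal sketch of ancestral sampling. Everything below is an assumption made for illustration: the three-node graph, the variable names, and the linear-Gaussian `sample_node` rule, which stands in for the distributional random forests used in the paper. It is not the `causalAssembly` API.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG (not from the paper): node -> list of its parents.
PARENTS = {
    "force": [],
    "position": ["force"],
    "torque": ["force", "position"],
}


def sample_node(parent_values, rng):
    """Placeholder conditional sampler: sum of parent values plus Gaussian noise.

    The paper estimates these conditionals with distributional random
    forests; this simple rule only stands in for them.
    """
    return sum(parent_values) + rng.gauss(0.0, 1.0)


def ancestral_sample(parents, n, seed=0):
    """Draw n joint samples by visiting nodes in topological order."""
    rng = random.Random(seed)
    # static_order() yields each node only after all of its parents.
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(n):
        row = {}
        for node in order:
            row[node] = sample_node([row[p] for p in parents[node]], rng)
        rows.append(row)
    return order, rows


order, rows = ancestral_sample(PARENTS, n=5)
```

Because every value is drawn only from its parents' values, the resulting joint distribution factorizes along the graph, which is the sense in which such samples are Markovian with respect to it.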

Cite this Paper


BibTeX
@InProceedings{pmlr-v236-gobler24a,
  title = {$\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery},
  author = {G\"obler, Konstantin and Windisch, Tobias and Drton, Mathias and Pychynski, Tim and Roth, Martin and Sonntag, Steffen},
  booktitle = {Proceedings of the Third Conference on Causal Learning and Reasoning},
  pages = {609--642},
  year = {2024},
  editor = {Locatello, Francesco and Didelez, Vanessa},
  volume = {236},
  series = {Proceedings of Machine Learning Research},
  month = {01--03 Apr},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v236/gobler24a/gobler24a.pdf},
  url = {https://proceedings.mlr.press/v236/gobler24a.html},
  abstract = {Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real and complex data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To tackle these challenges, we introduce $\texttt{causalAssembly}$, a semisynthetic data generator designed to facilitate the benchmarking of causal discovery methods. The tool is built using a complex real-world dataset comprised of measurements collected along an assembly line in a manufacturing setting. For these measurements, we establish a partial set of ground truth causal relationships through a detailed study of the physics underlying the processes carried out in the assembly line. The partial ground truth is sufficiently informative to allow for estimation of a full causal graph by mere nonparametric regression. To overcome potential confounding and privacy concerns, we use distributional random forests to estimate and represent conditional distributions implied by the ground truth causal graph. These conditionals are combined into a joint distribution that strictly adheres to a causal model over the observed variables. Sampling from this distribution, $\texttt{causalAssembly}$ generates data that are guaranteed to be Markovian with respect to the ground truth. Using our tool, we showcase how to benchmark several well-known causal discovery algorithms.}
}
Endnote
%0 Conference Paper
%T $\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery
%A Konstantin Göbler
%A Tobias Windisch
%A Mathias Drton
%A Tim Pychynski
%A Martin Roth
%A Steffen Sonntag
%B Proceedings of the Third Conference on Causal Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2024
%E Francesco Locatello
%E Vanessa Didelez
%F pmlr-v236-gobler24a
%I PMLR
%P 609--642
%U https://proceedings.mlr.press/v236/gobler24a.html
%V 236
%X Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real and complex data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To tackle these challenges, we introduce $\texttt{causalAssembly}$, a semisynthetic data generator designed to facilitate the benchmarking of causal discovery methods. The tool is built using a complex real-world dataset comprised of measurements collected along an assembly line in a manufacturing setting. For these measurements, we establish a partial set of ground truth causal relationships through a detailed study of the physics underlying the processes carried out in the assembly line. The partial ground truth is sufficiently informative to allow for estimation of a full causal graph by mere nonparametric regression. To overcome potential confounding and privacy concerns, we use distributional random forests to estimate and represent conditional distributions implied by the ground truth causal graph. These conditionals are combined into a joint distribution that strictly adheres to a causal model over the observed variables. Sampling from this distribution, $\texttt{causalAssembly}$ generates data that are guaranteed to be Markovian with respect to the ground truth. Using our tool, we showcase how to benchmark several well-known causal discovery algorithms.
APA
Göbler, K., Windisch, T., Drton, M., Pychynski, T., Roth, M., & Sonntag, S. (2024). $\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery. Proceedings of the Third Conference on Causal Learning and Reasoning, in Proceedings of Machine Learning Research 236:609-642. Available from https://proceedings.mlr.press/v236/gobler24a.html.
