Mathematical Foundations of Modeling ETL Process Chains

Levin Maier, Lucas Schulze, Robert Lilow, Lukas Hahn, Niko Krasowski, Arnulf Barth, Sebastian Gaebel, Ferdi Gueran, Giovanni Wagner, Falk Borgmann, Oleg Arenz, Jan Peters
Proceedings of the Geometry, Topology, and Machine Learning Workshop, PMLR 325:69-78, 2026.

Abstract

Extract-Transform-Load (ETL) processes are core components of modern data processing infrastructures. The throughput of processed data records can be adjusted by changing the amount of allocated resources, i.e. the number of parallel processing threads for each of the three ETL phases, but also depends on stochastic variations in the per-record processing times. In chains of multiple consecutive ETL processes, the relation between allocated resources and overall throughput is further complicated, for example by the occurrence of bottlenecks affecting all subsequent ETL processes. We develop a mathematical model of ETL process chains that is accurate at the level of time-aggregated throughput and suitable for efficient simulation. The process chain is represented as a controlled discrete-time Markov process on a directed acyclic graph whose edges are individual ETL processes. We model the mean throughput as a bounded, monotone function of the number of parallel threads, to capture the diminishing benefit of allocating more threads. We furthermore introduce a Flow Balance postulate linking number of threads, mean throughput, and mean processing time. The stochastic processing times are then modeled by non-negative heavy-tailed distributions around the mean processing time. This framework provides a principled simulator for ETL networks and a foundation for learning- and control-based resource allocation.
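
As an illustration of the three modeling ingredients named in the abstract, the following minimal Python sketch may help. It is not the paper's model: the saturating curve `mean_throughput`, the Little's-law-style reading of the Flow Balance postulate (threads = mean throughput × mean processing time), the log-normal choice for the heavy-tailed distribution, and all parameter names and values (`max_rate`, `half_sat`, `sigma`) are assumptions made here for illustration only.

```python
import math
import random


def mean_throughput(threads: int, max_rate: float = 100.0, half_sat: float = 4.0) -> float:
    """Hypothetical saturating curve: increasing in threads and bounded by max_rate."""
    return max_rate * threads / (threads + half_sat)


def mean_processing_time(threads: int) -> float:
    """Assumed flow balance: threads = mean throughput * mean processing time.

    Requires threads >= 1, since the throughput is zero for zero threads.
    """
    return threads / mean_throughput(threads)


def sample_processing_time(threads: int, sigma: float = 1.0, rng=None) -> float:
    """Draw a non-negative, heavy-tailed (log-normal) time around the assumed mean."""
    rng = rng or random.Random()
    mean = mean_processing_time(threads)
    # For a log-normal variable, E[X] = exp(mu + sigma^2 / 2); solve for mu
    # so that the sampled times have the mean implied by the flow balance.
    mu = math.log(mean) - 0.5 * sigma ** 2
    return rng.lognormvariate(mu, sigma)
```

Under these assumptions, each additional thread adds less throughput than the previous one, and the sampled per-record times stay positive while occasionally producing large outliers. Any other bounded, monotone curve or heavy-tailed family could be substituted without changing the structure of the sketch.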

Cite this Paper


BibTeX
@InProceedings{pmlr-v325-maier26a,
  title = {Mathematical Foundations of Modeling ETL Process Chains},
  author = {Maier, Levin and Schulze, Lucas and Lilow, Robert and Hahn, Lukas and Krasowski, Niko and Barth, Arnulf and Gaebel, Sebastian and Gueran, Ferdi and Wagner, Giovanni and Borgmann, Falk and Arenz, Oleg and Peters, Jan},
  booktitle = {Proceedings of the Geometry, Topology, and Machine Learning Workshop},
  pages = {69--78},
  year = {2026},
  editor = {Bleher, Michael and Jensen, Freya and Maier, Levin and Taha, Diaaeldin and Wienhard, Anna},
  volume = {325},
  series = {Proceedings of Machine Learning Research},
  month = {10--14 Nov},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v325/main/assets/maier26a/maier26a.pdf},
  url = {https://proceedings.mlr.press/v325/maier26a.html},
  abstract = {Extract-Transform-Load (ETL) processes are core components of modern data processing infrastructures. The throughput of processed data records can be adjusted by changing the amount of allocated resources, i.e. the number of parallel processing threads for each of the three ETL phases, but also depends on stochastic variations in the per-record processing times. In chains of multiple consecutive ETL processes, the relation between allocated resources and overall throughput is further complicated, for example by the occurrence of bottlenecks affecting all subsequent ETL processes. We develop a mathematical model of ETL process chains that is accurate at the level of time-aggregated throughput and suitable for efficient simulation. The process chain is represented as a controlled discrete-time Markov process on a directed acyclic graph whose edges are individual ETL processes. We model the mean throughput as a bounded, monotone function of the number of parallel threads, to capture the diminishing benefit of allocating more threads. We furthermore introduce a Flow Balance postulate linking number of threads, mean throughput, and mean processing time. The stochastic processing times are then modeled by non-negative heavy-tailed distributions around the mean processing time. This framework provides a principled simulator for ETL networks and a foundation for learning- and control-based resource allocation.}
}
Endnote
%0 Conference Paper
%T Mathematical Foundations of Modeling ETL Process Chains
%A Levin Maier
%A Lucas Schulze
%A Robert Lilow
%A Lukas Hahn
%A Niko Krasowski
%A Arnulf Barth
%A Sebastian Gaebel
%A Ferdi Gueran
%A Giovanni Wagner
%A Falk Borgmann
%A Oleg Arenz
%A Jan Peters
%B Proceedings of the Geometry, Topology, and Machine Learning Workshop
%C Proceedings of Machine Learning Research
%D 2026
%E Michael Bleher
%E Freya Jensen
%E Levin Maier
%E Diaaeldin Taha
%E Anna Wienhard
%F pmlr-v325-maier26a
%I PMLR
%P 69--78
%U https://proceedings.mlr.press/v325/maier26a.html
%V 325
%X Extract-Transform-Load (ETL) processes are core components of modern data processing infrastructures. The throughput of processed data records can be adjusted by changing the amount of allocated resources, i.e. the number of parallel processing threads for each of the three ETL phases, but also depends on stochastic variations in the per-record processing times. In chains of multiple consecutive ETL processes, the relation between allocated resources and overall throughput is further complicated, for example by the occurrence of bottlenecks affecting all subsequent ETL processes. We develop a mathematical model of ETL process chains that is accurate at the level of time-aggregated throughput and suitable for efficient simulation. The process chain is represented as a controlled discrete-time Markov process on a directed acyclic graph whose edges are individual ETL processes. We model the mean throughput as a bounded, monotone function of the number of parallel threads, to capture the diminishing benefit of allocating more threads. We furthermore introduce a Flow Balance postulate linking number of threads, mean throughput, and mean processing time. The stochastic processing times are then modeled by non-negative heavy-tailed distributions around the mean processing time. This framework provides a principled simulator for ETL networks and a foundation for learning- and control-based resource allocation.
APA
Maier, L., Schulze, L., Lilow, R., Hahn, L., Krasowski, N., Barth, A., Gaebel, S., Gueran, F., Wagner, G., Borgmann, F., Arenz, O. & Peters, J. (2026). Mathematical Foundations of Modeling ETL Process Chains. Proceedings of the Geometry, Topology, and Machine Learning Workshop, in Proceedings of Machine Learning Research 325:69-78. Available from https://proceedings.mlr.press/v325/maier26a.html.

Related Material