Exploring One Million Machine Learning Pipelines: A Benchmarking Study
Proceedings of the Fourth International Conference on Automated Machine Learning, PMLR 293:22/1-34, 2025.
Abstract
The performance of machine learning solutions is strongly affected by the hyperparameter values of their algorithms. This has motivated a large number of recent research projects on hyperparameter tuning and the proposal of several highly diverse tuning approaches. Rather than proposing a new approach or identifying the most effective hyperparameter tuning approach, this paper looks for good machine learning solutions by exploring machine learning pipelines. To this end, it benchmarks pipelines, focusing on the interaction between feature preprocessing techniques and classification models. The study evaluates the effectiveness of pipeline combinations, identifying both high-performing and underperforming ones. Additionally, it provides meta-knowledge datasets free of optimization selection bias to foster research contributions in meta-learning, accelerating the development of meta-models. The findings provide insights into the most effective preprocessing and modeling combinations, guiding practitioners and researchers in their selection processes.
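To make the notion of a benchmarked pipeline concrete, the following is a minimal illustrative sketch (not the paper's actual experimental setup): one feature preprocessing technique chained with one classification model and evaluated by cross-validation, using scikit-learn. The dataset, preprocessing step, and classifier here are placeholders chosen for illustration only.

```python
# Illustrative sketch only: a single (preprocessing, classifier) pipeline
# combination, evaluated with 5-fold cross-validation. The paper benchmarks
# many such combinations; this shows just one.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # placeholder dataset

pipe = Pipeline([
    ("preprocess", StandardScaler()),              # feature preprocessing step
    ("model", LogisticRegression(max_iter=1000)),  # classification model
])

scores = cross_val_score(pipe, X, y, cv=5)
mean_acc = scores.mean()
print(f"mean CV accuracy: {mean_acc:.3f}")
```

Swapping the `preprocess` or `model` step for other techniques yields the grid of pipeline combinations whose interactions the study benchmarks.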