Training and Cross-Validating Machine Learning Pipelines with Limited Memory

Martin Hirzel, Kiran Kate, Louis Mandel, Avraham Shinnar
Proceedings of the Third International Conference on Automated Machine Learning, PMLR 256:13/1-25, 2024.

Abstract

While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.
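The key idea is that an operator's training statistics can form a monoid: a state with an associative combine operation and an identity element. Associativity means batches can be folded in any grouping, so a fit over large data reduces to a bounded-memory fold over batches. The sketch below is a hypothetical illustration (not the paper's actual API): a monoid of per-feature counts, sums, and sums of squares, from which the mean and variance needed by a standard-scaler-style operator can be recovered.

```python
# Hypothetical sketch (not the paper's implementation): a monoid for
# computing per-feature mean and variance one batch at a time.
from dataclasses import dataclass
import numpy as np

@dataclass
class MomentsMonoid:
    n: int                # number of rows folded in so far
    sum: np.ndarray       # per-feature sum
    sum_sq: np.ndarray    # per-feature sum of squares

    @staticmethod
    def identity(n_features: int) -> "MomentsMonoid":
        z = np.zeros(n_features)
        return MomentsMonoid(0, z.copy(), z.copy())

    @staticmethod
    def from_batch(X) -> "MomentsMonoid":
        X = np.asarray(X, dtype=float)
        return MomentsMonoid(X.shape[0], X.sum(axis=0), (X ** 2).sum(axis=0))

    def combine(self, other: "MomentsMonoid") -> "MomentsMonoid":
        # Associative, with identity() as the unit element.
        return MomentsMonoid(self.n + other.n,
                             self.sum + other.sum,
                             self.sum_sq + other.sum_sq)

    def mean(self) -> np.ndarray:
        return self.sum / self.n

    def var(self) -> np.ndarray:
        # Population variance: E[X^2] - E[X]^2.
        return self.sum_sq / self.n - self.mean() ** 2

# Fold batches one at a time -- peak memory is one batch plus the
# (tiny) monoid state, regardless of total data size.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
acc = MomentsMonoid.identity(3)
for batch in np.array_split(X, 10):
    acc = acc.combine(MomentsMonoid.from_batch(batch))
assert np.allclose(acc.mean(), X.mean(axis=0))
assert np.allclose(acc.var(), X.var(axis=0))
```

Because `combine` is associative, the same fold could also be grouped differently, e.g. across cross-validation folds or across workers, without changing the result; this is the property that lets the paper extend batched training from single operators to whole pipelines via task graphs.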

Cite this Paper


BibTeX
@InProceedings{pmlr-v256-hirzel24a,
  title     = {Training and Cross-Validating Machine Learning Pipelines with Limited Memory},
  author    = {Hirzel, Martin and Kate, Kiran and Mandel, Louis and Shinnar, Avraham},
  booktitle = {Proceedings of the Third International Conference on Automated Machine Learning},
  pages     = {13/1--25},
  year      = {2024},
  editor    = {Eggensperger, Katharina and Garnett, Roman and Vanschoren, Joaquin and Lindauer, Marius and Gardner, Jacob R.},
  volume    = {256},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v256/main/assets/hirzel24a/hirzel24a.pdf},
  url       = {https://proceedings.mlr.press/v256/hirzel24a.html},
  abstract  = {While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.}
}
Endnote
%0 Conference Paper
%T Training and Cross-Validating Machine Learning Pipelines with Limited Memory
%A Martin Hirzel
%A Kiran Kate
%A Louis Mandel
%A Avraham Shinnar
%B Proceedings of the Third International Conference on Automated Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Katharina Eggensperger
%E Roman Garnett
%E Joaquin Vanschoren
%E Marius Lindauer
%E Jacob R. Gardner
%F pmlr-v256-hirzel24a
%I PMLR
%P 13/1--25
%U https://proceedings.mlr.press/v256/hirzel24a.html
%V 256
%X While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.
APA
Hirzel, M., Kate, K., Mandel, L. & Shinnar, A. (2024). Training and Cross-Validating Machine Learning Pipelines with Limited Memory. Proceedings of the Third International Conference on Automated Machine Learning, in Proceedings of Machine Learning Research 256:13/1-25. Available from https://proceedings.mlr.press/v256/hirzel24a.html.