Training and Cross-Validating Machine Learning Pipelines with Limited Memory

Martin Hirzel; Kiran Kate; Louis Mandel; Avraham Shinnar

Proceedings of the Third International Conference on Automated Machine Learning, PMLR 256:13/1-25, 2024.

Abstract

While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.
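
The monoid idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation or API: the names `Stats`, `lift`, and `fit_in_batches` are hypothetical, and the example assumes a simple operator that only needs the mean and variance of one numeric column. Each batch is summarized independently, and the summaries are merged with an associative operation that has an identity element, so only one batch needs to be in memory at a time.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Stats:
    """Hypothetical monoid element: sufficient statistics for mean and variance.

    The default value Stats() is the identity element of `combine`.
    """
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def combine(self, other: "Stats") -> "Stats":
        # Associative merge: field-wise addition of the partial sums.
        return Stats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self) -> float:
        # Only meaningful once at least one value has been seen (count > 0).
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Population variance computed from the accumulated sums.
        return self.total_sq / self.count - self.mean ** 2


def lift(batch: Sequence[float]) -> Stats:
    """Summarize a single batch; the batch can be discarded afterwards."""
    return Stats(len(batch), sum(batch), sum(x * x for x in batch))


def fit_in_batches(batches: Iterable[Sequence[float]]) -> Stats:
    """Fold the per-batch summaries together, one batch at a time."""
    return reduce(Stats.combine, (lift(b) for b in batches), Stats())


# Illustrative data only: three small batches of one numeric column.
batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
fitted = fit_in_batches(batches)
```

Because `combine` is associative, the batches could equally be grouped differently, for example merged pairwise or per cross-validation fold, and the fitted statistics would be the same.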

Cite this Paper

BibTeX


@InProceedings{pmlr-v256-hirzel24a,
  title     = {Training and Cross-Validating Machine Learning Pipelines with Limited Memory},
  author    = {Hirzel, Martin and Kate, Kiran and Mandel, Louis and Shinnar, Avraham},
  booktitle = {Proceedings of the Third International Conference on Automated Machine Learning},
  pages     = {13/1--25},
  year      = {2024},
  editor    = {Eggensperger, Katharina and Garnett, Roman and Vanschoren, Joaquin and Lindauer, Marius and Gardner, Jacob R.},
  volume    = {256},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Sep},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v256/main/assets/hirzel24a/hirzel24a.pdf},
  url       = {https://proceedings.mlr.press/v256/hirzel24a.html},
  abstract  = {While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.}
}

Endnote

%0 Conference Paper
%T Training and Cross-Validating Machine Learning Pipelines with Limited Memory
%A Martin Hirzel
%A Kiran Kate
%A Louis Mandel
%A Avraham Shinnar
%B Proceedings of the Third International Conference on Automated Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Katharina Eggensperger
%E Roman Garnett
%E Joaquin Vanschoren
%E Marius Lindauer
%E Jacob R. Gardner
%F pmlr-v256-hirzel24a
%I PMLR
%P 13/1--25
%U https://proceedings.mlr.press/v256/hirzel24a.html
%V 256
%X While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources.  Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting.  This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, which are associative, thus providing a flexible way for handling large data one batch at a time.  To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators.  Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation.  Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.

APA


Hirzel, M., Kate, K., Mandel, L., & Shinnar, A. (2024). Training and Cross-Validating Machine Learning Pipelines with Limited Memory. Proceedings of the Third International Conference on Automated Machine Learning, in Proceedings of Machine Learning Research 256:13/1-25. Available from https://proceedings.mlr.press/v256/hirzel24a.html.
