Learning anomalies from graph: predicting compute node failures on HPC clusters

Joze M. Rozanec; Roy Krumpak; Martin Molan; Andrea Bartolini

Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL), PMLR 265:213-219, 2025.

Abstract

Today, high-performance computing (HPC) systems play a crucial role in advancing artificial intelligence. Nevertheless, the estimated global data center electricity consumption in 2022 was around 1% of the final global electricity demand. Therefore, as HPC systems advance towards Exascale computing, research is required to ensure their growth is sustainable and environmentally friendly. Data from infrastructure monitoring can be leveraged to predict downtimes, ensure these are treated in time, and increase the overall system’s utilization. In this paper, we compare four machine-learning approaches, three of them based on graph embeddings, to predict compute node downtimes. The experiments were performed with data from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine learning models can accurately predict downtime, matching current state-of-the-art models.

Cite this Paper

BibTeX

@InProceedings{pmlr-v265-rozanec25a,
  title = 	 {Learning anomalies from graph: predicting compute node failures on {HPC} clusters},
  author =       {Rozanec, Joze M. and Krumpak, Roy and Molan, Martin and Bartolini, Andrea},
  booktitle = 	 {Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL)},
  pages = 	 {213--219},
  year = 	 {2025},
  editor = 	 {Lutchyn, Tetiana and Ramírez Rivera, Adín and Ricaud, Benjamin},
  volume = 	 {265},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--09 Jan},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v265/main/assets/rozanec25a/rozanec25a.pdf},
  url = 	 {https://proceedings.mlr.press/v265/rozanec25a.html},
  abstract = 	 {Today, high-performance computing (HPC) systems play a crucial role in advancing artificial intelligence. Nevertheless, the estimated global data center electricity consumption in 2022 was around 1\% of the final global electricity demand. Therefore, as HPC systems advance towards Exascale computing, research is required to ensure their growth is sustainable and environmentally friendly. Data from infrastructure monitoring can be leveraged to predict downtimes, ensure these are treated in time, and increase the overall system’s utilization. In this paper, we compare four machine-learning approaches, three of them based on graph embeddings, to predict compute node downtimes. The experiments were performed with data from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine learning models can accurately predict downtime, matching current state-of-the-art models.}
}

Endnote

%0 Conference Paper
%T Learning anomalies from graph: predicting compute node failures on HPC clusters
%A Joze M. Rozanec
%A Roy Krumpak
%A Martin Molan
%A Andrea Bartolini
%B Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL)
%C Proceedings of Machine Learning Research
%D 2025
%E Tetiana Lutchyn
%E Adín Ramírez Rivera
%E Benjamin Ricaud
%F pmlr-v265-rozanec25a
%I PMLR
%P 213--219
%U https://proceedings.mlr.press/v265/rozanec25a.html
%V 265
%X Today, high-performance computing (HPC) systems play a crucial role in advancing artificial intelligence. Nevertheless, the estimated global data center electricity consumption in 2022 was around 1% of the final global electricity demand. Therefore, as HPC systems advance towards Exascale computing, research is required to ensure their growth is sustainable and environmentally friendly. Data from infrastructure monitoring can be leveraged to predict downtimes, ensure these are treated in time, and increase the overall system’s utilization. In this paper, we compare four machine-learning approaches, three of them based on graph embeddings, to predict compute node downtimes. The experiments were performed with data from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine learning models can accurately predict downtime, matching current state-of-the-art models.

APA

Rozanec, J.M., Krumpak, R., Molan, M. & Bartolini, A. (2025). Learning anomalies from graph: predicting compute node failures on HPC clusters. Proceedings of the 6th Northern Lights Deep Learning Conference (NLDL), in Proceedings of Machine Learning Research 265:213-219. Available from https://proceedings.mlr.press/v265/rozanec25a.html.
