The Case for Learned Provenance-based System Behavior Baseline

Yao Zhu; Zhenyuan Li; Yangyang Wei; Shouling Ji

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:79903-79917, 2025.

Abstract

Provenance graphs describe the data flows and causal dependencies of host activities, making it possible to track how data propagates and is manipulated throughout a system, and thereby provide a foundation for intrusion detection. However, Provenance-based Intrusion Detection Systems (PIDSes) face significant challenges in storage, representation, and analysis, which impede the efficacy of machine learning models such as Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper presents a novel learning-based anomaly detection method designed to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, producing compact embeddings that handle out-of-vocabulary (OOV) elements and adapt to normality shifts in dynamic real-world environments. We then incorporate this refined baseline into a tag-propagation framework for real-time detection. Our evaluation demonstrates the method's accuracy and adaptability in anomaly path mining, advancing the state of the art in handling and analyzing provenance graphs for anomaly detection.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-zhu25k,
  title = 	 {The Case for Learned Provenance-based System Behavior Baseline},
  author =       {Zhu, Yao and Li, Zhenyuan and Wei, Yangyang and Ji, Shouling},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {79903--79917},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhu25k/zhu25k.pdf},
  url = 	 {https://proceedings.mlr.press/v267/zhu25k.html},
  abstract = 	 {Provenance graphs describe data flows and causal dependencies of host activities, enabling to track the data propagation and manipulation throughout the systems, which provide a foundation for intrusion detection. However, these Provenance-based Intrusion Detection Systems (PIDSes) face significant challenges in storage, representation, and analysis, which impede the efficacy of machine learning models such as Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper presents a novel learning-based anomaly detection method designed to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, facilitating compact embeddings that effectively address out-of-vocabulary (OOV) elements and adapt to normality shifts in dynamic real-world environments. Subsequently, we incorporate this refined baseline into a tag-propagation framework for real-time detection. Our evaluation demonstrates the method’s accuracy and adaptability in anomaly path mining, significantly advancing the state-of-the-art in handling and analyzing provenance graphs for anomaly detection.}
}

Endnote

%0 Conference Paper
%T The Case for Learned Provenance-based System Behavior Baseline
%A Yao Zhu
%A Zhenyuan Li
%A Yangyang Wei
%A Shouling Ji
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhu25k
%I PMLR
%P 79903--79917
%U https://proceedings.mlr.press/v267/zhu25k.html
%V 267
%X Provenance graphs describe data flows and causal dependencies of host activities, enabling to track the data propagation and manipulation throughout the systems, which provide a foundation for intrusion detection. However, these Provenance-based Intrusion Detection Systems (PIDSes) face significant challenges in storage, representation, and analysis, which impede the efficacy of machine learning models such as Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper presents a novel learning-based anomaly detection method designed to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, facilitating compact embeddings that effectively address out-of-vocabulary (OOV) elements and adapt to normality shifts in dynamic real-world environments. Subsequently, we incorporate this refined baseline into a tag-propagation framework for real-time detection. Our evaluation demonstrates the method’s accuracy and adaptability in anomaly path mining, significantly advancing the state-of-the-art in handling and analyzing provenance graphs for anomaly detection.

APA

Zhu, Y., Li, Z., Wei, Y. & Ji, S. (2025). The Case for Learned Provenance-based System Behavior Baseline. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:79903-79917. Available from https://proceedings.mlr.press/v267/zhu25k.html.
