GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP

Sarthak Malik, Aditi Rai, Ram Ganesh V, Himank Sehgal, Akshay Sethi, Aakarsh Malhotra
Proceedings of the Third Learning on Graphs Conference, PMLR 269:20:1-20:15, 2025.

Abstract

Graph Transformers (GTs) like NAGphormer have shown impressive performance by encoding a graph’s structural information along with node features. However, their self-attention and complex architectures require substantial computation and memory, hindering deployment. We therefore propose a novel framework called Graph Transformer Distillation to Multi-Layer Perceptron (GraTeD-MLP). GraTeD-MLP leverages knowledge distillation (KD) and a novel decomposition of attentional representation to distill the learned representations from the teacher GT to a student MLP. During distillation, we use a gated MLP architecture in which two branches learn the decomposed attentional representation of a node while the third predicts node embeddings. Encoding the attentional representation mitigates the MLP’s over-reliance on node features, enabling robust performance even in inductive settings. Empirical results demonstrate that GraTeD-MLP achieves significantly faster inference than the teacher GT, with speed-ups ranging from 20× to 40×, and up to 25% improved performance over a vanilla MLP. Furthermore, we empirically show that GraTeD-MLP outperforms other GNN distillation methods on seven datasets in both inductive and transductive settings.
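For intuition, below is a minimal, hypothetical PyTorch sketch of the gated three-branch student described in the abstract: two branches model the decomposed attentional representation of a node, a third predicts the node embedding, and a gate mixes them before classification. The layer sizes, the softmax gating mechanism, and all hyperparameters are assumptions for illustration only, not the authors' implementation; in training, the branch outputs would additionally be matched to distillation targets from the teacher GT.

import torch
import torch.nn as nn

class GatedStudentMLP(nn.Module):
    """Hypothetical three-branch gated student MLP (sketch, not the paper's code)."""

    def __init__(self, in_dim: int, hid_dim: int, num_classes: int):
        super().__init__()
        # Two branches intended to learn the decomposed attentional representation.
        self.attn_branch_a = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))
        self.attn_branch_b = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))
        # Third branch predicts the node embedding.
        self.embed_branch = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))
        # Assumed gate: per-node softmax weights over the three branch outputs.
        self.gate = nn.Sequential(nn.Linear(3 * hid_dim, 3), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(hid_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attn_branch_a(x)
        b = self.attn_branch_b(x)
        e = self.embed_branch(x)
        w = self.gate(torch.cat([a, b, e], dim=-1))        # (N, 3) mixing weights
        h = w[:, 0:1] * a + w[:, 1:2] * b + w[:, 2:3] * e  # gated combination
        return self.classifier(h)                          # class logits per node

# Usage sketch: 32 nodes, 128-dim features, 7 classes (all values illustrative).
logits = GatedStudentMLP(in_dim=128, hid_dim=64, num_classes=7)(torch.randn(32, 128))

Because the student consumes only node features at inference time (no neighborhood aggregation or self-attention), this kind of architecture is what enables the reported 20×-40× inference speed-up over the teacher GT.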

Cite this Paper


BibTeX
@InProceedings{pmlr-v269-malik25a,
  title     = {GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP},
  author    = {Malik, Sarthak and Rai, Aditi and V, Ram Ganesh and Sehgal, Himank and Sethi, Akshay and Malhotra, Aakarsh},
  booktitle = {Proceedings of the Third Learning on Graphs Conference},
  pages     = {20:1--20:15},
  year      = {2025},
  editor    = {Wolf, Guy and Krishnaswamy, Smita},
  volume    = {269},
  series    = {Proceedings of Machine Learning Research},
  month     = {26--29 Nov},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v269/main/assets/malik25a/malik25a.pdf},
  url       = {https://proceedings.mlr.press/v269/malik25a.html},
  abstract  = {Graph Transformers (GTs) like NAGphormer have shown impressive performance by encoding a graph's structural information along with node features. However, their self-attention and complex architectures require substantial computation and memory, hindering deployment. We therefore propose a novel framework called Graph Transformer Distillation to Multi-Layer Perceptron (GraTeD-MLP). GraTeD-MLP leverages knowledge distillation (KD) and a novel decomposition of attentional representation to distill the learned representations from the teacher GT to a student MLP. During distillation, we use a gated MLP architecture in which two branches learn the decomposed attentional representation of a node while the third predicts node embeddings. Encoding the attentional representation mitigates the MLP's over-reliance on node features, enabling robust performance even in inductive settings. Empirical results demonstrate that GraTeD-MLP achieves significantly faster inference than the teacher GT, with speed-ups ranging from 20\texttimes{} to 40\texttimes{}, and up to 25% improved performance over a vanilla MLP. Furthermore, we empirically show that GraTeD-MLP outperforms other GNN distillation methods on seven datasets in both inductive and transductive settings.}
}
Endnote
%0 Conference Paper
%T GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP
%A Sarthak Malik
%A Aditi Rai
%A Ram Ganesh V
%A Himank Sehgal
%A Akshay Sethi
%A Aakarsh Malhotra
%B Proceedings of the Third Learning on Graphs Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Guy Wolf
%E Smita Krishnaswamy
%F pmlr-v269-malik25a
%I PMLR
%P 20:1--20:15
%U https://proceedings.mlr.press/v269/malik25a.html
%V 269
%X Graph Transformers (GTs) like NAGphormer have shown impressive performance by encoding a graph’s structural information along with node features. However, their self-attention and complex architectures require substantial computation and memory, hindering deployment. We therefore propose a novel framework called Graph Transformer Distillation to Multi-Layer Perceptron (GraTeD-MLP). GraTeD-MLP leverages knowledge distillation (KD) and a novel decomposition of attentional representation to distill the learned representations from the teacher GT to a student MLP. During distillation, we use a gated MLP architecture in which two branches learn the decomposed attentional representation of a node while the third predicts node embeddings. Encoding the attentional representation mitigates the MLP’s over-reliance on node features, enabling robust performance even in inductive settings. Empirical results demonstrate that GraTeD-MLP achieves significantly faster inference than the teacher GT, with speed-ups ranging from 20× to 40×, and up to 25% improved performance over a vanilla MLP. Furthermore, we empirically show that GraTeD-MLP outperforms other GNN distillation methods on seven datasets in both inductive and transductive settings.
APA
Malik, S., Rai, A., V, R.G., Sehgal, H., Sethi, A. & Malhotra, A. (2025). GraTeD-MLP: Efficient Node Classification via Graph Transformer Distillation to MLP. Proceedings of the Third Learning on Graphs Conference, in Proceedings of Machine Learning Research 269:20:1-20:15. Available from https://proceedings.mlr.press/v269/malik25a.html.
