Quantifying Knowledge Distillation using Partial Information Decomposition

Pasan Dissanayake, Faisal Hamman, Barproda Halder, Ilia Sucholutsky, Qiuyi Zhang, Sanghamitra Dutta
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:4474-4482, 2025.

Abstract

Knowledge distillation enables the deployment of complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain both the knowledge transferred and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the redundant information about the task shared between the teacher and the student. We propose a novel multi-level optimization that incorporates redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers because it quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
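
For orientation, the standard Partial Information Decomposition identity is sketched below; this is the generic PID framework rather than the paper's specific construction, with Y denoting the task label and T and S standing in for the teacher and student representations:

\[
I(Y; T, S) \;=\; \mathrm{Red}(Y; T, S) \;+\; \mathrm{Unq}(Y; T \setminus S) \;+\; \mathrm{Unq}(Y; S \setminus T) \;+\; \mathrm{Syn}(Y; T, S)
\]

The joint mutual information splits into redundant information available from either model, unique information held only by the teacher or only by the student, and synergistic information available only from both together. The abstract's central claim is that the redundant term captures the task-relevant knowledge already transferred to the student.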

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-dissanayake25a,
  title     = {Quantifying Knowledge Distillation using Partial Information Decomposition},
  author    = {Dissanayake, Pasan and Hamman, Faisal and Halder, Barproda and Sucholutsky, Ilia and Zhang, Qiuyi and Dutta, Sanghamitra},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {4474--4482},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/dissanayake25a/dissanayake25a.pdf},
  url       = {https://proceedings.mlr.press/v258/dissanayake25a.html},
  abstract  = {Knowledge distillation deploys complex machine learning models in resource-constrained environments by training a smaller student model to emulate internal representations of a complex teacher model. However, the teacher’s representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.}
}
Endnote
%0 Conference Paper
%T Quantifying Knowledge Distillation using Partial Information Decomposition
%A Pasan Dissanayake
%A Faisal Hamman
%A Barproda Halder
%A Ilia Sucholutsky
%A Qiuyi Zhang
%A Sanghamitra Dutta
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-dissanayake25a
%I PMLR
%P 4474--4482
%U https://proceedings.mlr.press/v258/dissanayake25a.html
%V 258
%X Knowledge distillation deploys complex machine learning models in resource-constrained environments by training a smaller student model to emulate internal representations of a complex teacher model. However, the teacher’s representations can also encode nuisance or additional information not relevant to the downstream task. Distilling such irrelevant information can actually impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain the transferred knowledge and knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers as it succinctly quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
APA
Dissanayake, P., Hamman, F., Halder, B., Sucholutsky, I., Zhang, Q., & Dutta, S. (2025). Quantifying Knowledge Distillation using Partial Information Decomposition. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:4474-4482. Available from https://proceedings.mlr.press/v258/dissanayake25a.html.