Enhancing Logits Distillation with Plug&Play Kendall’s $\tau$ Ranking Loss

Yuchen Guan, Runxi Cheng, Kang Liu, Chun Yuan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:20537-20552, 2025.

Abstract

Knowledge distillation typically minimizes the Kullback–Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence can be challenging for the student and often leads to sub-optimal solutions. We further show that the gradients induced by the KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance improvements achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We demonstrate that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of the KL divergence, making it a natural complement rather than a replacement. Extensive experiments on the CIFAR-100, ImageNet, and COCO datasets, across a variety of CNN and ViT teacher–student architecture combinations, show that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.
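To make the idea concrete, the sketch below illustrates one way such an auxiliary term could be combined with standard logit distillation: a differentiable surrogate of Kendall’s $\tau$ built from soft pairwise comparisons of class logits (via tanh) is added to the usual temperature-scaled KL loss. This is only an assumed minimal implementation for illustration; the function names, the tanh relaxation, and the hyperparameters alpha, T, and lambda_rank are not taken from the paper.

import torch
import torch.nn.functional as F

def soft_kendall_tau_loss(student_logits, teacher_logits, alpha=1.0):
    """Differentiable surrogate of Kendall's tau between student and teacher
    logits (hypothetical sketch, not the paper's exact formulation).
    Concordant class pairs push tau toward 1; returning (1 - tau) rewards
    agreement in the class ranking rather than in absolute logit values."""
    # Pairwise logit differences over the class dimension: (B, C, C)
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)
    # tanh acts as a soft sign(); alpha controls how sharply it approximates sign()
    concordance = torch.tanh(alpha * s_diff) * torch.tanh(alpha * t_diff)
    num_classes = student_logits.size(1)
    num_pairs = num_classes * (num_classes - 1)      # ordered off-diagonal pairs
    tau = concordance.sum(dim=(1, 2)) / num_pairs    # diagonal terms are zero
    return (1.0 - tau).mean()

def distillation_loss(student_logits, teacher_logits, T=4.0, lambda_rank=1.0):
    """Temperature-scaled KL distillation plus the auxiliary ranking term."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return kl + lambda_rank * soft_kendall_tau_loss(student_logits, teacher_logits)

In the hard-sign limit (large alpha), the ranking term depends only on the ordering of the logits and not on their per-channel magnitudes, which is consistent with the scale-invariance property the abstract ascribes to the Kendall’s $\tau$ loss; how the paper itself relaxes and weights the term may differ.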

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-guan25a,
  title     = {Enhancing Logits Distillation with Plug\&Play Kendall’s $\tau$ Ranking Loss},
  author    = {Guan, Yuchen and Cheng, Runxi and Liu, Kang and Yuan, Chun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {20537--20552},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/guan25a/guan25a.pdf},
  url       = {https://proceedings.mlr.press/v267/guan25a.html}
}
Endnote
%0 Conference Paper
%T Enhancing Logits Distillation with Plug&Play Kendall’s $\tau$ Ranking Loss
%A Yuchen Guan
%A Runxi Cheng
%A Kang Liu
%A Chun Yuan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-guan25a
%I PMLR
%P 20537--20552
%U https://proceedings.mlr.press/v267/guan25a.html
%V 267
APA
Guan, Y., Cheng, R., Liu, K. & Yuan, C. (2025). Enhancing Logits Distillation with Plug&Play Kendall’s $\tau$ Ranking Loss. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:20537-20552. Available from https://proceedings.mlr.press/v267/guan25a.html.