DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models

Changyi He, Yifu Ding, Jinyang Guo, Ruihao Gong, Haotong Qin, Xianglong Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22379-22391, 2025.

Abstract

Although knowledge distillation (KD) is an effective approach for improving the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples and therefore spend unnecessary effort distilling easy samples, which leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of its samples. We further observe that existing KD losses perform poorly when most samples in the distillation dataset are difficult, owing to unstable optimization and the neglect of hard samples. We therefore also propose a new KD loss, the bidirectional discrepancy loss (BDL), for effective KD. Extensive experiments demonstrate that our DA-KD framework is both effective and efficient. Without bells and whistles, DA-KD outperforms existing state-of-the-art KD methods by 2% at half the training cost and even surpasses the teacher model under 4.7$\times$ compression.
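
As a rough illustration of the two ideas the abstract names (difficulty-aware adjustment of the distillation dataset and a bidirectional discrepancy loss), the sketch below scores each sample by the teacher-student discrepancy and keeps only the hardest fraction for further distillation. The function names, the HuggingFace-style model(...).logits interface, the keep_ratio parameter, and the forward-plus-reverse-KL stand-in for BDL are all assumptions made for illustration, not the authors' implementation; the actual BDL definition and selection schedule are given in the paper.

import torch
import torch.nn.functional as F

def token_kl(p_logits, q_logits):
    # Per-token KL(P || Q) computed from raw logits.
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return (p.exp() * (p - q)).sum(dim=-1)

def bidirectional_discrepancy(t_logits, s_logits, mask, alpha=0.5):
    # Illustrative stand-in for BDL: a weighted sum of forward and reverse KL.
    # The paper's actual BDL formulation may differ.
    fwd = token_kl(t_logits, s_logits)   # KL(teacher || student)
    rev = token_kl(s_logits, t_logits)   # KL(student || teacher)
    per_token = alpha * fwd + (1.0 - alpha) * rev
    return (per_token * mask).sum() / mask.sum()

@torch.no_grad()
def select_hard_samples(samples, teacher, student, keep_ratio=0.5, device="cuda"):
    # Difficulty-aware dataset adjustment (sketch): score each sample by the
    # teacher-student discrepancy and keep only the hardest fraction for the
    # next round of distillation.
    scores = []
    for ex in samples:  # each sample: {"input_ids": [1, L], "attention_mask": [1, L]}
        ids = ex["input_ids"].to(device)
        mask = ex["attention_mask"].to(device).float()
        t_logits = teacher(ids, attention_mask=mask).logits
        s_logits = student(ids, attention_mask=mask).logits
        scores.append(bidirectional_discrepancy(t_logits, s_logits, mask).item())
    k = max(1, int(keep_ratio * len(samples)))
    hardest = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)[:k]
    return [samples[i] for i in hardest]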

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-he25c,
  title     = {{DA}-{KD}: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models},
  author    = {He, Changyi and Ding, Yifu and Guo, Jinyang and Gong, Ruihao and Qin, Haotong and Liu, Xianglong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22379--22391},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/he25c/he25c.pdf},
  url       = {https://proceedings.mlr.press/v267/he25c.html},
  abstract  = {Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty difference among different samples, making the distillation of easy samples unnecessary. This leads to high distillation cost. In this paper, we propose difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe existing KD loss cannot perform well when most of samples are difficult in the distillation dataset because of unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2% with half training cost and even surpass the teacher model with 4.7$\times$ compression.}
}
Endnote
%0 Conference Paper
%T DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models
%A Changyi He
%A Yifu Ding
%A Jinyang Guo
%A Ruihao Gong
%A Haotong Qin
%A Xianglong Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-he25c
%I PMLR
%P 22379--22391
%U https://proceedings.mlr.press/v267/he25c.html
%V 267
%X Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty difference among different samples, making the distillation of easy samples unnecessary. This leads to high distillation cost. In this paper, we propose difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe existing KD loss cannot perform well when most of samples are difficult in the distillation dataset because of unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2% with half training cost and even surpass the teacher model with 4.7$\times$ compression.
APA
He, C., Ding, Y., Guo, J., Gong, R., Qin, H., & Liu, X. (2025). DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22379-22391. Available from https://proceedings.mlr.press/v267/he25c.html.
