Train multi-modal LLM to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt

Jeremy Wong, Muhammad Huzaifah, Hardik Sailor, Shuo Sun, Kye Min Tan, Bin Wang, Qiongqiong Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:61-77, 2026.

Abstract

A Large Language Model (LLM) can be extended for input audio, by expressing the audio embeddings as an interpretable prompt to the LLM. The adaptor that computes these audio embeddings is often trained using multi-modal Instruction Fine-Tuning (IFT) data. It is labour intensive to scale up the creation of such data to many datasets, tasks, and audio information types. The labour can be reduced with Knowledge Distillation (KD), by prompting an external teacher LLM with meta-information from the audio dataset, and using the teacher’s output as a reference to train a student that is now prompted with the audio. Prior KD work has only used a few datasets, and does not present experiment comparisons against fair choices of IFT models. This paper scales up KD to a larger collection of datasets, comprising a wider variety of meta-information types. Fair experiment comparisons on Dynamic-SUPERB and AudioBench show that KD and IFT are complementary, but KD alone may not outperform IFT.
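
To make the distillation setup described in the abstract concrete, the sketch below shows one possible training step: a text-only teacher is prompted with meta-information from the audio dataset to produce a reference answer, and an audio adaptor is trained so that the student, prompted with the audio embeddings instead, reproduces that reference. This is only an illustrative assumption of the pipeline; the module names, dimensions, toy tensors, and the stand-in teacher function are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes; the paper's actual audio encoder, adaptor, and LLM are not specified here.
AUDIO_DIM, LLM_DIM, VOCAB = 512, 1024, 32000


class AudioAdaptor(nn.Module):
    """Projects frame-level audio-encoder features into the LLM embedding space,
    so the audio can be expressed as a (soft) prompt to the LLM."""

    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> prompt embeddings (batch, frames, llm_dim)
        return self.proj(audio_feats)


def teacher_reference(meta_information: dict) -> str:
    """Stand-in for prompting an external text-only teacher LLM with dataset
    meta-information (e.g. an emotion label) and returning its free-text output."""
    return f"The speaker sounds {meta_information['emotion']}."


# --- one toy distillation step (random tensors in place of a real encoder and LLM) ---
adaptor = AudioAdaptor(AUDIO_DIM, LLM_DIM)
student_lm_head = nn.Linear(LLM_DIM, VOCAB)      # toy stand-in for the (frozen) student LLM
optimizer = torch.optim.AdamW(adaptor.parameters(), lr=1e-4)

audio_feats = torch.randn(2, 50, AUDIO_DIM)      # pretend audio-encoder output for 2 utterances
teacher_text = teacher_reference({"emotion": "happy"})
target_ids = torch.randint(0, VOCAB, (2, 50))    # pretend tokenisation of teacher_text

soft_prompt = adaptor(audio_feats)               # audio expressed as a prompt to the student LLM
logits = student_lm_head(soft_prompt)            # student predicts the teacher's reference tokens
loss = F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))
loss.backward()
optimizer.step()
```

Only the adaptor receives gradients in this sketch; the teacher is never needed at inference time, which is what removes the manual labelling effort that IFT data creation would otherwise require.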

Cite this Paper


BibTeX
@InProceedings{pmlr-v312-wong26a,
  title     = {Train multi-modal {LLM} to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt},
  author    = {Wong, Jeremy and Huzaifah, Muhammad and Sailor, Hardik and Sun, Shuo and Tan, Kye Min and Wang, Bin and Wang, Qiongqiong and Zhang, Wenyu and Zou, Xunlong and Chen, Nancy F. and Aw, Ai Ti},
  booktitle = {Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)},
  pages     = {61--77},
  year      = {2026},
  editor    = {Komatsu, Tatsuya and Imoto, Keisuke and Gao, Xiaoxue and Ono, Nobutaka and Chen, Nancy F.},
  volume    = {312},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v312/main/assets/wong26a/wong26a.pdf},
  url       = {https://proceedings.mlr.press/v312/wong26a.html},
  abstract  = {A Large Language Model (LLM) can be extended for input audio, by expressing the audio embeddings as an interpretable prompt to the LLM. The adaptor that computes these audio embeddings is often trained using multi-modal Instruction Fine-Tuning (IFT) data. It is labour intensive to scale up the creation of such data to many datasets, tasks, and audio information types. The labour can be reduced with Knowledge Distillation (KD), by prompting an external teacher LLM with meta-information from the audio dataset, and using the teacher’s output as a reference to train a student that is now prompted with the audio. Prior KD work has only used a few datasets, and does not present experiment comparisons against fair choices of IFT models. This paper scales up KD to a larger collection of datasets, comprising a wider variety of meta-information types. Fair experiment comparisons on Dynamic-SUPERB and AudioBench show that KD and IFT are complementary, but KD alone may not outperform IFT.}
}
Endnote
%0 Conference Paper
%T Train multi-modal LLM to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt
%A Jeremy Wong
%A Muhammad Huzaifah
%A Hardik Sailor
%A Shuo Sun
%A Kye Min Tan
%A Bin Wang
%A Qiongqiong Wang
%A Wenyu Zhang
%A Xunlong Zou
%A Nancy F. Chen
%A Ai Ti Aw
%B Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)
%C Proceedings of Machine Learning Research
%D 2026
%E Tatsuya Komatsu
%E Keisuke Imoto
%E Xiaoxue Gao
%E Nobutaka Ono
%E Nancy F. Chen
%F pmlr-v312-wong26a
%I PMLR
%P 61--77
%U https://proceedings.mlr.press/v312/wong26a.html
%V 312
%X A Large Language Model (LLM) can be extended for input audio, by expressing the audio embeddings as an interpretable prompt to the LLM. The adaptor that computes these audio embeddings is often trained using multi-modal Instruction Fine-Tuning (IFT) data. It is labour intensive to scale up the creation of such data to many datasets, tasks, and audio information types. The labour can be reduced with Knowledge Distillation (KD), by prompting an external teacher LLM with meta-information from the audio dataset, and using the teacher’s output as a reference to train a student that is now prompted with the audio. Prior KD work has only used a few datasets, and does not present experiment comparisons against fair choices of IFT models. This paper scales up KD to a larger collection of datasets, comprising a wider variety of meta-information types. Fair experiment comparisons on Dynamic-SUPERB and AudioBench show that KD and IFT are complementary, but KD alone may not outperform IFT.
APA
Wong, J., Huzaifah, M., Sailor, H., Sun, S., Tan, K.M., Wang, B., Wang, Q., Zhang, W., Zou, X., Chen, N.F. & Aw, A.T. (2026). Train multi-modal LLM to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt. Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), in Proceedings of Machine Learning Research 312:61-77. Available from https://proceedings.mlr.press/v312/wong26a.html.