Train multi-modal LLM to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:61-77, 2026.
Abstract
A Large Language Model (LLM) can be extended to accept audio input by expressing audio embeddings as an interpretable prompt to the LLM. The adaptor that computes these audio embeddings is often trained using multi-modal Instruction Fine-Tuning (IFT) data. It is labour-intensive to scale up the creation of such data to many datasets, tasks, and audio information types. This labour can be reduced with Knowledge Distillation (KD): an external teacher LLM is prompted with meta-information from the audio dataset, and the teacher's output serves as a reference to train a student that is instead prompted with the audio. Prior KD work has used only a few datasets and does not present fair experimental comparisons against IFT models. This paper scales up KD to a larger collection of datasets, comprising a wider variety of meta-information types. Fair experimental comparisons on Dynamic-SUPERB and AudioBench show that KD and IFT are complementary, but KD alone may not outperform IFT.
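The KD data pipeline described above can be sketched in a few lines: the text-only teacher never hears the audio, so the dataset's meta-information is rendered as text in the teacher's prompt, and the teacher's answer becomes the training target for the audio-prompted student. This is a minimal illustration; the function and field names (`build_teacher_prompt`, `emotion`, `transcript`) and the toy teacher are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of meta-information prompting for KD.
# All names and prompt formats here are illustrative assumptions.

def build_teacher_prompt(meta: dict, instruction: str) -> str:
    """Render audio meta-information as text so a text-only teacher LLM
    can answer an audio question without hearing the audio."""
    facts = "; ".join(f"{k}: {v}" for k, v in sorted(meta.items()))
    return f"[Audio meta-information] {facts}\n[Instruction] {instruction}"

def make_kd_example(meta: dict, instruction: str, teacher_generate):
    """Pair the audio-prompted student input with the teacher's text output,
    which serves as the reference (target) during student training."""
    target = teacher_generate(build_teacher_prompt(meta, instruction))
    # The student sees the raw audio (via the adaptor) plus the instruction,
    # and is trained to reproduce the teacher's text response.
    return {"student_input": ("<audio>", instruction), "target": target}

# Toy stand-in for the external teacher LLM (an assumption for this sketch).
fake_teacher = lambda p: "The speaker sounds happy." if "emotion: happy" in p else "Unclear."

example = make_kd_example(
    {"emotion": "happy", "transcript": "great to see you"},
    "Describe the speaker's emotional state.",
    fake_teacher,
)
```

In practice the teacher would be a large external LLM and the student loss would be standard next-token cross-entropy on `target`, but the pairing logic is the same.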