Lightweight Neural Networks for Speech Emotion Recognition using Layer-wise Adaptive Quantization

Tushar Shinde, Ritika Jain, Avinash Kumar Sharma
Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, PMLR 262:584-595, 2024.

Abstract

Speech Emotion Recognition (SER) systems are essential in advancing human-machine interaction. While deep learning models have shown substantial success in SER by eliminating the need for handcrafted features, their high computational and memory requirements, alongside intensive hyper-parameter optimization, limit their deployment on resource-constrained edge devices. To address these challenges, we introduce an optimized and computationally efficient Multilayer Perceptron (MLP)-based classifier within a custom SER framework. We further propose a novel, layer-wise adaptive quantization scheme that compresses the model by adjusting bit-width precision according to layer importance. This layer importance is calculated based on statistical measures such as parameter proportion, entropy, and weight variance within each layer. Our approach achieves an optimal balance between model size reduction and performance retention, ensuring that the quantized model maintains accuracy within acceptable limits. Traditional fixed-precision methods, while computationally simple, are less effective at reducing model size without compromising performance. In contrast, our scheme provides a more interpretable and computationally efficient solution. We evaluate the proposed model on standard SER datasets using features such as Mel-Frequency Cepstral Coefficients (MFCC), Chroma, and Mel-spectrogram. Experimental results demonstrate that our adaptive quantization method achieves performance competitive with state-of-the-art models while significantly reducing model size, making it highly suitable for deployment on edge devices.
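For intuition, below is a minimal sketch of how a layer-wise importance score combining parameter proportion, weight entropy, and weight variance could drive adaptive bit-width assignment and uniform quantization. The function names, the equal-weight combination of the three statistics, the rank-based bucketing into bit-widths, and the toy layer shapes are illustrative assumptions, not the authors' exact formulation.

# Minimal sketch of layer-wise adaptive bit-width assignment driven by
# per-layer statistics (parameter proportion, entropy, weight variance).
# The importance formula and bit-width mapping below are assumptions for
# illustration, not the paper's published implementation.
import numpy as np

def layer_importance(weights, total_params, n_bins=256):
    """Combine parameter proportion, weight entropy, and weight variance
    into a single importance score for one layer (assumed equal weighting)."""
    w = weights.ravel()
    proportion = w.size / total_params                    # share of all parameters
    hist, _ = np.histogram(w, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p)) / np.log2(n_bins)   # normalized to [0, 1]
    variance = np.var(w)
    return (proportion + entropy + variance) / 3.0

def assign_bit_widths(layers, bit_choices=(2, 4, 8)):
    """Map each layer's importance to a bit-width: more important layers
    keep higher precision."""
    total = sum(w.size for w in layers)
    scores = np.array([layer_importance(w, total) for w in layers])
    # Rank-normalize scores and bucket them into the available bit-widths.
    ranks = scores.argsort().argsort() / max(len(layers) - 1, 1)
    idx = np.clip((ranks * len(bit_choices)).astype(int), 0, len(bit_choices) - 1)
    return [bit_choices[i] for i in idx]

def quantize_uniform(weights, n_bits):
    """Symmetric uniform quantization of a weight tensor to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.max(np.abs(weights))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale  # dequantized weights, for evaluating accuracy retention

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy MLP weight matrices standing in for a trained SER classifier.
    layers = [rng.normal(0, 0.10, (180, 256)),
              rng.normal(0, 0.05, (256, 128)),
              rng.normal(0, 0.02, (128, 8))]
    bits = assign_bit_widths(layers)
    quantized = [quantize_uniform(w, b) for w, b in zip(layers, bits)]
    print("assigned bit-widths:", bits)

In this sketch, layers judged more important by the combined statistic retain higher precision (e.g., 8 bits), while less important layers are compressed more aggressively (e.g., 2 bits), which is the general idea behind trading model size against accuracy retention described in the abstract.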

Cite this Paper


BibTeX
@InProceedings{pmlr-v262-shinde24a,
  title     = {Lightweight Neural Networks for Speech Emotion Recognition using Layer-wise Adaptive Quantization},
  author    = {Shinde, Tushar and Jain, Ritika and Kumar Sharma, Avinash},
  booktitle = {Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop},
  pages     = {584--595},
  year      = {2024},
  editor    = {Rezagholizadeh, Mehdi and Passban, Peyman and Samiee, Soheila and Partovi Nia, Vahid and Cheng, Yu and Deng, Yue and Liu, Qun and Chen, Boxing},
  volume    = {262},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v262/main/assets/shinde24a/shinde24a.pdf},
  url       = {https://proceedings.mlr.press/v262/shinde24a.html}
}
Endnote
%0 Conference Paper
%T Lightweight Neural Networks for Speech Emotion Recognition using Layer-wise Adaptive Quantization
%A Tushar Shinde
%A Ritika Jain
%A Avinash Kumar Sharma
%B Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop
%C Proceedings of Machine Learning Research
%D 2024
%E Mehdi Rezagholizadeh
%E Peyman Passban
%E Soheila Samiee
%E Vahid Partovi Nia
%E Yu Cheng
%E Yue Deng
%E Qun Liu
%E Boxing Chen
%F pmlr-v262-shinde24a
%I PMLR
%P 584--595
%U https://proceedings.mlr.press/v262/shinde24a.html
%V 262
APA
Shinde, T., Jain, R. & Kumar Sharma, A. (2024). Lightweight Neural Networks for Speech Emotion Recognition using Layer-wise Adaptive Quantization. Proceedings of The 4th NeurIPS Efficient Natural Language and Speech Processing Workshop, in Proceedings of Machine Learning Research 262:584-595. Available from https://proceedings.mlr.press/v262/shinde24a.html.