Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

Vikramjit Mitra, Amrit Romana, Dung Tran, Erdrin Azemi
Proceedings on "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" at ICLR 2025 Workshops, PMLR 296:26-36, 2025.

Abstract

Spontaneous speech emotion datasets usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce label uncertainty due to variation in grader opinion. Grader variation is typically addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as the target, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observe that focusing on overall test-set performance can be deceiving, as it fails to reveal a model's generalization capacity across speakers and gender. We demonstrate that evaluation across multiple test sets, together with performance analysis across gender and speakers, is useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation: instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
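The two modeling ideas in the abstract are compact enough to sketch. Below is a minimal, illustrative Python sketch (not the authors' code): soft_target builds the distribution of grader votes to use as a training target in place of a hard consensus label, and kbest_correct scores a prediction as correct if the reference emotion appears among the model's k most probable classes (the 2-/3-best evaluation the abstract mentions). The label set and all function names are hypothetical.

```python
# Illustrative sketch of soft (distributional) emotion targets vs. hard
# consensus labels, and k-best evaluation. Label set and names are
# assumptions for the example, not the paper's actual setup.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set

def soft_target(votes, num_classes=len(EMOTIONS), smoothing=1e-6):
    """Turn per-grader votes (class indices) into a probability
    distribution over emotions, usable as a cross-entropy/KL target."""
    counts = np.bincount(votes, minlength=num_classes).astype(float)
    counts += smoothing               # keep every class strictly positive
    return counts / counts.sum()

def consensus_target(votes, num_classes=len(EMOTIONS)):
    """Conventional hard label: one-hot on the most-voted emotion."""
    one_hot = np.zeros(num_classes)
    one_hot[np.bincount(votes, minlength=num_classes).argmax()] = 1.0
    return one_hot

def kbest_correct(pred_probs, target, k=2):
    """k-best evaluation: the prediction counts as correct if the
    reference emotion is among the model's k most probable classes."""
    topk = np.argsort(pred_probs)[::-1][:k]
    return target in topk

# Three graders label one utterance: happy, happy, neutral.
votes = [1, 1, 2]
print(soft_target(votes))        # ~[0.00, 0.67, 0.33, 0.00]
print(consensus_target(votes))   # [0., 1., 0., 0.]
consensus = int(consensus_target(votes).argmax())   # -> 1 ("happy")
pred = np.array([0.05, 0.30, 0.60, 0.05])           # model favors "neutral"
print(kbest_correct(pred, consensus, k=1))  # False under 1-best
print(kbest_correct(pred, consensus, k=2))  # True under 2-best
```

Training against soft_target with a KL-divergence or soft-label cross-entropy loss lets the model learn grader ambiguity directly, whereas consensus_target discards it; the k-best check similarly credits a model whose second hypothesis matches a plausible grader opinion.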

Cite this Paper


BibTeX
@InProceedings{pmlr-v296-mitra25a,
  title     = {Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions},
  author    = {Mitra, Vikramjit and Romana, Amrit and Tran, Dung and Azemi, Erdrin},
  booktitle = {Proceedings on "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" at ICLR 2025 Workshops},
  pages     = {26--36},
  year      = {2025},
  editor    = {Blaas, Arno and D’Costa, Priya and Feng, Fan and Kriegler, Andreas and Mason, Ian and Pan, Zhaoying and Uelwer, Tobias and Williams, Jennifer and Xie, Yubin and Yang, Rui},
  volume    = {296},
  series    = {Proceedings of Machine Learning Research},
  month     = {28 Apr},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v296/main/assets/mitra25a/mitra25a.pdf},
  url       = {https://proceedings.mlr.press/v296/mitra25a.html}
}
Endnote
%0 Conference Paper
%T Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions
%A Vikramjit Mitra
%A Amrit Romana
%A Dung Tran
%A Erdrin Azemi
%B Proceedings on "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" at ICLR 2025 Workshops
%C Proceedings of Machine Learning Research
%D 2025
%E Arno Blaas
%E Priya D’Costa
%E Fan Feng
%E Andreas Kriegler
%E Ian Mason
%E Zhaoying Pan
%E Tobias Uelwer
%E Jennifer Williams
%E Yubin Xie
%E Rui Yang
%F pmlr-v296-mitra25a
%I PMLR
%P 26--36
%U https://proceedings.mlr.press/v296/mitra25a.html
%V 296
APA
Mitra, V., Romana, A., Tran, D. & Azemi, E. (2025). Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions. Proceedings on "I Can't Believe It's Not Better: Challenges in Applied Deep Learning" at ICLR 2025 Workshops, in Proceedings of Machine Learning Research 296:26-36. Available from https://proceedings.mlr.press/v296/mitra25a.html.
