A two-stream convolution architecture for ESC based on audio feature distanglement

Zhenghao Chang, Ruhan He, Yongsheng Yu, Zili Zhang, GeLi Bai
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:153-168, 2023.

Abstract

Environmental Sound Classification (ESC) is an active area of research in audio classification that has made significant progress in recent years. Current mainstream ESC methods increase the dimensionality of the extracted audio features and therefore draw on the two-dimensional convolution methods used in image processing. However, two-dimensional convolution is expensive to train, and the complexity of the corresponding models is usually very high. In response to these issues, we propose a novel two-stream neural network model based on the idea of disentanglement, which uses one-dimensional convolution for feature extraction to disentangle the audio features into the time and frequency domains separately. Our approach strikes a good balance between computational cost and classification accuracy. The accuracy of our approach on the UrbanSound8K and ESC-10 datasets was 98.51% and 97.50%, respectively, which exceeds that of most models. Meanwhile, the model complexity is also lower.
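
As a rough illustration of the two-stream idea described in the abstract, the following minimal sketch applies a one-dimensional convolution to a toy spectrogram twice: once sliding along the time axis and once along the frequency axis, then concatenates the two outputs. It is not the authors' architecture. The function names (`conv1d`, `stream`, `two_stream_features`), the shared fixed kernel, the sum-over-channels step, and the ReLU are all assumptions chosen for illustration; the abstract does not specify layers, kernel sizes, or how the streams are fused.

```python
from typing import List

Matrix = List[List[float]]  # spec[f][t]: frequency bins (rows) by time frames (columns)


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode 1-D cross-correlation of one channel with one kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def stream(channels: Matrix, kernel: List[float]) -> List[float]:
    """One hypothetical stream: convolve each channel, sum over channels, apply ReLU."""
    per_channel = [conv1d(ch, kernel) for ch in channels]
    summed = [sum(col) for col in zip(*per_channel)]
    return [max(0.0, v) for v in summed]


def two_stream_features(spec: Matrix, kernel: List[float]) -> List[float]:
    """Concatenate a time-axis stream and a frequency-axis stream (illustrative only)."""
    # Time stream: the kernel slides along time; frequency bins act as channels.
    time_out = stream(spec, kernel)
    # Frequency stream: transpose so the kernel slides along frequency;
    # time frames act as channels.
    freq_out = stream([list(col) for col in zip(*spec)], kernel)
    return time_out + freq_out


# Toy 3-bin by 4-frame "spectrogram" with made-up values.
spec = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 1.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
features = two_stream_features(spec, kernel=[-1.0, 1.0])
print(features)
```

Each stream only ever runs a 1-D kernel over one axis, so neither needs a two-dimensional kernel; a real model would use learned kernels and several layers per stream.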

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-chang23a,
  title = {A two-stream convolution architecture for ESC based on audio feature distanglement},
  author = {Chang, Zhenghao and He, Ruhan and Yu, Yongsheng and Zhang, Zili and Bai, GeLi},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages = {153--168},
  year = {2023},
  editor = {Khan, Emtiyaz and Gonen, Mehmet},
  volume = {189},
  series = {Proceedings of Machine Learning Research},
  month = {12--14 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v189/chang23a/chang23a.pdf},
  url = {https://proceedings.mlr.press/v189/chang23a.html},
  abstract = {Environmental Sound Classification (ESC) is an active area of research in audio classification that has made significant progress in recent years. Current mainstream ESC methods increase the dimensionality of the extracted audio features and therefore draw on the two-dimensional convolution methods used in image processing. However, two-dimensional convolution is expensive to train, and the complexity of the corresponding models is usually very high. In response to these issues, we propose a novel two-stream neural network model based on the idea of disentanglement, which uses one-dimensional convolution for feature extraction to disentangle the audio features into the time and frequency domains separately. Our approach strikes a good balance between computational cost and classification accuracy. The accuracy of our approach on the UrbanSound8K and ESC-10 datasets was 98.51% and 97.50%, respectively, which exceeds that of most models. Meanwhile, the model complexity is also lower.}
}
Endnote
%0 Conference Paper
%T A two-stream convolution architecture for ESC based on audio feature distanglement
%A Zhenghao Chang
%A Ruhan He
%A Yongsheng Yu
%A Zili Zhang
%A GeLi Bai
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-chang23a
%I PMLR
%P 153--168
%U https://proceedings.mlr.press/v189/chang23a.html
%V 189
%X Environmental Sound Classification (ESC) is an active area of research in audio classification that has made significant progress in recent years. Current mainstream ESC methods increase the dimensionality of the extracted audio features and therefore draw on the two-dimensional convolution methods used in image processing. However, two-dimensional convolution is expensive to train, and the complexity of the corresponding models is usually very high. In response to these issues, we propose a novel two-stream neural network model based on the idea of disentanglement, which uses one-dimensional convolution for feature extraction to disentangle the audio features into the time and frequency domains separately. Our approach strikes a good balance between computational cost and classification accuracy. The accuracy of our approach on the UrbanSound8K and ESC-10 datasets was 98.51% and 97.50%, respectively, which exceeds that of most models. Meanwhile, the model complexity is also lower.
APA
Chang, Z., He, R., Yu, Y., Zhang, Z. & Bai, G. (2023). A two-stream convolution architecture for ESC based on audio feature distanglement. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:153-168. Available from https://proceedings.mlr.press/v189/chang23a.html.

Related Material