A two-stream convolution architecture for ESC based on audio feature distanglement

Zhenghao Chang; Ruhan He; Yongsheng Yu; Zili Zhang; GeLi Bai

Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:153-168, 2023.

Abstract

Environmental sound classification (ESC) is an active area of audio classification research that has made significant progress in recent years. Current mainstream ESC methods increase the dimensionality of the extracted audio features and therefore draw on the two-dimensional convolution methods used in image processing. However, two-dimensional convolution is expensive to train, and the complexity of the corresponding models is usually very high. In response to these issues, we propose a novel two-stream neural network model based on the idea of disentanglement, which uses one-dimensional convolution for feature extraction to disentangle the audio features into the time and frequency domains separately. Our approach strikes a good balance between computational cost and classification accuracy. Its accuracy on the UrbanSound8K and ESC-10 datasets is 98.51% and 97.50%, respectively, which exceeds that of most models, while its model complexity is also lower.
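The two-stream idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the abstract gives no layer sizes, kernels, or fusion scheme, so the kernels, the ReLU, the average pooling, and the concatenation step below are all assumptions chosen for illustration. The only point shown is that one stream runs a one-dimensional convolution along the time axis and the other along the frequency axis of the same feature matrix.

```python
from typing import List

# Spectrogram-like features: rows are frequency bins, columns are time frames.
Matrix = List[List[float]]


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode 1D cross-correlation followed by ReLU (assumed activation)."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        acc = sum(signal[start + i] * kernel[i] for i in range(k))
        out.append(max(acc, 0.0))
    return out


def transpose(m: Matrix) -> Matrix:
    """Swap the frequency and time axes."""
    return [list(col) for col in zip(*m)]


def stream(features: Matrix, kernel: List[float]) -> List[float]:
    """Convolve each row with the kernel, then average-pool each row to one value."""
    pooled = []
    for row in features:
        activations = conv1d(row, kernel)
        pooled.append(sum(activations) / len(activations))
    return pooled


def two_stream_embedding(
    spec: Matrix, time_kernel: List[float], freq_kernel: List[float]
) -> List[float]:
    """Hypothetical fusion: concatenate the outputs of the two streams."""
    # Time stream: 1D convolution along time, one value per frequency bin.
    time_feats = stream(spec, time_kernel)
    # Frequency stream: 1D convolution along frequency, one value per time frame.
    freq_feats = stream(transpose(spec), freq_kernel)
    return time_feats + freq_feats


if __name__ == "__main__":
    # Made-up 3 x 4 input (3 frequency bins, 4 time frames), not real audio.
    spec = [
        [1.0, 2.0, 3.0, 4.0],
        [0.0, 1.0, 0.0, 1.0],
        [2.0, 2.0, 2.0, 2.0],
    ]
    print(two_stream_embedding(spec, [1.0, -1.0], [1.0, 1.0]))
```

Each stream costs on the order of (rows x columns x kernel length) multiply-adds, which gives a rough sense of why one-dimensional kernels can be cheaper than two-dimensional ones; a trained model would learn many such kernels per stream and feed the fused vector to a classifier.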

Cite this Paper

BibTeX

@InProceedings{pmlr-v189-chang23a,
  title = 	 {A two-stream convolution architecture for ESC based on audio feature distanglement},
  author =       {Chang, Zhenghao and He, Ruhan and Yu, Yongsheng and Zhang, Zili and Bai, GeLi},
  booktitle = 	 {Proceedings of The 14th Asian Conference on Machine Learning},
  pages = 	 {153--168},
  year = 	 {2023},
  editor = 	 {Khan, Emtiyaz and Gonen, Mehmet},
  volume = 	 {189},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {12--14 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v189/chang23a/chang23a.pdf},
  url = 	 {https://proceedings.mlr.press/v189/chang23a.html},
  abstract = 	 { ESC (Environmental Sound Classification) is an
 active area of research in the field of audio
 classification that has made significant progress in
 recent years. The current mainstream ESC methods are
 based on increasing the dimension of the extracted
 audio features and therefore draw on the
 two-dimensional convolution methods used in image
 processing. However, two-dimensional convolution is
 expensive to train and the complexity of the
 corresponding model is usually very high. In
 response to these issues, we propose a novel
 two-stream neural network model based on the idea of
 disentanglement, which uses one-dimensional
 convolution for feature extraction to disentangle
 the audio features into the time and frequency
 domains separately. Our approach balances
 computational pressure with classification accuracy
 well. The accuracy of our approach on the UrbanSound8K
 and ESC-10 datasets was 98.51% and 97.50%,
 respectively, which exceeds that of most
 models. Meanwhile, the model complexity is also
 lower.}
}

Endnote

%0 Conference Paper
%T A two-stream convolution architecture for ESC based on audio feature distanglement
%A Zhenghao Chang
%A Ruhan He
%A Yongsheng Yu
%A Zili Zhang
%A GeLi Bai
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-chang23a
%I PMLR
%P 153--168
%U https://proceedings.mlr.press/v189/chang23a.html
%V 189
%X  ESC (Environmental Sound Classification) is an
 active area of research in the field of audio
 classification that has made significant progress in
 recent years. The current mainstream ESC methods are
 based on increasing the dimension of the extracted
 audio features and therefore draw on the
 two-dimensional convolution methods used in image
 processing. However, two-dimensional convolution is
 expensive to train and the complexity of the
 corresponding model is usually very high. In
 response to these issues, we propose a novel
 two-stream neural network model based on the idea of
 disentanglement, which uses one-dimensional
 convolution for feature extraction to disentangle
 the audio features into the time and frequency
 domains separately. Our approach balances
 computational pressure with classification accuracy
 well. The accuracy of our approach on the UrbanSound8K
 and ESC-10 datasets was 98.51% and 97.50%,
 respectively, which exceeds that of most
 models. Meanwhile, the model complexity is also
 lower.

APA

Chang, Z., He, R., Yu, Y., Zhang, Z. & Bai, G. (2023). A two-stream convolution architecture for ESC based on audio feature distanglement. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:153-168. Available from https://proceedings.mlr.press/v189/chang23a.html.
