A two-stream convolution architecture for ESC based on audio feature disentanglement
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:153-168, 2023.
Abstract
Environmental Sound Classification (ESC) is an
active area of research in the field of audio
classification that has made significant progress in
recent years. The current mainstream ESC methods are
based on increasing the dimension of the extracted
audio features and therefore draw on the
two-dimensional convolution methods used in image
processing. However, two-dimensional convolution is
expensive to train and the complexity of the
corresponding model is usually very high. In
response to these issues, we propose a novel
two-stream neural network model based on the idea of
disentanglement, which uses one-dimensional
convolution for feature extraction to disentangle
the audio features into the time and frequency
domains separately. Our approach balances
computational cost against classification accuracy
well. The accuracy of our approach on the UrbanSound8K
and ESC-10 datasets was 98.51% and 97.50%,
respectively, which exceeds that of most
models. Meanwhile, the model complexity is also
lower.
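
The two-stream idea can be illustrated with a minimal sketch, which is not the authors' model. It assumes the input is a small spectrogram (rows are frequency bins, columns are time frames) and uses fixed, hand-picked kernels in place of learned ones. One stream slides a one-dimensional kernel along the time axis and the other along the frequency axis, and each stream is pooled into a single feature. All function names (`conv1d_valid`, `time_stream`, `freq_stream`, `two_stream_features`) are hypothetical.

```python
from typing import List

# Spectrogram layout assumed here: rows = frequency bins, columns = time frames.
Matrix = List[List[float]]


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """1D cross-correlation without padding ("valid" mode)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def time_stream(spec: Matrix, kernel: List[float]) -> Matrix:
    """Slide the kernel along the time axis of every frequency bin."""
    return [conv1d_valid(row, kernel) for row in spec]


def freq_stream(spec: Matrix, kernel: List[float]) -> Matrix:
    """Slide the kernel along the frequency axis of every time frame."""
    columns = [list(col) for col in zip(*spec)]  # transpose: one list per frame
    return [conv1d_valid(col, kernel) for col in columns]


def global_mean(feature_map: Matrix) -> float:
    """Global average pooling over a whole feature map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def two_stream_features(
    spec: Matrix, time_kernel: List[float], freq_kernel: List[float]
) -> List[float]:
    """Concatenate one pooled feature per stream (time first, then frequency)."""
    return [
        global_mean(time_stream(spec, time_kernel)),
        global_mean(freq_stream(spec, freq_kernel)),
    ]


# Toy spectrogram: values grow by 1 per time frame and by 10 per frequency bin.
spec = [
    [1.0, 2.0, 3.0, 4.0],
    [11.0, 12.0, 13.0, 14.0],
    [21.0, 22.0, 23.0, 24.0],
]
diff_kernel = [-1.0, 1.0]  # hand-picked difference filter, not a learned weight
features = two_stream_features(spec, diff_kernel, diff_kernel)
print(features)  # [1.0, 10.0]
```

On this toy input the time stream responds only to the change across frames (1.0) and the frequency stream only to the change across bins (10.0), which is the kind of separation the disentanglement idea aims for. A real model would learn many kernels per stream and feed the pooled features to a classifier.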