Semi-supervised Acoustic Scene Classification under Spatial-Temporal Variability with a CRNN-based Model
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:38-47, 2026.
Abstract
In this work, we present MobileASCNet, a lightweight CRNN-based model designed for acoustic scene classification (ASC) under spatial-temporal variability, as defined in the APSIPA ASC 2025 Grand Challenge. The model combines depthwise separable convolutions with ResNet-inspired residual blocks for efficient spatial feature extraction, and employs a gated recurrent unit (GRU) branch to capture temporal dependencies. City and time embeddings are fused into the representation to enhance context-awareness. We conduct extensive comparisons under different training strategies, including training from scratch, pretraining with fine-tuning, and feature freezing. Without relying on knowledge distillation, MobileASCNet achieves competitive classification accuracy on the development set while maintaining low model complexity.
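The "lightweight" property rests largely on depthwise separable convolutions, which factor a standard convolution into a per-channel depthwise filter followed by a 1x1 pointwise mixing step. A minimal sketch of the parameter saving, using illustrative channel sizes that are not taken from the paper:

```python
def conv2d_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1x1 pointwise conv that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative layer sizes (hypothetical, not from the paper)
c_in, c_out, k = 64, 128, 3
std = conv2d_params(c_in, c_out, k)               # 73728
sep = depthwise_separable_params(c_in, c_out, k)  # 8768
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For this layer the factorization cuts the weight count by roughly 8x, which is why such blocks are a common choice when model complexity is a design constraint.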