Tuning In: Comparative Analysis of Audio Classifier Performance in Clinical Settings with Limited Data
Proceedings of the fifth Conference on Health, Inference, and Learning, PMLR 248:446-460, 2024.
Abstract
This study assesses deep learning models for audio classification in a clinical setting under the constraint of small datasets, reflecting the prospective collection of real-world data. We analyze CNNs, including DenseNet and ConvNeXt, alongside transformer models such as ViT and SWIN, and compare them against pretrained audio models such as AST, YAMNet, and VGGish. Our approach highlights the benefits of pretraining on large datasets before fine-tuning on specific clinical data. We prospectively collected two first-of-their-kind patient audio datasets from stroke patients. We investigated various preprocessing techniques, finding that RGB and grayscale spectrogram transformations affect model performance differently depending on the priors the models learn during pretraining. Our findings indicate that CNNs can match or exceed transformer models in small-dataset contexts, with DenseNet-Contrastive and AST performing notably well. This study underscores the significance of incremental gains from model selection, pretraining, and preprocessing in sound classification, offering valuable insights for clinical diagnostics that rely on audio classification.
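To make the spectrogram-preprocessing comparison concrete, the sketch below shows one common way to turn a clinical audio clip into a log-mel spectrogram and feed it to an ImageNet-pretrained DenseNet, either as a single grayscale channel or replicated to three RGB-style channels. This is a minimal illustration, not the paper's pipeline; the library calls (torchaudio/torchvision) and parameter values such as sample_rate and n_mels are assumptions chosen for readability.

```python
# Minimal sketch: log-mel spectrogram preprocessing for a pretrained vision backbone.
# Assumes torchaudio/torchvision; parameter values are illustrative, not the paper's.
import torch
import torchaudio
import torchvision


def waveform_to_logmel(waveform: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    """Convert a mono waveform of shape (1, T) into a log-mel spectrogram (1, n_mels, frames)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=128
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)


def to_model_input(logmel: torch.Tensor, rgb: bool = True) -> torch.Tensor:
    """Resize to 224x224; replicate the channel (rgb=True) so ImageNet-pretrained weights apply directly."""
    x = torch.nn.functional.interpolate(
        logmel.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False
    )  # shape (1, 1, 224, 224)
    # With rgb=False the first conv layer of the backbone would need to accept one channel.
    return x.repeat(1, 3, 1, 1) if rgb else x


# Example: fine-tune an ImageNet-pretrained DenseNet head on a small clinical dataset.
num_classes = 2  # illustrative
model = torchvision.models.densenet121(weights="IMAGENET1K_V1")
model.classifier = torch.nn.Linear(model.classifier.in_features, num_classes)

waveform = torch.randn(1, 16_000 * 5)  # stand-in for a 5-second recording
logits = model(to_model_input(waveform_to_logmel(waveform), rgb=True))
```

Replicating the grayscale spectrogram across three channels is the simplest way to reuse RGB-pretrained weights unchanged; keeping a single channel instead requires adapting the first convolution, and the abstract's point is that this choice interacts with the priors the backbone learned during pretraining.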