Diffusion-based Adversarial Attack to Automatic Speech Recognition

Ying Wang; Yuchuan Luo; Shaojing Fu; Zhenyu Qiu; Lin Liu

Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:889-904, 2025.

Abstract

Recent studies have exposed the substantial vulnerability of voice-activated smart devices to adversarial examples, predominantly targeting the robustness of automatic speech recognition (ASR) systems. Most adversarial examples are generated by adding adversarial perturbations within $l_p$ norm bounds to benign audio inputs. However, these attacks are constrained by the parametric bounds of the perturbations or the features of the disturbance, which limits their effectiveness. To improve the acoustic realism of adversarial examples and enhance attack performance, we propose a novel attack framework called Diffusion-based Adversarial Attack, which leverages DiffVC, a diffusion-based voice conversion model, to map audio to a latent space and employs Adversarial Latent Perturbation (ALP) to embed less perceptible and more robust perturbations. Extensive evaluations demonstrate that our method enhances targeted attack performance. Notably, the Word Error Rate (WER) increases by an average of 101 absolute points over clean speech audio and 25 absolute points over the C&W attack. The Success Rate (SR) increases by an average of 11 absolute points over the C&W attack and 16 absolute points over the SSA attack. Our approach also stands out for its high audio quality and efficiency.

Cite this Paper

BibTeX

@InProceedings{pmlr-v260-wang25f,
  title = 	 {Diffusion-based Adversarial Attack to Automatic Speech Recognition},
  author =       {Wang, Ying and Luo, Yuchuan and Fu, Shaojing and Qiu, Zhenyu and Liu, Lin},
  booktitle = 	 {Proceedings of the 16th Asian Conference on Machine Learning},
  pages = 	 {889--904},
  year = 	 {2025},
  editor = 	 {Nguyen, Vu and Lin, Hsuan-Tien},
  volume = 	 {260},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {05--08 Dec},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v260/main/assets/wang25f/wang25f.pdf},
  url = 	 {https://proceedings.mlr.press/v260/wang25f.html},
  abstract = 	 {Recent studies have exposed the substantial vulnerability of voice-activated smart devices to adversarial examples, predominantly targeting the robustness of automatic speech recognition (ASR) systems. Most adversarial examples are generated by adding adversarial perturbations within $l_p$ norm bounds to benign audio inputs. However, these attacks are constrained by the parametric bounds of the perturbations or the features of the disturbance, which limits their effectiveness. To improve the acoustic realism of adversarial examples and enhance attack performance, we propose a novel attack framework called Diffusion-based Adversarial Attack, which leverages DiffVC, a diffusion-based voice conversion model, to map audio to a latent space and employs Adversarial Latent Perturbation (ALP) to embed less perceptible and more robust perturbations. Extensive evaluations demonstrate that our method enhances targeted attack performance. Notably, the Word Error Rate (WER) increases by an average of 101 absolute points over clean speech audio and 25 absolute points over the C\&W attack. The Success Rate (SR) increases by an average of 11 absolute points over the C\&W attack and 16 absolute points over the SSA attack. Our approach also stands out for its high audio quality and efficiency.}
}

Endnote

%0 Conference Paper
%T Diffusion-based Adversarial Attack to Automatic Speech Recognition
%A Ying Wang
%A Yuchuan Luo
%A Shaojing Fu
%A Zhenyu Qiu
%A Lin Liu
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-wang25f
%I PMLR
%P 889--904
%U https://proceedings.mlr.press/v260/wang25f.html
%V 260
%X Recent studies have exposed the substantial vulnerability of voice-activated smart devices to adversarial examples, predominantly targeting the robustness of automatic speech recognition (ASR) systems. Most adversarial examples are generated by adding adversarial perturbations within $l_p$ norm bounds to benign audio inputs. However, these attacks are constrained by the parametric bounds of the perturbations or the features of the disturbance, which limits their effectiveness. To improve the acoustic realism of adversarial examples and enhance attack performance, we propose a novel attack framework called Diffusion-based Adversarial Attack, which leverages DiffVC, a diffusion-based voice conversion model, to map audio to a latent space and employs Adversarial Latent Perturbation (ALP) to embed less perceptible and more robust perturbations. Extensive evaluations demonstrate that our method enhances targeted attack performance. Notably, the Word Error Rate (WER) increases by an average of 101 absolute points over clean speech audio and 25 absolute points over the C&W attack. The Success Rate (SR) increases by an average of 11 absolute points over the C&W attack and 16 absolute points over the SSA attack. Our approach also stands out for its high audio quality and efficiency.

APA

Wang, Y., Luo, Y., Fu, S., Qiu, Z. & Liu, L. (2025). Diffusion-based Adversarial Attack to Automatic Speech Recognition. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:889-904. Available from https://proceedings.mlr.press/v260/wang25f.html.
