Can You Hear Naples? Building and Benchmarking a Neapolitan Speech Corpus

Michael Cacioli, Liam Eggleston, Jatin Sarabu, Kevin Zhu
Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), PMLR 312:94-112, 2026.

Abstract

This paper presents the creation and analysis of the first spoken corpus for Neapolitan, a historically rich but under-resourced Romance dialect of Southern Italy. Despite its cultural importance, Neapolitan has been largely omitted from computational resources, limiting both dialectological research and the development of equitable speech technologies. We address this gap by creating the first structured spoken resource for Neapolitan, enabling systematic evaluation of dialectal ASR performance. Each clip was manually transcribed in orthographic Neapolitan and automatically aligned using OpenAI’s Whisper API, configured for standard Italian. To evaluate how well Whisper transcribed the spoken Neapolitan sentences, we compared its outputs against the human reference transcriptions using several metrics: n-gram overlap (BLEU), overall string difference (normalized Levenshtein distance), word-set overlap (Jaccard similarity), and Word Error Rate (WER), which we report as a similarity score (1 − WER) so that higher values indicate more accurate transcription. This similarity averaged only 0.1306 ($\sigma$ = 0.1654), meaning that roughly 87 percent of words were transcribed incorrectly. The other metrics told the same story: normalized Levenshtein similarity averaged 0.6360, and Jaccard similarity was just 0.1078. This paper makes three contributions: (1) a reproducible pipeline that others can follow to build comparable datasets for other dialects, (2) the first openly accessible Neapolitan speech corpus, and (3) a demonstration of how critical dialect-specific ASR training is, supporting not only computational linguistic research but also efforts to preserve these unique languages.
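The similarity scores reported above can be reproduced in a few lines. This is a minimal sketch (not the authors' evaluation code) of three of the metrics named in the abstract, computed over one reference/hypothesis pair; the function names and the example sentences are illustrative assumptions, and BLEU is omitted since it needs an n-gram library.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (a string or a word list)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(b)]

def wer_similarity(ref, hyp):
    """1 - WER: word-level edit distance normalized by reference length."""
    ref_words = ref.split()
    return 1.0 - edit_distance(ref_words, hyp.split()) / len(ref_words)

def levenshtein_similarity(ref, hyp):
    """1 - character-level edit distance, normalized by the longer string."""
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), len(hyp))

def jaccard_similarity(ref, hyp):
    """Word-set overlap: |A intersect B| / |A union B|."""
    a, b = set(ref.split()), set(hyp.split())
    return len(a & b) / len(a | b)

# Illustrative pair: a Neapolitan reference vs. an Italian-biased ASR output.
ref = "nun tengo genio e fa niente"
hyp = "non tengo voglia di fare niente"
print(round(wer_similarity(ref, hyp), 3))   # ≈ 0.333
print(round(jaccard_similarity(ref, hyp), 3))  # 0.2
```

Corpus-level figures like the paper's 0.1306 mean (1 − WER) would then be the average of these per-clip scores.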

Cite this Paper


BibTeX
@InProceedings{pmlr-v312-cacioli26a,
  title     = {Can You Hear Naples? Building and Benchmarking a Neapolitan Speech Corpus},
  author    = {Cacioli, Michael and Eggleston, Liam and Sarabu, Jatin and Zhu, Kevin},
  booktitle = {Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)},
  pages     = {94--112},
  year      = {2026},
  editor    = {Komatsu, Tatsuya and Imoto, Keisuke and Gao, Xiaoxue and Ono, Nobutaka and Chen, Nancy F.},
  volume    = {312},
  series    = {Proceedings of Machine Learning Research},
  month     = {26 Jan},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v312/main/assets/cacioli26a/cacioli26a.pdf},
  url       = {https://proceedings.mlr.press/v312/cacioli26a.html},
  abstract  = {This paper presents the creation and analysis of the first spoken corpus for Neapolitan, a richly historic but under-resourced Romance dialect of Southern Italy. Despite its cultural importance, Neapolitan has been largely omitted from computational resources, limiting both dialectological research and the development of equitable speech technologies. We address this gap by creating the first structured spoken resource for Neapolitan, enabling systematic evaluation of dialectal ASR performance. Each clip was manually transcribed in orthographic Neapolitan and automatically aligned using OpenAI’s Whisper API, configured for standard Italian. To figure out how well Whisper transcribed the spoken Neapolitan sentences, we checked the outputs against the correct human-written texts using a few different methods. Specifically, we looked at how often the words matched (BLEU), how different the transcriptions were overall (normalized Levenshtein distance), and how closely the sets of words lined up (Jaccard similarity). We also used Word Error Rate (WER), but to make it easier to interpret, we converted it to similarity by subtracting from one (1–WER). A higher value means the transcription was more accurate. On average, this similarity measure came out very low, around 0.1306 ($\sigma$ = 0.1654), meaning roughly 87 percent of the words were transcribed incorrectly. The other evaluation measures told the same story: normalized Levenshtein similarity averaged around 0.6360, and Jaccard similarity was just 0.1078. This paper makes three crucial steps: (1) developed an easy-to-follow process anyone can use to build similar datasets for other dialects, (2) released the first openly accessible Neapolitan speech corpus, and (3) demonstrated just how critical it is to build ASR systems specifically trained on dialects, supporting not just computational linguistic research but also efforts to preserve these unique languages.}
}
Endnote
%0 Conference Paper
%T Can You Hear Naples? Building and Benchmarking a Neapolitan Speech Corpus
%A Michael Cacioli
%A Liam Eggleston
%A Jatin Sarabu
%A Kevin Zhu
%B Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)
%C Proceedings of Machine Learning Research
%D 2026
%E Tatsuya Komatsu
%E Keisuke Imoto
%E Xiaoxue Gao
%E Nobutaka Ono
%E Nancy F. Chen
%F pmlr-v312-cacioli26a
%I PMLR
%P 94--112
%U https://proceedings.mlr.press/v312/cacioli26a.html
%V 312
%X This paper presents the creation and analysis of the first spoken corpus for Neapolitan, a richly historic but under-resourced Romance dialect of Southern Italy. Despite its cultural importance, Neapolitan has been largely omitted from computational resources, limiting both dialectological research and the development of equitable speech technologies. We address this gap by creating the first structured spoken resource for Neapolitan, enabling systematic evaluation of dialectal ASR performance. Each clip was manually transcribed in orthographic Neapolitan and automatically aligned using OpenAI’s Whisper API, configured for standard Italian. To figure out how well Whisper transcribed the spoken Neapolitan sentences, we checked the outputs against the correct human-written texts using a few different methods. Specifically, we looked at how often the words matched (BLEU), how different the transcriptions were overall (normalized Levenshtein distance), and how closely the sets of words lined up (Jaccard similarity). We also used Word Error Rate (WER), but to make it easier to interpret, we converted it to similarity by subtracting from one (1–WER). A higher value means the transcription was more accurate. On average, this similarity measure came out very low, around 0.1306 ($\sigma$ = 0.1654), meaning roughly 87 percent of the words were transcribed incorrectly. The other evaluation measures told the same story: normalized Levenshtein similarity averaged around 0.6360, and Jaccard similarity was just 0.1078. This paper makes three crucial steps: (1) developed an easy-to-follow process anyone can use to build similar datasets for other dialects, (2) released the first openly accessible Neapolitan speech corpus, and (3) demonstrated just how critical it is to build ASR systems specifically trained on dialects, supporting not just computational linguistic research but also efforts to preserve these unique languages.
APA
Cacioli, M., Eggleston, L., Sarabu, J., & Zhu, K. (2026). Can You Hear Naples? Building and Benchmarking a Neapolitan Speech Corpus. Proceedings of the AAAI 2026 Workshop on Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI), in Proceedings of Machine Learning Research 312:94-112. Available from https://proceedings.mlr.press/v312/cacioli26a.html.