Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis

Zhijian Ou, Yang Zhang
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:841-849, 2012.

Abstract

Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic models. A difficulty is that most features are nonlinear operators of the speech waveform, whose statistical behavior is hard to be modeled. Second, different tasks of speech analysis are carried out separately. These practices are admittedly useful but not optimal due to the incomplete use of available information. These examinations motivate us to directly model the spectrogram and to integrate together the three fundamental speech parameters - the pitch, energy and spectral envelope. We successfully devise such a model called probabilistic acoustic tube (PAT) model. The integration is performed in a principled manner with explicit physical meaning. We demonstrate the capability of PAT for a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive noise conditions, speech synthesis, and phoneme clustering.

Cite this Paper


BibTeX
@InProceedings{pmlr-v22-ou12, title = {Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis}, author = {Ou, Zhijian and Zhang, Yang}, booktitle = {Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics}, pages = {841--849}, year = {2012}, editor = {Lawrence, Neil D. and Girolami, Mark}, volume = {22}, series = {Proceedings of Machine Learning Research}, address = {La Palma, Canary Islands}, month = {21--23 Apr}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v22/ou12/ou12.pdf}, url = {https://proceedings.mlr.press/v22/ou12.html}, abstract = {Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic models. A difficulty is that most features are nonlinear operators of the speech waveform, whose statistical behavior is hard to be modeled. Second, different tasks of speech analysis are carried out separately. These practices are admittedly useful but not optimal due to the incomplete use of available information. These examinations motivate us to directly model the spectrogram and to integrate together the three fundamental speech parameters - the pitch, energy and spectral envelope. We successfully devise such a model called probabilistic acoustic tube (PAT) model. The integration is performed in a principled manner with explicit physical meaning. We demonstrate the capability of PAT for a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive noise conditions, speech synthesis, and phoneme clustering.} }
Endnote
%0 Conference Paper %T Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis %A Zhijian Ou %A Yang Zhang %B Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2012 %E Neil D. Lawrence %E Mark Girolami %F pmlr-v22-ou12 %I PMLR %P 841--849 %U https://proceedings.mlr.press/v22/ou12.html %V 22 %X Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic models. A difficulty is that most features are nonlinear operators of the speech waveform, whose statistical behavior is hard to be modeled. Second, different tasks of speech analysis are carried out separately. These practices are admittedly useful but not optimal due to the incomplete use of available information. These examinations motivate us to directly model the spectrogram and to integrate together the three fundamental speech parameters - the pitch, energy and spectral envelope. We successfully devise such a model called probabilistic acoustic tube (PAT) model. The integration is performed in a principled manner with explicit physical meaning. We demonstrate the capability of PAT for a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive noise conditions, speech synthesis, and phoneme clustering.
RIS
TY - CPAPER TI - Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis AU - Zhijian Ou AU - Yang Zhang BT - Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics DA - 2012/03/21 ED - Neil D. Lawrence ED - Mark Girolami ID - pmlr-v22-ou12 PB - PMLR DP - Proceedings of Machine Learning Research VL - 22 SP - 841 EP - 849 L1 - http://proceedings.mlr.press/v22/ou12/ou12.pdf UR - https://proceedings.mlr.press/v22/ou12.html AB - Most speech analysis/synthesis systems are based on the basic physical model of speech production - the acoustic tube model. There are two main drawbacks with current speech analysis methods. First, a common design paradigm seems to build a special-purpose signal-processing front-end followed by (when appropriate) a back-end based on probabilistic models. A difficulty is that most features are nonlinear operators of the speech waveform, whose statistical behavior is hard to be modeled. Second, different tasks of speech analysis are carried out separately. These practices are admittedly useful but not optimal due to the incomplete use of available information. These examinations motivate us to directly model the spectrogram and to integrate together the three fundamental speech parameters - the pitch, energy and spectral envelope. We successfully devise such a model called probabilistic acoustic tube (PAT) model. The integration is performed in a principled manner with explicit physical meaning. We demonstrate the capability of PAT for a number of speech analysis/synthesis tasks, such as pitch tracking under both clean and additive noise conditions, speech synthesis, and phoneme clustering. ER -
APA
Ou, Z. & Zhang, Y.. (2012). Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 22:841-849 Available from https://proceedings.mlr.press/v22/ou12.html.

Related Material