Discriminative Topic Segmentation of Text and Speech

Mehryar Mohri, Pedro Moreno, Eugene Weinstein
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:533-540, 2010.

Abstract

We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.

Cite this Paper


BibTeX
@InProceedings{pmlr-v9-mohri10a,
  title = {Discriminative Topic Segmentation of Text and Speech},
  author = {Mohri, Mehryar and Moreno, Pedro and Weinstein, Eugene},
  booktitle = {Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
  pages = {533--540},
  year = {2010},
  editor = {Teh, Yee Whye and Titterington, Mike},
  volume = {9},
  series = {Proceedings of Machine Learning Research},
  address = {Chia Laguna Resort, Sardinia, Italy},
  month = {13--15 May},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v9/mohri10a/mohri10a.pdf},
  url = {https://proceedings.mlr.press/v9/mohri10a.html},
  abstract = {We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.}
}
Endnote
%0 Conference Paper
%T Discriminative Topic Segmentation of Text and Speech
%A Mehryar Mohri
%A Pedro Moreno
%A Eugene Weinstein
%B Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2010
%E Yee Whye Teh
%E Mike Titterington
%F pmlr-v9-mohri10a
%I PMLR
%P 533--540
%U https://proceedings.mlr.press/v9/mohri10a.html
%V 9
%X We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.
RIS
TY - CPAPER
TI - Discriminative Topic Segmentation of Text and Speech
AU - Mehryar Mohri
AU - Pedro Moreno
AU - Eugene Weinstein
BT - Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
DA - 2010/03/31
ED - Yee Whye Teh
ED - Mike Titterington
ID - pmlr-v9-mohri10a
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 9
SP - 533
EP - 540
L1 - http://proceedings.mlr.press/v9/mohri10a/mohri10a.pdf
UR - https://proceedings.mlr.press/v9/mohri10a.html
AB - We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.
ER -
APA
Mohri, M., Moreno, P. & Weinstein, E. (2010). Discriminative Topic Segmentation of Text and Speech. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 9:533-540. Available from https://proceedings.mlr.press/v9/mohri10a.html.
