Discriminative Topic Segmentation of Text and Speech

Mehryar Mohri; Pedro Moreno; Eugene Weinstein

Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:533-540, 2010.

Abstract

We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.

Cite this Paper

BibTeX


@InProceedings{pmlr-v9-mohri10a,
  title = 	 {Discriminative Topic Segmentation of Text and Speech},
  author = 	 {Mohri, Mehryar and Moreno, Pedro and Weinstein, Eugene},
  booktitle = 	 {Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {533--540},
  year = 	 {2010},
  editor = 	 {Teh, Yee Whye and Titterington, Mike},
  volume = 	 {9},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Chia Laguna Resort, Sardinia, Italy},
  month = 	 {13--15 May},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v9/mohri10a/mohri10a.pdf},
  url = 	 {https://proceedings.mlr.press/v9/mohri10a.html},
  abstract = 	 {We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.}
}

Endnote

%0 Conference Paper
%T Discriminative Topic Segmentation of Text and Speech
%A Mehryar Mohri
%A Pedro Moreno
%A Eugene Weinstein
%B Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2010
%E Yee Whye Teh
%E Mike Titterington
%F pmlr-v9-mohri10a
%I PMLR
%P 533--540
%U https://proceedings.mlr.press/v9/mohri10a.html
%V 9
%X We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.

RIS


TY  - CPAPER
TI  - Discriminative Topic Segmentation of Text and Speech
AU  - Mehryar Mohri
AU  - Pedro Moreno
AU  - Eugene Weinstein
BT  - Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
DA  - 2010/03/31
ED  - Yee Whye Teh
ED  - Mike Titterington
ID  - pmlr-v9-mohri10a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 9
SP  - 533
EP  - 540
L1  - http://proceedings.mlr.press/v9/mohri10a/mohri10a.pdf
UR  - https://proceedings.mlr.press/v9/mohri10a.html
AB  - We explore automated discovery of topically-coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms which employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the latter algorithm using a compact support-vector based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.
ER  -

APA


Mohri, M., Moreno, P. & Weinstein, E. (2010). Discriminative Topic Segmentation of Text and Speech. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 9:533-540. Available from https://proceedings.mlr.press/v9/mohri10a.html.
