Emerge and spread models and word burstiness

Peter Sunehag

Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2:540-547, 2007.

Abstract

Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption, but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, separate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We will also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.

Cite this Paper

BibTeX

@InProceedings{pmlr-v2-sunehag07a,
  title = 	 {Emerge and spread models and word burstiness},
  author = 	 {Sunehag, Peter},
  booktitle = 	 {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics},
  pages = 	 {540--547},
  year = 	 {2007},
  editor = 	 {Meila, Marina and Shen, Xiaotong},
  volume = 	 {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {San Juan, Puerto Rico},
  month = 	 {21--24 Mar},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v2/sunehag07a/sunehag07a.pdf},
  url = 	 {https://proceedings.mlr.press/v2/sunehag07a.html},
  abstract = 	 {Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption, but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, separate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We will also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.}
}

Endnote

%0 Conference Paper
%T Emerge and spread models and word burstiness
%A Peter Sunehag
%B Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2007
%E Marina Meila
%E Xiaotong Shen
%F pmlr-v2-sunehag07a
%I PMLR
%P 540--547
%U https://proceedings.mlr.press/v2/sunehag07a.html
%V 2
%X Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption, but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, separate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We will also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.

RIS

TY  - CPAPER
TI  - Emerge and spread models and word burstiness
AU  - Peter Sunehag
BT  - Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
DA  - 2007/03/11
ED  - Marina Meila
ED  - Xiaotong Shen
ID  - pmlr-v2-sunehag07a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 2
SP  - 540
EP  - 547
L1  - http://proceedings.mlr.press/v2/sunehag07a/sunehag07a.pdf
UR  - https://proceedings.mlr.press/v2/sunehag07a.html
AB  - Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption, but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, separate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We will also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.
ER  -

APA

Sunehag, P. (2007). Emerge and spread models and word burstiness. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 2:540-547. Available from https://proceedings.mlr.press/v2/sunehag07a.html.
