Emerge and spread models and word burstiness

Peter Sunehag
; Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2:540-547, 2007.

Abstract

Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, seperate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a crossentropy. We will also discuss a connection between TF-IDF and file formats that seperate presence from abundance given presence.

Cite this Paper


BibTeX
@InProceedings{pmlr-v2-sunehag07a, title = {Emerge and spread models and word burstiness}, author = {Peter Sunehag}, booktitle = {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics}, pages = {540--547}, year = {2007}, editor = {Marina Meila and Xiaotong Shen}, volume = {2}, series = {Proceedings of Machine Learning Research}, address = {San Juan, Puerto Rico}, month = {21--24 Mar}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v2/sunehag07a/sunehag07a.pdf}, url = {http://proceedings.mlr.press/v2/sunehag07a.html}, abstract = {Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, seperate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a crossentropy. We will also discuss a connection between TF-IDF and file formats that seperate presence from abundance given presence.} }
Endnote
%0 Conference Paper %T Emerge and spread models and word burstiness %A Peter Sunehag %B Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2007 %E Marina Meila %E Xiaotong Shen %F pmlr-v2-sunehag07a %I PMLR %J Proceedings of Machine Learning Research %P 540--547 %U http://proceedings.mlr.press %V 2 %W PMLR %X Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, seperate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a crossentropy. We will also discuss a connection between TF-IDF and file formats that seperate presence from abundance given presence.
RIS
TY - CPAPER TI - Emerge and spread models and word burstiness AU - Peter Sunehag BT - Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics PY - 2007/03/11 DA - 2007/03/11 ED - Marina Meila ED - Xiaotong Shen ID - pmlr-v2-sunehag07a PB - PMLR SP - 540 DP - PMLR EP - 547 L1 - http://proceedings.mlr.press/v2/sunehag07a/sunehag07a.pdf UR - http://proceedings.mlr.press/v2/sunehag07a.html AB - Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article the different existing methods are compared theoretically through a unifying framework. New models that do not satisfy the exchangeability assumption but whose probability revisions only depend on the word counts of what has previously appeared, are introduced within this framework. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, seperate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a crossentropy. We will also discuss a connection between TF-IDF and file formats that seperate presence from abundance given presence. ER -
APA
Sunehag, P.. (2007). Emerge and spread models and word burstiness. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, in PMLR 2:540-547

Related Material