Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Jian Tang; Zhaoshi Meng; Xuanlong Nguyen; Qiaozhu Mei; Ming Zhang

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):190-198, 2014.

Abstract

Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA’s behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA’s performance. We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-tang14,
  title = 	 {Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis},
  author = 	 {Tang, Jian and Meng, Zhaoshi and Nguyen, Xuanlong and Mei, Qiaozhu and Zhang, Ming},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {190--198},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/tang14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/tang14.html},
  abstract = 	 {Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA’s behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA’s performance.  We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.}
}

Endnote

%0 Conference Paper
%T Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
%A Jian Tang
%A Zhaoshi Meng
%A Xuanlong Nguyen
%A Qiaozhu Mei
%A Ming Zhang
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-tang14
%I PMLR
%P 190--198
%U https://proceedings.mlr.press/v32/tang14.html
%V 32
%N 1
%X Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA’s behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA’s performance.  We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.

RIS


TY  - CPAPER
TI  - Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
AU  - Jian Tang
AU  - Zhaoshi Meng
AU  - Xuanlong Nguyen
AU  - Qiaozhu Mei
AU  - Ming Zhang
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-tang14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 190
EP  - 198
L1  - http://proceedings.mlr.press/v32/tang14.pdf
UR  - https://proceedings.mlr.press/v32/tang14.html
AB  - Topic models such as the latent Dirichlet allocation (LDA) have become a standard staple in the modeling toolbox of machine learning. They have been applied to a vast variety of data sets, contexts, and tasks to varying degrees of success. However, to date there is almost no formal theory explicating the LDA’s behavior, and despite its familiarity there is very little systematic analysis of and guidance on the properties of the data that affect the inferential performance of the model. This paper seeks to address this gap, by providing a systematic analysis of factors which characterize the LDA’s performance.  We present theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages. Based on these results we provide practical guidance on how to identify suitable data sets for topic models, and how to specify particular model parameters.
ER  -

APA


Tang, J., Meng, Z., Nguyen, X., Mei, Q. & Zhang, M. (2014). Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):190-198. Available from https://proceedings.mlr.press/v32/tang14.html.
