Model Selection for Topic Models via Spectral Decomposition

Dehua Cheng, Xinran He, Yan Liu
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR 38:183-191, 2015.

Abstract

Topic models have achieved significant success in analyzing large-scale text corpora. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following the recent advances in topic models via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. Under mild conditions, we derive upper and lower bounds on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using the Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.

Cite this Paper


BibTeX
@InProceedings{pmlr-v38-cheng15,
  title = {{Model Selection for Topic Models via Spectral Decomposition}},
  author = {Dehua Cheng and Xinran He and Yan Liu},
  booktitle = {Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics},
  pages = {183--191},
  year = {2015},
  editor = {Guy Lebanon and S. V. N. Vishwanathan},
  volume = {38},
  series = {Proceedings of Machine Learning Research},
  address = {San Diego, California, USA},
  month = {09--12 May},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v38/cheng15.pdf},
  url = {http://proceedings.mlr.press/v38/cheng15.html},
  abstract = {Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following the recent advances in topic models via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. With mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.}
}
Endnote
%0 Conference Paper
%T Model Selection for Topic Models via Spectral Decomposition
%A Dehua Cheng
%A Xinran He
%A Yan Liu
%B Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2015
%E Guy Lebanon
%E S. V. N. Vishwanathan
%F pmlr-v38-cheng15
%I PMLR
%P 183--191
%U http://proceedings.mlr.press/v38/cheng15.html
%V 38
%X Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following the recent advances in topic models via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. With mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.
RIS
TY  - CPAPER
TI  - Model Selection for Topic Models via Spectral Decomposition
AU  - Dehua Cheng
AU  - Xinran He
AU  - Yan Liu
BT  - Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics
DA  - 2015/02/21
ED  - Guy Lebanon
ED  - S. V. N. Vishwanathan
ID  - pmlr-v38-cheng15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 38
SP  - 183
EP  - 191
L1  - http://proceedings.mlr.press/v38/cheng15.pdf
UR  - http://proceedings.mlr.press/v38/cheng15.html
AB  - Topic models have achieved significant successes in analyzing large-scale text corpus. In practical applications, we are always confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following the recent advances in topic models via tensor decomposition, we make a first attempt to provide theoretical analysis on model selection in latent Dirichlet allocation. With mild conditions, we derive the upper bound and lower bound on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.
ER  -
APA
Cheng, D., He, X., & Liu, Y. (2015). Model Selection for Topic Models via Spectral Decomposition. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 38:183-191. Available from http://proceedings.mlr.press/v38/cheng15.html.
