A Variational Approximation for Topic Modeling of Hierarchical Corpora

Do-kyum Kim; Geoffrey Voelker; Lawrence Saul

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(2):55-63, 2013.

Abstract

We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that there exists a simple variational approximation for probabilistic inference. The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy. We compare our approach to existing implementations of nonparametric HDPs. On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods. Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security–one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-kim13,
  title     = {A Variational Approximation for Topic Modeling of Hierarchical Corpora},
  author    = {Kim, Do-kyum and Voelker, Geoffrey and Saul, Lawrence},
  booktitle = {Proceedings of the 30th International Conference on Machine Learning},
  pages     = {55--63},
  year      = {2013},
  editor    = {Dasgupta, Sanjoy and McAllester, David},
  volume    = {28},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Atlanta, Georgia, USA},
  month     = {17--19 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v28/kim13.pdf},
  url       = {https://proceedings.mlr.press/v28/kim13.html},
  abstract = 	 {We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy.  We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation.  The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs).  For these models we show that there exists a simple variational approximation for probabilistic inference.  The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy.  We compare our approach to existing implementations of nonparametric HDPs.  On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods.  Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security–one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.}
}

Endnote

%0 Conference Paper
%T A Variational Approximation for Topic Modeling of Hierarchical Corpora
%A Do-kyum Kim
%A Geoffrey Voelker
%A Lawrence Saul
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-kim13
%I PMLR
%P 55--63
%U https://proceedings.mlr.press/v28/kim13.html
%V 28
%N 2
%X We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy.  We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation.  The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs).  For these models we show that there exists a simple variational approximation for probabilistic inference.  The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy.  We compare our approach to existing implementations of nonparametric HDPs.  On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods.  Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security–one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.

RIS


TY  - CPAPER
TI  - A Variational Approximation for Topic Modeling of Hierarchical Corpora
AU  - Do-kyum Kim
AU  - Geoffrey Voelker
AU  - Lawrence Saul
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/13
ED  - Sanjoy Dasgupta
ED  - David McAllester
ID  - pmlr-v28-kim13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 2
SP  - 55
EP  - 63
L1  - http://proceedings.mlr.press/v28/kim13.pdf
UR  - https://proceedings.mlr.press/v28/kim13.html
AB  - We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy.  We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation.  The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs).  For these models we show that there exists a simple variational approximation for probabilistic inference.  The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy.  We compare our approach to existing implementations of nonparametric HDPs.  On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods.  Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security–one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.
ER  -

APA


Kim, D., Voelker, G. & Saul, L. (2013). A Variational Approximation for Topic Modeling of Hierarchical Corpora. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(2):55-63. Available from https://proceedings.mlr.press/v28/kim13.html.
