The Inverse Regression Topic Model

Maxim Rabinovich; David Blei

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):199-207, 2014.

Abstract

Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-rabinovich14,
  title = 	 {The Inverse Regression Topic Model},
  author = 	 {Rabinovich, Maxim and Blei, David},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {199--207},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/rabinovich14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/rabinovich14.html},
  abstract = 	 {Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.}
}

Endnote

%0 Conference Paper
%T The Inverse Regression Topic Model
%A Maxim Rabinovich
%A David Blei
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-rabinovich14
%I PMLR
%P 199--207
%U https://proceedings.mlr.press/v32/rabinovich14.html
%V 32
%N 1
%X Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.

RIS


TY  - CPAPER
TI  - The Inverse Regression Topic Model
AU  - Maxim Rabinovich
AU  - David Blei
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-rabinovich14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 199
EP  - 207
L1  - http://proceedings.mlr.press/v32/rabinovich14.pdf
UR  - https://proceedings.mlr.press/v32/rabinovich14.html
AB  - Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.
ER  -

APA


Rabinovich, M. & Blei, D. (2014). The Inverse Regression Topic Model. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):199-207. Available from https://proceedings.mlr.press/v32/rabinovich14.html.
