Enhancing Topic Modeling on Short Texts with Crowdsourcing

Xiaoyan Yang; Shanshan Ying; Wenzhe Yu; Rong Zhang; Zhenjie Zhang

Proceedings of The 8th Asian Conference on Machine Learning, PMLR 63:33-48, 2016.

Abstract

Topic modeling is now widely used in text archive analytics to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective at building topic models on long texts, it remains difficult to identify meaningful topics in short texts, e.g. news titles and social-media messages. With the emergence and growth of crowdsourcing platforms, it has become possible and easier for analytical systems to incorporate human intelligence into text analytics. Unlike traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges in designing simple tasks that non-experts can finish in seconds. In this paper, we design a new topic modeling technique that fully exploits basic human intuitions about reading short texts. By asking human workers to subjectively measure the similarity between short text pairs, the accuracy of topic modeling algorithms can be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and analyze the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the resulting topics on English tweets and Chinese microblogs while requesting only a small number of labels from the crowd.

Cite this Paper

BibTeX

@InProceedings{pmlr-v63-yang54,
  title = 	 {Enhancing Topic Modeling on Short Texts with Crowdsourcing},
  author = 	 {Yang, Xiaoyan and Ying, Shanshan and Yu, Wenzhe and Zhang, Rong and Zhang, Zhenjie},
  booktitle = 	 {Proceedings of The 8th Asian Conference on Machine Learning},
  pages = 	 {33--48},
  year = 	 {2016},
  editor = 	 {Durrant, Robert J. and Kim, Kee-Eung},
  volume = 	 {63},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {The University of Waikato, Hamilton, New Zealand},
  month = 	 {16--18 Nov},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v63/yang54.pdf},
  url = 	 {https://proceedings.mlr.press/v63/yang54.html},
  abstract = 	 {Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics.    Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.}
}

Endnote

%0 Conference Paper
%T Enhancing Topic Modeling on Short Texts with Crowdsourcing
%A Xiaoyan Yang
%A Shanshan Ying
%A Wenzhe Yu
%A Rong Zhang
%A Zhenjie Zhang
%B Proceedings of The 8th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Robert J. Durrant
%E Kee-Eung Kim
%F pmlr-v63-yang54
%I PMLR
%P 33--48
%U https://proceedings.mlr.press/v63/yang54.html
%V 63
%X Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics.    Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.

RIS

TY  - CPAPER
TI  - Enhancing Topic Modeling on Short Texts with Crowdsourcing
AU  - Xiaoyan Yang
AU  - Shanshan Ying
AU  - Wenzhe Yu
AU  - Rong Zhang
AU  - Zhenjie Zhang
BT  - Proceedings of The 8th Asian Conference on Machine Learning
DA  - 2016/11/20
ED  - Robert J. Durrant
ED  - Kee-Eung Kim
ID  - pmlr-v63-yang54
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 63
SP  - 33
EP  - 48
L1  - http://proceedings.mlr.press/v63/yang54.pdf
UR  - https://proceedings.mlr.press/v63/yang54.html
AB  - Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics.    Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.
ER  -

APA

Yang, X., Ying, S., Yu, W., Zhang, R., & Zhang, Z. (2016). Enhancing Topic Modeling on Short Texts with Crowdsourcing. Proceedings of The 8th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 63:33-48. Available from https://proceedings.mlr.press/v63/yang54.html.
