Enhancing Topic Modeling on Short Texts with Crowdsourcing

Xiaoyan Yang, Shanshan Ying, Wenzhe Yu, Rong Zhang, Zhenjie Zhang
Proceedings of The 8th Asian Conference on Machine Learning, PMLR 63:33-48, 2016.

Abstract

Topic modeling is now widely used in text archive analytics to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective at building topic models on long texts, it remains difficult to identify meaningful topics in short texts, e.g. news titles and social-media messages. With the emergence and growth of crowdsourcing platforms, it has become possible, and easier, for analytical systems to incorporate human intelligence into text analytics. Unlike traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges for the design of simple tasks that non-experts can finish in seconds. In this paper, we design a new topic modeling technique that fully exploits the basic intuitions humans bring to reading short texts. By asking human workers to subjectively judge the similarity between short text pairs, the accuracy of topic modeling algorithms can be greatly enhanced, regardless of the prior used in the graphical model. We present short text pair selection strategies designed for crowdsourcing and analyze the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the resulting topics on English tweets and Chinese microblogs while requesting only a small number of labels from the crowd.

Cite this Paper


BibTeX
@InProceedings{pmlr-v63-yang54,
  title = {Enhancing Topic Modeling on Short Texts with Crowdsourcing},
  author = {Yang, Xiaoyan and Ying, Shanshan and Yu, Wenzhe and Zhang, Rong and Zhang, Zhenjie},
  booktitle = {Proceedings of The 8th Asian Conference on Machine Learning},
  pages = {33--48},
  year = {2016},
  editor = {Durrant, Robert J. and Kim, Kee-Eung},
  volume = {63},
  series = {Proceedings of Machine Learning Research},
  address = {The University of Waikato, Hamilton, New Zealand},
  month = {16--18 Nov},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v63/yang54.pdf},
  url = {https://proceedings.mlr.press/v63/yang54.html},
  abstract = {Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics. Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.}
}
Endnote
%0 Conference Paper
%T Enhancing Topic Modeling on Short Texts with Crowdsourcing
%A Xiaoyan Yang
%A Shanshan Ying
%A Wenzhe Yu
%A Rong Zhang
%A Zhenjie Zhang
%B Proceedings of The 8th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Robert J. Durrant
%E Kee-Eung Kim
%F pmlr-v63-yang54
%I PMLR
%P 33--48
%U https://proceedings.mlr.press/v63/yang54.html
%V 63
%X Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics. Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.
RIS
TY  - CPAPER
TI  - Enhancing Topic Modeling on Short Texts with Crowdsourcing
AU  - Xiaoyan Yang
AU  - Shanshan Ying
AU  - Wenzhe Yu
AU  - Rong Zhang
AU  - Zhenjie Zhang
BT  - Proceedings of The 8th Asian Conference on Machine Learning
DA  - 2016/11/20
ED  - Robert J. Durrant
ED  - Kee-Eung Kim
ID  - pmlr-v63-yang54
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 63
SP  - 33
EP  - 48
L1  - http://proceedings.mlr.press/v63/yang54.pdf
UR  - https://proceedings.mlr.press/v63/yang54.html
AB  - Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics. Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.
ER  -
APA
Yang, X., Ying, S., Yu, W., Zhang, R. & Zhang, Z. (2016). Enhancing Topic Modeling on Short Texts with Crowdsourcing. Proceedings of The 8th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 63:33-48. Available from https://proceedings.mlr.press/v63/yang54.html.
