Enhancing Topic Models by Incorporating Explicit and Implicit External Knowledge

Yang Hong, Xinhuai Tang, Tiancheng Tang, Yunlong Hu, Jintai Tian
Proceedings of The 12th Asian Conference on Machine Learning, PMLR 129:353-368, 2020.

Abstract

Topic models are widely used for extracting latent features from documents. Conventional count-based models like LDA focus on word co-occurrence and neglect features such as semantics and lexical relations in the corpora. To overcome this drawback, many knowledge-enhanced models have been proposed, attempting to achieve better topic coherence with external knowledge. In this paper, we present novel probabilistic topic models that utilize both explicit and implicit forms of knowledge. Knowledge of real-world entities in a knowledge base/graph is referred to as explicit knowledge. We incorporate this form of knowledge into our models by entity linking, a technique for bridging the gap between corpora and knowledge bases, which helps solve the problem of token/phrase-level synonymy and polysemy. Apart from explicit knowledge, we utilize latent feature word representations (implicit knowledge) to further capture lexical relations in pretraining corpora. Qualitative and quantitative evaluations are conducted on two datasets with five baselines (three probabilistic models and two neural models). Our models exhibit high potential in generating coherent topics. Remarkably, when adopting both explicit and implicit knowledge, our proposed model even outperforms two state-of-the-art neural topic models, suggesting that knowledge enhancement can substantially improve the performance of conventional topic models.
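To make the two knowledge forms concrete, the sketch below illustrates the general idea rather than the authors' implementation: surface phrases are canonicalized to entity IDs before counting, standing in for a real entity linker against a knowledge base, and a plain count-based LDA model is then fit over the linked tokens with gensim. The ENTITY_MAP table and the toy documents are hypothetical placeholders.

    # Illustrative sketch only -- not the paper's model. A real system would run
    # an entity linker against a knowledge base; a toy lookup table stands in here.
    from gensim import corpora, models

    # Hypothetical mapping from surface phrases to canonical entity IDs,
    # collapsing synonyms into one token (the "explicit knowledge" side).
    ENTITY_MAP = {
        "nyc": "ENT:New_York_City",
        "new york city": "ENT:New_York_City",
        "big apple": "ENT:New_York_City",
    }

    def link_entities(text):
        """Replace known surface phrases with entity IDs, then tokenize."""
        text = text.lower()
        for surface, entity in ENTITY_MAP.items():
            text = text.replace(surface, entity)
        return text.split()

    docs = [
        "NYC subway delays frustrate commuters",
        "Tourists flock to the Big Apple in summer",
        "New York City announces a new transit budget",
    ]
    tokenized = [link_entities(d) for d in docs]

    dictionary = corpora.Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]

    # Plain count-based LDA over the entity-linked corpus; with entity linking,
    # "NYC", "Big Apple", and "New York City" now count as the same token.
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                          passes=10, random_state=0)
    for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
        print(topic_id, [w for w, _ in words])

For the implicit side, latent-feature topic models in the literature (e.g., LF-LDA of Nguyen et al., 2015) mix the usual Dirichlet-multinomial topic-word component with a component defined over pretrained word vectors, so that embedding similarity learned from large pretraining corpora smooths the topic-word distributions.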

Cite this Paper


BibTeX
@InProceedings{pmlr-v129-hong20a,
  title     = {Enhancing Topic Models by Incorporating Explicit and Implicit External Knowledge},
  author    = {Hong, Yang and Tang, Xinhuai and Tang, Tiancheng and Hu, Yunlong and Tian, Jintai},
  booktitle = {Proceedings of The 12th Asian Conference on Machine Learning},
  pages     = {353--368},
  year      = {2020},
  editor    = {Pan, Sinno Jialin and Sugiyama, Masashi},
  volume    = {129},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--20 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v129/hong20a/hong20a.pdf},
  url       = {https://proceedings.mlr.press/v129/hong20a.html},
  abstract  = {Topic models are widely used for extracting latent features from documents. Conventional count-based models like LDA focus on word co-occurrence and neglect features such as semantics and lexical relations in the corpora. To overcome this drawback, many knowledge-enhanced models have been proposed, attempting to achieve better topic coherence with external knowledge. In this paper, we present novel probabilistic topic models that utilize both explicit and implicit forms of knowledge. Knowledge of real-world entities in a knowledge base/graph is referred to as explicit knowledge. We incorporate this form of knowledge into our models by entity linking, a technique for bridging the gap between corpora and knowledge bases, which helps solve the problem of token/phrase-level synonymy and polysemy. Apart from explicit knowledge, we utilize latent feature word representations (implicit knowledge) to further capture lexical relations in pretraining corpora. Qualitative and quantitative evaluations are conducted on two datasets with five baselines (three probabilistic models and two neural models). Our models exhibit high potential in generating coherent topics. Remarkably, when adopting both explicit and implicit knowledge, our proposed model even outperforms two state-of-the-art neural topic models, suggesting that knowledge enhancement can substantially improve the performance of conventional topic models.}
}
EndNote
%0 Conference Paper
%T Enhancing Topic Models by Incorporating Explicit and Implicit External Knowledge
%A Yang Hong
%A Xinhuai Tang
%A Tiancheng Tang
%A Yunlong Hu
%A Jintai Tian
%B Proceedings of The 12th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Sinno Jialin Pan
%E Masashi Sugiyama
%F pmlr-v129-hong20a
%I PMLR
%P 353--368
%U https://proceedings.mlr.press/v129/hong20a.html
%V 129
%X Topic models are widely used for extracting latent features from documents. Conventional count-based models like LDA focus on word co-occurrence and neglect features such as semantics and lexical relations in the corpora. To overcome this drawback, many knowledge-enhanced models have been proposed, attempting to achieve better topic coherence with external knowledge. In this paper, we present novel probabilistic topic models that utilize both explicit and implicit forms of knowledge. Knowledge of real-world entities in a knowledge base/graph is referred to as explicit knowledge. We incorporate this form of knowledge into our models by entity linking, a technique for bridging the gap between corpora and knowledge bases, which helps solve the problem of token/phrase-level synonymy and polysemy. Apart from explicit knowledge, we utilize latent feature word representations (implicit knowledge) to further capture lexical relations in pretraining corpora. Qualitative and quantitative evaluations are conducted on two datasets with five baselines (three probabilistic models and two neural models). Our models exhibit high potential in generating coherent topics. Remarkably, when adopting both explicit and implicit knowledge, our proposed model even outperforms two state-of-the-art neural topic models, suggesting that knowledge enhancement can substantially improve the performance of conventional topic models.
APA
Hong, Y., Tang, X., Tang, T., Hu, Y. & Tian, J. (2020). Enhancing Topic Models by Incorporating Explicit and Implicit External Knowledge. Proceedings of The 12th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 129:353-368. Available from https://proceedings.mlr.press/v129/hong20a.html.
