On-the-fly Rectification for Robust Large-Vocabulary Topic Inference

Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, David Bindel
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6087-6097, 2021.

Abstract

Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process to uphold model assumptions, becomes increasingly more vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of vocabulary and the dimension of latent space. We also present new algorithms learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.
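To make "rectifying co-occurrence" concrete: in earlier anchor-word-style spectral topic models, rectification means projecting an empirical co-occurrence matrix onto the constraints the generative model assumes (symmetric, positive semidefinite of low rank, non-negative, entries summing to one). The following is a minimal illustrative sketch of that classic alternating-projection idea, not the paper's on-the-fly compressed method; the function name `rectify` and all parameters are hypothetical.

```python
import numpy as np

def rectify(C, rank=2, iters=20):
    """Alternately project a co-occurrence matrix C onto (a) the set of
    rank-`rank` positive semidefinite matrices and (b) the set of
    non-negative matrices whose entries sum to one. Illustrative only."""
    C = (C + C.T) / 2.0                          # enforce symmetry first
    for _ in range(iters):
        # PSD projection: keep the top-`rank` non-negative eigenpairs
        vals, vecs = np.linalg.eigh(C)
        vals = np.clip(vals, 0.0, None)
        top = np.argsort(vals)[::-1][:rank]
        C = (vecs[:, top] * vals[top]) @ vecs[:, top].T
        # Normalization projection: non-negative entries summing to 1
        C = np.clip(C, 0.0, None)
        total = C.sum()
        if total > 0:
            C /= total
    return C

# A small noisy "co-occurrence" matrix over a 4-word vocabulary
rng = np.random.default_rng(0)
C = rectify(rng.random((4, 4)))
print(C.min() >= 0, abs(C.sum() - 1.0) < 1e-8)
```

The point of the paper is that for a vocabulary of size V this dense V-by-V eigendecomposition becomes prohibitively expensive, motivating methods that compress and rectify simultaneously.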

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-lee21c,
  title     = {On-the-fly Rectification for Robust Large-Vocabulary Topic Inference},
  author    = {Lee, Moontae and Cho, Sungjun and Dong, Kun and Mimno, David and Bindel, David},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {6087--6097},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/lee21c/lee21c.pdf},
  url       = {https://proceedings.mlr.press/v139/lee21c.html},
  abstract  = {Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process to uphold model assumptions, becomes increasingly more vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of vocabulary and the dimension of latent space. We also present new algorithms learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.}
}
Endnote
%0 Conference Paper
%T On-the-fly Rectification for Robust Large-Vocabulary Topic Inference
%A Moontae Lee
%A Sungjun Cho
%A Kun Dong
%A David Mimno
%A David Bindel
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-lee21c
%I PMLR
%P 6087--6097
%U https://proceedings.mlr.press/v139/lee21c.html
%V 139
%X Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process to uphold model assumptions, becomes increasingly more vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of vocabulary and the dimension of latent space. We also present new algorithms learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.
APA
Lee, M., Cho, S., Dong, K., Mimno, D. & Bindel, D. (2021). On-the-fly Rectification for Robust Large-Vocabulary Topic Inference. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6087-6097. Available from https://proceedings.mlr.press/v139/lee21c.html.