Two-dimensional visualization of large document libraries using t-SNE

Rita González-Márquez, Philipp Berens, Dmitry Kobak
Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, PMLR 196:133-141, 2022.

Abstract

We benchmarked different approaches for creating 2D visualizations of large document libraries, using the MEDLINE (PubMed) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
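
As a rough illustration of the first stages named in the abstract (log-scaled TF-IDF, SVD preprocessing, then t-SNE), here is a minimal Python sketch using scikit-learn. It is not the authors' code. The function name `embed_documents` and all parameter values are assumptions chosen for a toy corpus, and scikit-learn's `TSNE` uses its default perplexity-based affinities and fixed early exaggeration, so the uniform affinities, exaggeration annealing, and extended optimization described above are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def embed_documents(texts, n_svd_components=50, perplexity=30.0, seed=0):
    """Hypothetical helper: map raw texts to 2D coordinates.

    Assumed stand-in for the pipeline in the abstract, not the
    authors' implementation.
    """
    # sublinear_tf=True replaces the raw term count tf with 1 + log(tf),
    # which is one common form of log-scaled TF-IDF.
    tfidf = TfidfVectorizer(sublinear_tf=True).fit_transform(texts)

    # Keep the SVD rank below the vocabulary size so the sketch also
    # runs on very small corpora.
    k = min(n_svd_components, tfidf.shape[1] - 1)
    reduced = TruncatedSVD(n_components=k, random_state=seed).fit_transform(tfidf)

    # Standard t-SNE on the SVD-reduced features; perplexity must be
    # smaller than the number of documents.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)
```

For a small list of abstracts, a call such as `embed_documents(abstracts, n_svd_components=5, perplexity=3.0)` returns an array with one 2D point per document. Scaling this to millions of papers would need a t-SNE implementation built for that size, which this sketch does not attempt.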

Cite this Paper


BibTeX
@InProceedings{pmlr-v196-gonzalez-marquez22a,
  title = {Two-dimensional Visualization of Large Document Libraries Using {t-SNE}},
  author = {Gonz\'alez-M\'arquez, Rita and Berens, Philipp and Kobak, Dmitry},
  booktitle = {Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022},
  pages = {133--141},
  year = {2022},
  editor = {Cloninger, Alexander and Doster, Timothy and Emerson, Tegan and Kaul, Manohar and Ktena, Ira and Kvinge, Henry and Miolane, Nina and Rieck, Bastian and Tymochko, Sarah and Wolf, Guy},
  volume = {196},
  series = {Proceedings of Machine Learning Research},
  month = {25 Feb--22 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v196/gonzalez-marquez22a/gonzalez-marquez22a.pdf},
  url = {https://proceedings.mlr.press/v196/gonzalez-marquez22a.html},
  abstract = {We benchmarked different approaches for creating 2D visualizations of large document libraries, using the {MEDLINE} ({PubMed}) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled {TF-IDF} representation of the abstract text, {SVD} preprocessing, and {t-SNE} with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.}
}
Endnote
%0 Conference Paper
%T Two-dimensional visualization of large document libraries using t-SNE
%A Rita González-Márquez
%A Philipp Berens
%A Dmitry Kobak
%B Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022
%C Proceedings of Machine Learning Research
%D 2022
%E Alexander Cloninger
%E Timothy Doster
%E Tegan Emerson
%E Manohar Kaul
%E Ira Ktena
%E Henry Kvinge
%E Nina Miolane
%E Bastian Rieck
%E Sarah Tymochko
%E Guy Wolf
%F pmlr-v196-gonzalez-marquez22a
%I PMLR
%P 133--141
%U https://proceedings.mlr.press/v196/gonzalez-marquez22a.html
%V 196
%X We benchmarked different approaches for creating 2D visualizations of large document libraries, using the MEDLINE (PubMed) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
APA
González-Márquez, R., Berens, P., & Kobak, D. (2022). Two-dimensional visualization of large document libraries using t-SNE. Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, in Proceedings of Machine Learning Research 196:133-141. Available from https://proceedings.mlr.press/v196/gonzalez-marquez22a.html.
