Two-dimensional visualization of large document libraries using t-SNE
Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, PMLR 196:133-141, 2022.
Abstract
We benchmarked different approaches for creating 2D visualizations of large document libraries, using the MEDLINE (PubMed) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
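As an illustration of the first pipeline stage only, the following is a minimal sketch of a log-scaled TF-IDF representation in plain Python. The abstract does not specify the exact weighting, so the choices here are assumptions: term frequency is taken as 1 + ln(count), inverse document frequency uses a smoothed form, rows are L2-normalized, and tokenization is a naive whitespace split. The function name `log_tfidf` and the toy documents are hypothetical. The SVD and t-SNE stages are not shown.

```python
import math
from collections import Counter


def log_tfidf(docs):
    """Return (vocabulary, rows); rows[i] maps term -> L2-normalized weight."""
    # Naive whitespace tokenization (an assumption, not the paper's method).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Log-scaled term frequency: 1 + ln(count) instead of the raw count.
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return sorted(df), rows


# Toy stand-ins for abstracts.
docs = [
    "gene gene gene expression",
    "protein expression",
    "gene protein",
]
vocab, rows = log_tfidf(docs)
```

With log scaling, a term that occurs three times in a document is weighted by 1 + ln(3), roughly 2.1 times a single occurrence rather than 3 times, which dampens the influence of heavily repeated words.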