[edit]
The Shape of Words - topological structure in natural language data
Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, PMLR 196:116-123, 2022.
Abstract
This paper presents a novel method, based on the ideas from algebraic topology, for the analysis of raw natural language text. The paper introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus. Results of experiments with a variety of natural and synthetic languages are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure. The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics. In contrast to existing topological data analysis approaches, we do not rely on the apparatus of persistent homology. Instead, we develop a method of generating topological structure directly from strings of words.