The Shape of Words - topological structure in natural language data

Stephen Fitz
Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, PMLR 196:116-123, 2022.

Abstract

This paper presents a novel method, based on the ideas from algebraic topology, for the analysis of raw natural language text. The paper introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus. Results of experiments with a variety of natural and synthetic languages are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure. The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics. In contrast to existing topological data analysis approaches, we do not rely on the apparatus of persistent homology. Instead, we develop a method of generating topological structure directly from strings of words.

Cite this Paper


BibTeX
@InProceedings{pmlr-v196-fitz22a,
  title = {The Shape of Words - topological structure in natural language data},
  author = {Fitz, Stephen},
  booktitle = {Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022},
  pages = {116--123},
  year = {2022},
  editor = {Cloninger, Alexander and Doster, Timothy and Emerson, Tegan and Kaul, Manohar and Ktena, Ira and Kvinge, Henry and Miolane, Nina and Rieck, Bastian and Tymochko, Sarah and Wolf, Guy},
  volume = {196},
  series = {Proceedings of Machine Learning Research},
  month = {25 Feb--22 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v196/fitz22a/fitz22a.pdf},
  url = {https://proceedings.mlr.press/v196/fitz22a.html},
  abstract = {This paper presents a novel method, based on the ideas from algebraic topology, for the analysis of raw natural language text. The paper introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus. Results of experiments with a variety of natural and synthetic languages are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure. The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics. In contrast to existing topological data analysis approaches, we do not rely on the apparatus of persistent homology. Instead, we develop a method of generating topological structure directly from strings of words.}
}
Endnote
%0 Conference Paper
%T The Shape of Words - topological structure in natural language data
%A Stephen Fitz
%B Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022
%C Proceedings of Machine Learning Research
%D 2022
%E Alexander Cloninger
%E Timothy Doster
%E Tegan Emerson
%E Manohar Kaul
%E Ira Ktena
%E Henry Kvinge
%E Nina Miolane
%E Bastian Rieck
%E Sarah Tymochko
%E Guy Wolf
%F pmlr-v196-fitz22a
%I PMLR
%P 116--123
%U https://proceedings.mlr.press/v196/fitz22a.html
%V 196
%X This paper presents a novel method, based on the ideas from algebraic topology, for the analysis of raw natural language text. The paper introduces the notion of a word manifold - a simplicial complex, whose topology encodes grammatical structure expressed by the corpus. Results of experiments with a variety of natural and synthetic languages are presented, showing that the homotopy type of the word manifold is influenced by linguistic structure. The analysis includes a new approach to the Voynich Manuscript - an unsolved puzzle in corpus linguistics. In contrast to existing topological data analysis approaches, we do not rely on the apparatus of persistent homology. Instead, we develop a method of generating topological structure directly from strings of words.
APA
Fitz, S. (2022). The Shape of Words - topological structure in natural language data. Proceedings of Topological, Algebraic, and Geometric Learning Workshops 2022, in Proceedings of Machine Learning Research 196:116-123. Available from https://proceedings.mlr.press/v196/fitz22a.html.
