From Word Embeddings To Document Distances

Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:957-966, 2015.

Abstract

We present the Word Mover’s Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover’s Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straightforward to implement. Further, we demonstrate on eight real-world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedentedly low k-nearest neighbor document classification error rates.
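To illustrate the idea, here is a minimal sketch of WMD for the special case of two documents with the same number of distinct words, each word carrying uniform mass 1/n. In that case the optimal transport plan is a permutation, which this toy code finds by brute force over matchings; a real implementation instead solves the general Earth Mover's Distance linear program with an efficient solver, as the paper describes. The function name, toy 2-d "embeddings", and uniform-weight assumption are all illustrative, not from the paper.

```python
# Toy Word Mover's Distance: uniform weights, equal-size documents.
# In this special case EMD reduces to a minimum-cost assignment, found
# here by brute force (fine for tiny n; real solvers scale much better).
from itertools import permutations
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def wmd_uniform(doc_a, doc_b):
    """doc_a, doc_b: equal-length lists of word-embedding vectors."""
    n = len(doc_a)
    assert n == len(doc_b), "this sketch assumes equal word counts"
    # Each word carries mass 1/n; minimise total travel cost over all
    # one-to-one matchings of doc_a's words to doc_b's words.
    best = min(
        sum(euclidean(doc_a[i], doc_b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n

# Two tiny "documents" of two embedded words each in 2-d space.
doc_a = [(0.0, 0.0), (1.0, 1.0)]
doc_b = [(0.0, 1.0), (1.0, 0.0)]
print(wmd_uniform(doc_a, doc_b))  # 1.0
```

In practice, documents have different lengths and words are weighted by their (normalized) bag-of-words counts, so the full transportation problem must be solved rather than an assignment.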

Cite this Paper


BibTeX
@InProceedings{pmlr-v37-kusnerb15,
  title     = {From Word Embeddings To Document Distances},
  author    = {Kusner, Matt and Sun, Yu and Kolkin, Nicholas and Weinberger, Kilian},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages     = {957--966},
  year      = {2015},
  editor    = {Bach, Francis and Blei, David},
  volume    = {37},
  series    = {Proceedings of Machine Learning Research},
  address   = {Lille, France},
  month     = {07--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v37/kusnerb15.pdf},
  url       = {http://proceedings.mlr.press/v37/kusnerb15.html},
  abstract  = {We present the Word Mover’s Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. We show that this distance metric can be cast as an instance of the Earth Mover’s Distance, a well-studied transportation problem for which several highly efficient solvers have been developed. Our metric has no hyperparameters and is straightforward to implement. Further, we demonstrate on eight real-world document classification data sets, in comparison with seven state-of-the-art baselines, that the WMD metric leads to unprecedentedly low k-nearest neighbor document classification error rates.}
}
