Distributed Representations of Sentences and Documents

Quoc Le; Tomas Mikolov

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1188-1196, 2014.

Abstract

Many machine learning algorithms require the input to be represented as a fixed length feature vector. When it comes to texts, one of the most common representations is bag-of-words. Despite their popularity, bag-of-words models have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose an unsupervised algorithm that learns vector representations of sentences and text documents. This algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that our technique outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
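The "equally distant" remark in the abstract can be checked with a small sketch. It assumes a toy three-word vocabulary and treats each word as a one-word text encoded as a bag-of-words (one-hot) vector; the helper names `one_hot` and `euclidean` are illustrative and do not come from the paper. It illustrates only the weakness the abstract describes, not the proposed algorithm.

```python
import math

# Toy vocabulary (an assumption); the three words are the abstract's example.
vocab = ["powerful", "strong", "Paris"]

def one_hot(word):
    """Bag-of-words vector of a one-word text: 1 at the word's index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

pairs = [("powerful", "strong"), ("powerful", "Paris"), ("strong", "Paris")]
distances = {p: euclidean(one_hot(p[0]), one_hot(p[1])) for p in pairs}

# Every pair differs in exactly two coordinates, so each distance is sqrt(2).
for pair, d in distances.items():
    print(pair, round(d, 4))
```

Under this encoding, "powerful" is no closer to "strong" than to "Paris", which is the semantic blindness the abstract attributes to bag-of-words models.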

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-le14,
  title = 	 {Distributed Representations of Sentences and Documents},
  author = 	 {Le, Quoc and Mikolov, Tomas},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {1188--1196},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/le14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/le14.html},
  abstract = 	 {Many machine learning algorithms require the input to be represented as a fixed length feature vector. When it comes to texts, one of the most common representations is bag-of-words. Despite their popularity, bag-of-words models have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose an unsupervised algorithm that learns vector representations of sentences and text documents. This algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that our technique outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.}
}

Endnote

%0 Conference Paper
%T Distributed Representations of Sentences and Documents
%A Quoc Le
%A Tomas Mikolov
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-le14
%I PMLR
%P 1188--1196
%U https://proceedings.mlr.press/v32/le14.html
%V 32
%N 2
%X Many machine learning algorithms require the input to be represented as a fixed length feature vector. When it comes to texts, one of the most common representations is bag-of-words. Despite their popularity, bag-of-words models have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose an unsupervised algorithm that learns vector representations of sentences and text documents. This algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that our technique outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

RIS


TY  - CPAPER
TI  - Distributed Representations of Sentences and Documents
AU  - Quoc Le
AU  - Tomas Mikolov
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-le14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 1188
EP  - 1196
L1  - http://proceedings.mlr.press/v32/le14.pdf
UR  - https://proceedings.mlr.press/v32/le14.html
AB  - Many machine learning algorithms require the input to be represented as a fixed length feature vector. When it comes to texts, one of the most common representations is bag-of-words. Despite their popularity, bag-of-words models have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose an unsupervised algorithm that learns vector representations of sentences and text documents. This algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that our technique outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
ER  -

APA


Le, Q. & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1188-1196. Available from https://proceedings.mlr.press/v32/le14.html.
