BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Stephan Gouws; Yoshua Bengio; Greg Corrado

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:748-756, 2015.

Abstract

We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.
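
The abstract does not spell out the cross-lingual objective, so the following is only a minimal sketch of one plausible reading: each sentence in an aligned pair is reduced to the mean of its word vectors, and the penalty is the squared Euclidean distance between the two means, so no word-level alignment is needed. The function names, the toy embeddings, and the choice of distance are assumptions for illustration, not the authors' implementation. In a full model this term would presumably be added to the two monolingual training losses and evaluated on sentence pairs sampled from the parallel corpus.

```python
# Hypothetical sketch of a bag-of-words cross-lingual penalty.
# Assumption: sentence vector = mean of word vectors; penalty = squared L2 distance.

def mean_vector(words, embeddings):
    """Average the embedding vectors of the given words (bag-of-words sentence vector)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return [t / len(words) for t in total]


def bow_crosslingual_loss(src_words, tgt_words, src_emb, tgt_emb):
    """Squared distance between the mean vectors of two aligned sentences."""
    src_mean = mean_vector(src_words, src_emb)
    tgt_mean = mean_vector(tgt_words, tgt_emb)
    return sum((a - b) ** 2 for a, b in zip(src_mean, tgt_mean))


# Toy embeddings (made up for illustration): translations share the same vectors.
en_emb = {"the": [0.0, 1.0], "cat": [1.0, 0.0]}
fr_emb = {"le": [0.0, 1.0], "chat": [1.0, 0.0]}

loss = bow_crosslingual_loss(["the", "cat"], ["le", "chat"], en_emb, fr_emb)
print(loss)
```

Because only sentence-level means are compared, the penalty is zero whenever the two bags of words average to the same point, regardless of word order or which word translates which.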

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-gouws15,
  title = 	 {BilBOWA: Fast Bilingual Distributed Representations without Word Alignments},
  author = 	 {Gouws, Stephan and Bengio, Yoshua and Corrado, Greg},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {748--756},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/gouws15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/gouws15.html},
  abstract = 	 {We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.}
}

Endnote

%0 Conference Paper
%T BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
%A Stephan Gouws
%A Yoshua Bengio
%A Greg Corrado
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-gouws15
%I PMLR
%P 748--756
%U https://proceedings.mlr.press/v37/gouws15.html
%V 37
%X We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.

RIS


TY  - CPAPER
TI  - BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
AU  - Stephan Gouws
AU  - Yoshua Bengio
AU  - Greg Corrado
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-gouws15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 748
EP  - 756
L1  - http://proceedings.mlr.press/v37/gouws15.pdf
UR  - https://proceedings.mlr.press/v37/gouws15.html
AB  - We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.
ER  -

APA


Gouws, S., Bengio, Y. & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:748-756. Available from https://proceedings.mlr.press/v37/gouws15.html.
