BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Stephan Gouws, Yoshua Bengio, Greg Corrado
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:748-756, 2015.

Abstract

We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v37-gouws15,
  title = {BilBOWA: Fast Bilingual Distributed Representations without Word Alignments},
  author = {Gouws, Stephan and Bengio, Yoshua and Corrado, Greg},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages = {748--756},
  year = {2015},
  editor = {Bach, Francis and Blei, David},
  volume = {37},
  series = {Proceedings of Machine Learning Research},
  address = {Lille, France},
  month = {07--09 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v37/gouws15.pdf},
  url = {https://proceedings.mlr.press/v37/gouws15.html},
  abstract = {We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.}
}
Endnote
%0 Conference Paper
%T BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
%A Stephan Gouws
%A Yoshua Bengio
%A Greg Corrado
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-gouws15
%I PMLR
%P 748--756
%U https://proceedings.mlr.press/v37/gouws15.html
%V 37
%X We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.
RIS
TY  - CPAPER
TI  - BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
AU  - Stephan Gouws
AU  - Yoshua Bengio
AU  - Greg Corrado
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-gouws15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 748
EP  - 756
L1  - http://proceedings.mlr.press/v37/gouws15.pdf
UR  - https://proceedings.mlr.press/v37/gouws15.html
AB  - We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead, it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-aligned data. This is achieved using a novel sampled bag-of-words cross-lingual objective, which is used to regularize two noise-contrastive language models for efficient cross-lingual feature learning. We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on the WMT11 data.
ER  -
APA
Gouws, S., Bengio, Y. & Corrado, G. (2015). BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:748-756. Available from https://proceedings.mlr.press/v37/gouws15.html.
