Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations

Ting Chen, Martin Renqiang Min, Yizhou Sun
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:854-863, 2018.

Abstract

Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation based on a “one-hot” encoding of the discrete symbols. Despite its simplicity, such approach yields the number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work, we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot" encoding. In the proposed “KD encoding”, each symbol is represented by a $D$-dimensional code with a cardinality of $K$, and the final symbol embedding vector is generated by composing the code embedding vectors. To end-to-end learn semantically meaningful codes, we derive a relaxed discrete optimization approach based on stochastic gradient descent, which can be generally applied to any differentiable computational graph with an embedding layer. In our experiments with various applications from natural language processing to graph convolutional networks, the total size of the embedding layer can be reduced up to 98% while achieving similar or better performance.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-chen18g, title = {Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations}, author = {Chen, Ting and Min, Martin Renqiang and Sun, Yizhou}, booktitle = {Proceedings of the 35th International Conference on Machine Learning}, pages = {854--863}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80}, series = {Proceedings of Machine Learning Research}, month = {10--15 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v80/chen18g/chen18g.pdf}, url = {http://proceedings.mlr.press/v80/chen18g.html}, abstract = {Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation based on a “one-hot” encoding of the discrete symbols. Despite its simplicity, such approach yields the number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work, we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot" encoding. In the proposed “KD encoding”, each symbol is represented by a $D$-dimensional code with a cardinality of $K$, and the final symbol embedding vector is generated by composing the code embedding vectors. To end-to-end learn semantically meaningful codes, we derive a relaxed discrete optimization approach based on stochastic gradient descent, which can be generally applied to any differentiable computational graph with an embedding layer. In our experiments with various applications from natural language processing to graph convolutional networks, the total size of the embedding layer can be reduced up to 98% while achieving similar or better performance.} }
Endnote
%0 Conference Paper %T Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations %A Ting Chen %A Martin Renqiang Min %A Yizhou Sun %B Proceedings of the 35th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2018 %E Jennifer Dy %E Andreas Krause %F pmlr-v80-chen18g %I PMLR %P 854--863 %U http://proceedings.mlr.press/v80/chen18g.html %V 80 %X Conventional embedding methods directly associate each symbol with a continuous embedding vector, which is equivalent to applying a linear transformation based on a “one-hot” encoding of the discrete symbols. Despite its simplicity, such approach yields the number of parameters that grows linearly with the vocabulary size and can lead to overfitting. In this work, we propose a much more compact K-way D-dimensional discrete encoding scheme to replace the “one-hot" encoding. In the proposed “KD encoding”, each symbol is represented by a $D$-dimensional code with a cardinality of $K$, and the final symbol embedding vector is generated by composing the code embedding vectors. To end-to-end learn semantically meaningful codes, we derive a relaxed discrete optimization approach based on stochastic gradient descent, which can be generally applied to any differentiable computational graph with an embedding layer. In our experiments with various applications from natural language processing to graph convolutional networks, the total size of the embedding layer can be reduced up to 98% while achieving similar or better performance.
APA
Chen, T., Min, M.R. & Sun, Y.. (2018). Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:854-863 Available from http://proceedings.mlr.press/v80/chen18g.html.

Related Material