Multimodal Neural Language Models

Ryan Kiros, Ruslan Salakhutdinov, Rich Zemel
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):595-603, 2014.

Abstract

We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, and generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.
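For readers who want a concrete picture of the conditioning the abstract describes, below is a minimal PyTorch sketch in the spirit of the paper's modality-biased log-bilinear model (MLBL-B): the predicted next-word representation is a linear function of the context word embeddings plus an additive projection of image features (e.g., activations from a convolutional network). All names, initializations, and dimensions here are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MultimodalLogBilinearLM(nn.Module):
        """Illustrative modality-biased log-bilinear language model.

        Predicts the next word from a fixed-size word context plus an
        image-feature vector; a hypothetical sketch, not the paper's code.
        """

        def __init__(self, vocab_size, embed_dim, context_size, image_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word representations R
            # One context matrix C_i per context position.
            self.context = nn.Parameter(
                0.01 * torch.randn(context_size, embed_dim, embed_dim))
            self.image_proj = nn.Linear(image_dim, embed_dim)  # projects image features
            self.bias = nn.Parameter(torch.zeros(vocab_size))  # per-word bias b

        def forward(self, context_words, image_features):
            # context_words: (batch, context_size) word indices
            # image_features: (batch, image_dim), e.g. from a conv net
            r = self.embed(context_words)                         # (batch, n, d)
            r_hat = torch.einsum('bnd,nde->be', r, self.context)  # sum_i C_i r_{w_i}
            r_hat = r_hat + self.image_proj(image_features)       # additive image bias
            # Score every vocabulary word against the predicted representation;
            # a softmax over these logits gives P(w_t | context, image).
            return r_hat @ self.embed.weight.t() + self.bias      # (batch, vocab)

    if __name__ == "__main__":
        # Invented sizes: 10k vocab, 256-d embeddings, 5-word context, 4096-d features.
        model = MultimodalLogBilinearLM(10000, 256, 5, 4096)
        logits = model(torch.randint(0, 10000, (8, 5)), torch.randn(8, 4096))
        print(logits.shape)  # torch.Size([8, 10000])

The additive bias above is the simpler of the paper's two models; the factored variant (MLBL-F) instead lets the image features modulate the word representations through a factored three-way tensor.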

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-kiros14,
  title     = {Multimodal Neural Language Models},
  author    = {Kiros, Ryan and Salakhutdinov, Ruslan and Zemel, Rich},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {595--603},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/kiros14.pdf},
  url       = {https://proceedings.mlr.press/v32/kiros14.html}
}
Endnote
%0 Conference Paper
%T Multimodal Neural Language Models
%A Ryan Kiros
%A Ruslan Salakhutdinov
%A Rich Zemel
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-kiros14
%I PMLR
%P 595--603
%U https://proceedings.mlr.press/v32/kiros14.html
%V 32
%N 2
RIS
TY - CPAPER
TI - Multimodal Neural Language Models
AU - Ryan Kiros
AU - Ruslan Salakhutdinov
AU - Rich Zemel
BT - Proceedings of the 31st International Conference on Machine Learning
DA - 2014/06/18
ED - Eric P. Xing
ED - Tony Jebara
ID - pmlr-v32-kiros14
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 32
IS - 2
SP - 595
EP - 603
L1 - http://proceedings.mlr.press/v32/kiros14.pdf
UR - https://proceedings.mlr.press/v32/kiros14.html
ER -
APA
Kiros, R., Salakhutdinov, R. & Zemel, R. (2014). Multimodal Neural Language Models. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):595-603. Available from https://proceedings.mlr.press/v32/kiros14.html.
