Multimodal Neural Language Models

Ryan Kiros; Ruslan Salakhutdinov; Rich Zemel

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):595-603, 2014.

Abstract

We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-kiros14,
  title = 	 {Multimodal Neural Language Models},
  author = 	 {Kiros, Ryan and Salakhutdinov, Ruslan and Zemel, Rich},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {595--603},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/kiros14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/kiros14.html},
  abstract = 	 {We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.}
}

Endnote

%0 Conference Paper
%T Multimodal Neural Language Models
%A Ryan Kiros
%A Ruslan Salakhutdinov
%A Rich Zemel
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-kiros14
%I PMLR
%P 595--603
%U https://proceedings.mlr.press/v32/kiros14.html
%V 32
%N 2
%X We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.

RIS


TY  - CPAPER
TI  - Multimodal Neural Language Models
AU  - Ryan Kiros
AU  - Ruslan Salakhutdinov
AU  - Rich Zemel
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-kiros14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 595
EP  - 603
L1  - http://proceedings.mlr.press/v32/kiros14.pdf
UR  - https://proceedings.mlr.press/v32/kiros14.html
AB  - We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.
ER  -

APA


Kiros, R., Salakhutdinov, R. & Zemel, R. (2014). Multimodal Neural Language Models. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):595-603. Available from https://proceedings.mlr.press/v32/kiros14.html.
